AI Agent Evaluation: The Complete Process We Run on Every Production Agent

Building Evaluation Datasets for AI Agents

How to build evaluation datasets for production AI agents: examples, reference outputs, and why your dataset should come from real production traces, not imagination. The dataset chapter of our AI agent evaluation process.

A dataset is the foundation of every evaluation. Get it wrong and nothing downstream matters — your evaluators will score the wrong things, your experiments will produce confident wrong numbers, and your agent will pass tests it should fail. This chapter covers what a dataset is, where it should come from, and the mistake that breaks 80% of the eval practices we audit.

What a dataset is

A dataset is a collection of test cases. Each test case is called an example, and an example has:

  • Input — what gets sent to your agent (the user message, the context, any parameters)
  • Reference output — for offline evaluation, the known-correct or known-good answer to compare against. (Online evaluation has no reference output, which is a key distinction — covered in the online monitoring chapter.)
  • Metadata — optional tags: user type, difficulty, failure category, source. This lets you slice results later ("how does the agent do on refund questions specifically?").

That is the whole structure. The complexity is not in the format. It is in where the examples come from.
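The three-part structure above can be sketched as a small record type. This is a minimal illustration, not a standard schema: the class name, the field names, the `slice_by` helper, and the two sample examples are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Example:
    """One test case. Field names are illustrative, not a standard schema."""
    input: dict[str, Any]                    # what gets sent to the agent
    reference_output: Optional[str] = None   # known-good answer (offline evaluation only)
    metadata: dict[str, str] = field(default_factory=dict)  # optional tags for slicing


def slice_by(examples: list[Example], key: str, value: str) -> list[Example]:
    """Return only the examples whose metadata tag `key` equals `value`."""
    return [ex for ex in examples if ex.metadata.get(key) == value]


# Invented sample data, for illustration only.
dataset = [
    Example(
        input={"message": "where is my refund?? ordered 2 weeks ago"},
        reference_output="Look up the order, state the refund status, give a timeline.",
        metadata={"category": "refund", "source": "production"},
    ),
    Example(
        input={"message": "How do I change my shipping address?"},
        reference_output="Explain where to edit the address before dispatch.",
        metadata={"category": "shipping", "source": "manual"},
    ),
]

# "How does the agent do on refund questions specifically?"
refund_cases = slice_by(dataset, "category", "refund")
```

Keeping metadata as free-form tags means you can add a new slicing dimension later without changing the record type.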

The three sources of examples

There are exactly three places a dataset comes from, and they are not equally valuable.

1. Production traces (highest value)

You sample real interactions from your live agent and turn them into examples. This is the gold standard because the inputs are real — the malformed questions, the multi-part requests, the off-topic detours, the emotional edge cases your users actually send.

Most teams skip this because it requires having production traffic and the instrumentation to capture it. But it is the single most important source. An evaluation dataset built from production traces predicts production behavior. One built from anything else does not.

2. Manual curation (medium value)

You write examples by hand, usually to cover specific cases you know matter — known edge cases, compliance scenarios, adversarial inputs, the "this must never happen" cases. Manual curation is valuable for coverage of known risks, but it cannot anticipate the failures you have not imagined yet.

Use manual curation to encode the failures you already know about. Use production traces to discover the ones you do not.

3. Synthetic generation (variable value)

You use an LLM to generate examples — paraphrases of existing inputs, variations on a theme, adversarial cases. Synthetic data is useful for scaling coverage once you have a real seed set, and for stress-testing specific behaviors. It is dangerous as a primary source because synthetic inputs reflect what a model thinks users do, which is exactly the wrong distribution.

The right recipe: seed from production traces, cover known risks with manual curation, scale with synthetic generation. In that order.
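One way to keep that order visible in the data itself is to tag every example with its source and let earlier sources win when inputs collide. The sketch below is an assumption about how you might do this, not a prescribed method: `normalize`, `assemble`, and `composition` are hypothetical helpers, and the duplicate check is deliberately crude (case and whitespace only).

```python
from collections import Counter


def normalize(text: str) -> str:
    """Crude duplicate key: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def assemble(production: list[str], manual: list[str], synthetic: list[str]) -> list[dict]:
    """Merge the three sources in priority order, tagging each example.

    When two sources contain the same normalized input, the earlier
    source (production first, synthetic last) keeps the slot.
    """
    seen: set[str] = set()
    merged: list[dict] = []
    ordered_sources = (
        ("production", production),
        ("manual", manual),
        ("synthetic", synthetic),
    )
    for source, inputs in ordered_sources:
        for text in inputs:
            key = normalize(text)
            if key in seen:
                continue
            seen.add(key)
            merged.append({"input": text, "source": source})
    return merged


def composition(dataset: list[dict]) -> dict[str, float]:
    """Share of the dataset contributed by each source."""
    counts = Counter(ex["source"] for ex in dataset)
    return {source: n / len(dataset) for source, n in counts.items()}
```

Checking `composition` before each run is a cheap way to notice when synthetic examples have quietly become the majority of the dataset.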

The mistake that breaks most eval practices

Here is the failure we find most often when we audit a production agent:

The team built their dataset by imagining how users would behave. The dataset passes. Production fails. They cannot understand why, because their eval is green.

This is the imagined-dataset trap. The team's examples are all well-formed, on-topic, single-intent questions — because that is how a human imagines a user behaves when they sit down to write test cases. Real users send malformed, multi-part, emotionally loaded, ambiguous, off-topic inputs. The agent was never evaluated against any of those. So the eval score is meaningless.

The diagnostic: pull 50 random transcripts from your last week of production. Compare them to your eval dataset's examples. If your real traffic looks nothing like your test cases, your dataset is imagined, and your eval scores do not predict anything.
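The diagnostic is mainly a reading exercise, but a rough numeric comparison can back it up. The sketch below samples transcripts at random and computes two surface statistics for each set. The choice of statistics (median word count, share of inputs with more than one question mark) is an assumption made for illustration; they are weak proxies for "malformed" and "multi-part", and no substitute for reading the transcripts side by side.

```python
import random
import statistics


def sample_transcripts(transcripts: list[str], k: int = 50, seed: int = 0) -> list[str]:
    """Draw up to k transcripts uniformly at random (seeded so the draw is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(transcripts, min(k, len(transcripts)))


def surface_profile(inputs: list[str]) -> dict[str, float]:
    """Two crude surface statistics over a non-empty list of user inputs."""
    if not inputs:
        raise ValueError("need at least one input")
    word_counts = [len(text.split()) for text in inputs]
    multi_question = sum(1 for text in inputs if text.count("?") > 1)
    return {
        "median_words": statistics.median(word_counts),
        "share_multi_question": multi_question / len(inputs),
    }


def compare(production_inputs: list[str], dataset_inputs: list[str]) -> dict[str, dict[str, float]]:
    """Profile a random production sample next to the eval dataset's inputs."""
    return {
        "production": surface_profile(sample_transcripts(production_inputs)),
        "dataset": surface_profile(dataset_inputs),
    }
```

If the two profiles diverge sharply, that is a signal to go and read the sampled transcripts, not a verdict on its own.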

The fix: build the dataset from production traces. (If you have no eval dataset at all, which is the common case, this is also how you build your first one.) Sample at least 50 real interactions, and preferably 100 or more, stratified across user types and outcomes (resolved, wrong, escalated, harmful). Label each one with its outcome. That becomes your dataset. If you already had a dataset built from imagination, expect your pass rate to drop against the real data. That drop is the truth the imagined dataset was hiding.
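Stratified sampling here means grouping traces by user type and outcome, then drawing a fixed number from each group so that rare outcomes are not crowded out by the common ones. A minimal sketch, assuming each trace is a dict with `user_type` and `outcome` keys (hypothetical field names; your trace store will differ):

```python
import random
from collections import defaultdict


def stratified_sample(
    traces: list[dict],
    strata_keys: tuple[str, ...],
    per_stratum: int,
    seed: int = 0,
) -> list[dict]:
    """Draw up to `per_stratum` traces from every combination of the strata keys."""
    rng = random.Random(seed)
    buckets: dict[tuple, list[dict]] = defaultdict(list)
    for trace in traces:
        buckets[tuple(trace[key] for key in strata_keys)].append(trace)

    sample: list[dict] = []
    for stratum in sorted(buckets):  # sorted so the result is repeatable
        group = buckets[stratum]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Invented traces: 2 user types x 4 outcomes x 10 traces each.
traces = [
    {"id": f"{user_type}-{outcome}-{i}", "user_type": user_type, "outcome": outcome}
    for user_type in ("free", "paid")
    for outcome in ("resolved", "wrong", "escalated", "harmful")
    for i in range(10)
]

picked = stratified_sample(traces, ("user_type", "outcome"), per_stratum=7)
```

With 8 strata and 7 traces per stratum, this draws 56 traces, which sits inside the range suggested above. Note that equal counts per stratum deliberately over-represent rare outcomes, so the overall pass rate on this sample is not an estimate of your production pass rate.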

How big should a dataset be?

There is no magic number, but practical guidance from production work:

  • Minimum viable: 50 examples. Below this, scores are too noisy for version-over-version comparisons to be trusted.
  • Solid: 200-500 examples, stratified across the categories that matter for your agent.
  • Mature: 1,000+, with subsets tagged by failure category so you can run targeted evals ("just the tool-calling cases").
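The noise behind these thresholds can be made concrete with a back-of-the-envelope calculation. The sketch below treats each example as an independent pass/fail trial and uses the normal approximation to the binomial. Both are simplifying assumptions, and the 80% pass rate is an arbitrary illustrative value.

```python
import math


def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a pass rate measured over n examples.

    Uses the normal approximation: z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


for n in (50, 200, 500, 1000):
    margin = pass_rate_margin(0.8, n)
    print(f"n={n:>4}  pass rate 80% +/- {margin * 100:.1f} points")
```

Under these assumptions, an 80% pass rate on 50 examples carries a margin of roughly 11 points, so a 3-point change between versions is indistinguishable from noise. At 500 examples the margin shrinks to about 3.5 points. This covers sampling noise only; it says nothing about whether the examples resemble real traffic.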

Bigger is not automatically better. A 100-example dataset built from real production traces beats a 5,000-example dataset of synthetic paraphrases every time. Quality of distribution beats quantity.

Keeping datasets alive

A dataset is not a one-time artifact. It decays as your agent and your users evolve. The teams whose agents stay reliable treat the dataset as a living asset:

  • Every production failure caught by online monitoring becomes a new example.
  • Every new feature ships with new examples covering it.
  • Stale examples (testing behavior you have deprecated) get pruned.

This is the feedback loop from the main evaluation guide: production failures flow back into the dataset, so the next version is tested against the exact thing that hurt you last time.
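The maintenance loop above reduces to two operations: append an example when production fails, and drop examples that test deprecated behavior. A minimal sketch, assuming examples are plain dicts and that each one carries a `feature` tag in its metadata (both are assumptions for illustration, as are the helper names):

```python
def add_failure(
    dataset: list[dict],
    trace_input: str,
    reference_output: str,
    failure_category: str,
) -> None:
    """Turn a production failure into a new example, tagged by failure category."""
    dataset.append(
        {
            "input": trace_input,
            "reference_output": reference_output,
            "metadata": {"source": "production", "failure_category": failure_category},
        }
    )


def prune_deprecated(dataset: list[dict], deprecated_features: set[str]) -> list[dict]:
    """Return the dataset without examples that test deprecated features."""
    return [
        ex for ex in dataset
        if ex["metadata"].get("feature") not in deprecated_features
    ]


# Invented sample data, for illustration only.
dataset = [
    {"input": "export my data as fax", "reference_output": "...", "metadata": {"feature": "fax_export"}},
    {"input": "cancel my order", "reference_output": "...", "metadata": {"feature": "orders"}},
]

add_failure(
    dataset,
    trace_input="cancel order AND refund, also change my email",
    reference_output="Handle all three requests, in order.",
    failure_category="multi_intent",
)
dataset = prune_deprecated(dataset, deprecated_features={"fax_export"})
```

Tagging new examples with a failure category at the moment they are added is what later makes targeted runs possible.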

What we do in an audit

When we audit your agent, the dataset is the first thing we rebuild. We pull your production traces, sample a stratified set, label the outcomes, and run your existing evaluators against this real dataset. The gap between your internal eval score and the score against the real dataset is usually the first hard truth of the engagement — and the first thing we fix.

Next chapter: AI agent evaluators — how to score outputs.

This is part of how we fix production AI agents. Book a free 30-minute call and we will scope yours, then commit to a result and work until we hit it. No pitch deck.