
AI Agent Evaluation: The Complete Process We Run on Every Production Agent

The end-to-end process for evaluating production AI agents — datasets, evaluators, offline evals, online monitoring, and the metrics that matter. Vendor-neutral, framework-grounded, and the exact process we run on every fixmyagent.agency engagement.

Most teams ship an AI agent with no evaluation layer at all. It worked in a demo, so it went to production. The gap between "worked in the demo" and "works in production" is almost always an evaluation gap — there was never a systematic way to measure what the agent actually does on real traffic, so failures only surface when a customer hits one.

This guide is the complete evaluation process we run on every production AI agent we audit. It is vendor-neutral — the concepts apply whether you use LangSmith, Braintrust, Langfuse, Arize, or a homegrown harness. It is also the literal process behind our 8-Pillar Reliability Framework: three of the eight pillars (evals, observability, grounding) are built on what follows.

Read it as a buyer to understand what good evaluation looks like before you hire anyone. Read it as an engineer to set up your own pipeline. Either way, this is the canonical version, written from production audits, not theory.

The mental model: two modes, one loop

Every serious evaluation practice has exactly two modes, and they form a single continuous loop:

  1. Offline evaluation — "test before you ship." You evaluate your agent against a curated dataset during development, so you can compare versions, catch regressions, and know whether a change made things better or worse before it reaches users.

  2. Online evaluation — "monitor in production." You evaluate real user interactions on live traffic, in real time, so you detect issues, measure quality on the inputs your users actually send, and catch silent degradation before customers do.

The loop closes when failing production traces flow back into your offline dataset. A real failure your online monitoring caught becomes a new test case in your offline suite, so the next version of the agent is evaluated against the exact failure that hurt you. This feedback loop is the difference between an agent that gets more reliable over time and one that decays.

   ┌─────────────────── offline evaluation ───────────────────┐
   │  datasets → evaluators → experiments → analysis           │
   └───────────────────────────┬───────────────────────────────┘
                               │ ship the version that passed
                               ▼
   ┌─────────────────── online evaluation ───────────────────┐
   │  production runs → evaluators → monitoring → alerts       │
   └───────────────────────────┬───────────────────────────────┘
                               │ failing traces become new test cases
                               └──────────────► back to datasets
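
The arrow back to datasets is the step teams most often skip, so here is a minimal sketch of it in Python. It assumes each production run already carries scores from your online evaluators. The names Run, Example, promote_failures, and the 0.5 threshold are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    """One production trace: what the user sent and what the agent returned."""
    run_id: str
    user_input: str
    agent_output: str
    scores: dict = field(default_factory=dict)  # online evaluator name -> score in [0, 1]


@dataclass
class Example:
    """One offline test case. reference_output stays None until someone writes it."""
    input: str
    reference_output: Optional[str]
    source_run_id: str


def promote_failures(runs, dataset, threshold=0.5):
    """Append an Example for each run with any score below threshold.

    Runs already present in the dataset (matched by run id) are skipped,
    so the function is safe to call repeatedly on overlapping batches.
    """
    seen = {example.source_run_id for example in dataset}
    added = []
    for run in runs:
        failed = any(score < threshold for score in run.scores.values())
        if failed and run.run_id not in seen:
            example = Example(run.user_input, None, run.run_id)
            dataset.append(example)
            seen.add(run.run_id)
            added.append(example)
    return added
```

The reference output is deliberately left empty: a failing trace tells you the input that broke the agent, not the correct answer, so a person still has to supply that before the example is useful offline.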

Everything else in this guide is the detail of how each box works.

The five components of an evaluation pipeline

Across both modes, an evaluation pipeline is built from five components. Each has its own chapter in this guide:

1. Datasets — what you test against

A dataset is a collection of test cases (called examples), each with an input and, for offline evals, a reference output. The single highest-leverage decision in your entire evaluation practice is where your dataset comes from. Most teams invent test cases from imagination; the teams whose agents survive production build their datasets from real production traces.
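
As a rough illustration of building from traces rather than imagination, the sketch below samples distinct inputs from a list of production traces. The trace fields (trace_id, input) and the function names are assumptions for the example; your tracing tool will have its own export format.

```python
import json
import random


def build_dataset(traces, size, seed=0):
    """Sample up to `size` distinct inputs from production traces."""
    unique = {}
    for trace in traces:
        # Normalise case and whitespace so near-identical inputs count once.
        key = " ".join(trace["input"].lower().split())
        unique.setdefault(key, trace)  # keep the first trace seen per input
    pool = list(unique.values())
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    sampled = rng.sample(pool, min(size, len(pool)))
    return [
        {
            "input": trace["input"],
            "reference_output": None,  # filled in later by a reviewer
            "source_trace_id": trace["trace_id"],
        }
        for trace in sampled
    ]


def to_jsonl(examples):
    """Serialise examples one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)
```

Uniform random sampling is the simplest choice here. If rare failure modes matter more than typical traffic, you would likely weight the sample toward them instead.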

→ Read the full chapter: Building evaluation datasets for AI agents

2. Evaluators — how you score outputs

An evaluator is a function that scores how well your agent performed on an example. There are four kinds — human review, code/heuristic rules, LLM-as-judge, and pairwise comparison — and choosing the right kind for each criterion is what separates evaluation that correlates with reality from evaluation theater.
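
To make "an evaluator is a function" concrete, here is a hedged sketch of two of the four kinds behind one shared signature. The ask_model argument is a hypothetical callable you would supply (prompt text in, model text out); it is not a real library API, and the PASS/FAIL prompt is only one possible judging format.

```python
import json
from typing import Callable, Optional

# An evaluator maps (input, output, reference) to a score in [0, 1].
Evaluator = Callable[[str, str, Optional[str]], float]


def valid_json_with_keys(required):
    """Code/heuristic evaluator: 1.0 if the output is a JSON object with every required key."""
    def evaluate(_input, output, _reference):
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return 0.0
        if isinstance(parsed, dict) and set(required).issubset(parsed):
            return 1.0
        return 0.0
    return evaluate


def llm_judge(ask_model, criterion):
    """LLM-as-judge evaluator built around a caller-supplied model function."""
    def evaluate(input_, output, _reference):
        prompt = (
            f"Criterion: {criterion}\n"
            f"User input: {input_}\n"
            f"Agent output: {output}\n"
            "Answer PASS or FAIL."
        )
        verdict = ask_model(prompt).strip().upper()
        return 1.0 if verdict.startswith("PASS") else 0.0
    return evaluate
```

Keeping every evaluator behind the same signature means the offline and online pipelines can run a mixed set of them without caring which kind each one is.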

→ Read the full chapter: AI agent evaluators: LLM-as-judge, heuristics, human review

3. Offline evaluation — testing before you ship

Offline evaluation runs your evaluators against your dataset to produce an experiment — the result of evaluating one version of your agent on one dataset. Experiments power four distinct use cases: benchmarking versions against each other, unit-testing discrete behaviors, regression-testing to catch degradation, and backtesting against historical data.
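
A minimal sketch of an experiment and a regression check, assuming the agent is a plain function from input text to output text and the dataset is a non-empty list of dicts with input and reference_output keys. The function names and the 0.02 tolerance are illustrative choices.

```python
from statistics import mean


def run_experiment(agent, dataset, evaluators):
    """Run one agent version over one dataset; return the mean score per evaluator.

    Assumes a non-empty dataset (mean() raises on an empty list).
    """
    scores = {name: [] for name in evaluators}
    for example in dataset:
        output = agent(example["input"])
        for name, evaluate in evaluators.items():
            score = evaluate(example["input"], output, example.get("reference_output"))
            scores[name].append(score)
    return {name: mean(values) for name, values in scores.items()}


def regressions(baseline, candidate, tolerance=0.02):
    """Return the score change for each metric that dropped more than `tolerance`."""
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if candidate[name] < baseline[name] - tolerance
    }
```

In a CI job, a non-empty result from regressions would be the signal to block the release. The tolerance exists because LLM outputs are noisy, and a small dip is not always a real regression.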

→ Read the full chapter: Offline evaluation: experiments, regression tests, backtesting

4. Online monitoring — evaluating live traffic

Online evaluation runs evaluators automatically against production traces (called runs), with no reference outputs available. This is where you catch what your offline suite missed: the malformed inputs, the edge cases, the slow drift. It requires reference-free evaluators, anomaly detection, and alerting wired into where your team actually works.
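
One simple way to turn reference-free scores into an alert is a rolling pass rate, sketched below. The class name, window size, threshold, and the example check are all assumptions for illustration. Real anomaly detection is usually more involved than a fixed threshold.

```python
from collections import deque


class RollingMonitor:
    """Track the pass rate of a reference-free check over the last `window` runs."""

    def __init__(self, window=100, min_pass_rate=0.9):
        self.results = deque(maxlen=window)  # old results fall off automatically
        self.min_pass_rate = min_pass_rate

    def record(self, passed):
        """Record one run; return True when an alert should fire."""
        self.results.append(bool(passed))
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge the rate
        pass_rate = sum(self.results) / len(self.results)
        return pass_rate < self.min_pass_rate


def non_empty_and_no_refusal(output):
    """Hypothetical reference-free check: it needs only the output itself."""
    text = output.strip().lower()
    return bool(text) and "i'm sorry, i can't" not in text
```

The return value of record is where you would hook in a notification to whatever channel your team already watches.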

→ Read the full chapter: Online evaluation and production monitoring for AI agents

5. Criteria and metrics — what "good" actually means

None of the above matters if you are measuring the wrong things. The criteria you choose — correctness, groundedness, tool-call accuracy, latency, cost per task, trajectory quality — determine whether your evaluation predicts production behavior or just produces a number that makes everyone feel safe.
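
Three of these criteria are easy to compute once you log the agent's tool calls and token counts. The sketch below is one possible set of definitions, not a standard; the tool names in the usage note and the per-1k-token prices are made up.

```python
from collections import Counter


def tool_call_accuracy(actual, expected):
    """Fraction of expected tool calls that appear in the actual trajectory (order ignored)."""
    if not expected:
        return 1.0
    remaining = Counter(actual)
    hits = 0
    for tool in expected:
        if remaining[tool] > 0:
            remaining[tool] -= 1  # each actual call can satisfy one expected call
            hits += 1
    return hits / len(expected)


def has_loop(actual, max_repeats=3):
    """True if the same tool is called more than `max_repeats` times in a row."""
    streak = 0
    previous = None
    for tool in actual:
        streak = streak + 1 if tool == previous else 1
        if streak > max_repeats:
            return True
        previous = tool
    return False


def cost_per_task(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one task given token counts and per-1k-token prices."""
    return input_tokens / 1000 * price_in_per_1k + output_tokens / 1000 * price_out_per_1k
```

For example, tool_call_accuracy(["search", "lookup_order"], ["lookup_order", "refund"]) scores 0.5 under this definition, because only one of the two expected calls was made.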

→ Read the full chapter: AI agent evaluation criteria and metrics that matter

Why most evaluation fails (and what we look for in an audit)

When we audit a production agent, the evaluation failures we find cluster into four patterns:

  1. There is no dataset, or it was invented rather than observed. Either the team never built an eval set at all, or they wrote test cases by imagining how users behave instead of sampling how they actually behave. Either way, production fails in ways the team had no way to see. The fix is always the same: build the dataset from real traces.

  2. The criteria measure outputs, not trajectories. The team scores whether the final answer looks right, but never measures whether the agent took a sane path to get there — which tools it called, how many steps, whether it looped. Output-only evaluation misses the failures that cost the most.

  3. There is no online evaluation at all. The team evaluated once, before launch, and never again. The agent has been silently degrading for weeks and nobody has the instrumentation to know.

  4. The loop is open. Even teams with both offline and online evals often never feed production failures back into the offline dataset. Each failure teaches them nothing because it never becomes a test case.

If any of those four sound like your setup, your evaluation practice has a gap — and that gap is almost certainly why your agent behaves differently in production than in testing.

How this maps to our engagement

This is not abstract. When you book a free audit, the first thing we do is run this exact process on your agent:

  • We pull your production traces and build a real dataset from them
  • We define the evaluators and criteria that match your specific failure modes
  • We run an offline experiment to establish your true baseline (usually lower than your internal number)
  • We check whether you have online monitoring, and if not, identify where it should go
  • We deliver a written report scoring all of this against the 8-Pillar Framework

The five chapters linked above are the playbook. The audit is us running it on your agent.

Free 30-min audit + written report in 24 hours. Book here.

This is the process we run on your agent

Want this run on your production agent?

Everything in this guide is what we do on every engagement. Book a free 30-minute call. We scope where your agent is breaking, then commit to a result and work until we hit it.