AI Agent Evaluation: The Complete Process We Run on Every Production Agent

Offline Evaluation for AI Agents: Experiments, Regression Tests, Backtesting

How offline evaluation works for AI agents: running experiments, the four use cases (benchmarking, unit tests, regression tests, backtesting), and how to catch regressions before they ship. Part of our evaluation process.

Offline evaluation is "test before you ship." You run your agent on your dataset and score its outputs with your evaluators to produce an experiment: a measurement of how one version of your agent performs on a known set of test cases. This is what lets you change your agent with confidence instead of hope.

The core unit: an experiment

An experiment is the result of evaluating a specific version of your agent on a specific dataset. It combines three things:

One agent version (a prompt, a model, a tool config, a graph topology)
One dataset (your examples with reference outputs)
Your evaluators (the scoring functions)

Run them together and you get an experiment: a score per example, aggregated into metrics you can compare. Change one thing about your agent, re-run, and you have a second experiment. Compare the two and you know whether the change helped.
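As a rough illustration, an experiment can be sketched in a few lines of Python. Everything here is an assumption made for the example, not a specific framework's API: the `Example` record, the `run_experiment` function, the `exact_match` evaluator, and the two stub agents are hypothetical stand-ins for a real agent, dataset, and scoring function.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Example:
    id: str
    input: str
    reference: str


# Assumed signatures: an agent maps input text to output text,
# an evaluator maps (output, reference) to a score between 0 and 1.
Agent = Callable[[str], str]
Evaluator = Callable[[str, str], float]


def run_experiment(agent: Agent, dataset: list[Example],
                   evaluators: dict[str, Evaluator]) -> dict:
    """Score one agent version on one dataset with every evaluator."""
    rows = []
    for ex in dataset:
        output = agent(ex.input)
        scores = {name: fn(output, ex.reference) for name, fn in evaluators.items()}
        rows.append({"id": ex.id, "output": output, "scores": scores})
    # Aggregate the per-example scores into one metric per evaluator.
    metrics = {name: mean(r["scores"][name] for r in rows) for name in evaluators}
    return {"rows": rows, "metrics": metrics}


def exact_match(output: str, reference: str) -> float:
    return 1.0 if output.strip() == reference.strip() else 0.0


dataset = [Example("ex1", "2+2", "4"), Example("ex2", "capital of France", "Paris")]


def agent_v1(text: str) -> str:  # stand-in for the current agent version
    return {"2+2": "4"}.get(text, "unknown")


def agent_v2(text: str) -> str:  # stand-in for the changed agent version
    return {"2+2": "4", "capital of France": "Paris"}.get(text, "unknown")


evaluators = {"exact_match": exact_match}
exp_v1 = run_experiment(agent_v1, dataset, evaluators)
exp_v2 = run_experiment(agent_v2, dataset, evaluators)
print(exp_v1["metrics"], exp_v2["metrics"])
```

Keeping the per-example rows alongside the aggregate metrics is deliberate: the comparison between two experiments is only useful if you can see which examples moved.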

This is the entire point of offline evaluation: knowing whether a change made your agent better or worse, before users find out.

The four use cases for offline evaluation

The same experiment machinery serves four distinct purposes. Knowing which one you are doing keeps you honest about what the numbers mean.

1. Benchmarking — comparing versions

You run experiments across multiple agent versions and compare. Which prompt scores higher? Does the new model beat the old one on your tasks? Is the cheaper model good enough? Benchmarking answers "which version should we ship?"

The key discipline: change one variable at a time. If you change the prompt and the model and the retrieval config all at once, the experiment tells you the combined effect but not which change caused it. You learn nothing transferable.
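One way to enforce the one-variable rule is to diff the two agent configurations before comparing their experiments. This is a minimal sketch under assumptions: the config keys (`prompt`, `model`, `retrieval_top_k`) and the helper names are made up for illustration, and your agent's configuration will look different.

```python
def changed_fields(base: dict, candidate: dict) -> list[str]:
    """Return the config keys whose values differ between two agent versions."""
    keys = sorted(set(base) | set(candidate))
    return [k for k in keys if base.get(k) != candidate.get(k)]


def check_single_change(base: dict, candidate: dict) -> str:
    """Refuse to benchmark a candidate that changes more than one variable."""
    diff = changed_fields(base, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed field, got {diff}")
    return diff[0]


# Hypothetical agent configurations.
baseline = {"prompt": "v3", "model": "model-a", "retrieval_top_k": 5}
candidate_ok = {**baseline, "prompt": "v4"}
candidate_confounded = {**baseline, "prompt": "v4", "model": "model-b"}

print(check_single_change(baseline, candidate_ok))
print(changed_fields(baseline, candidate_confounded))
```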

2. Unit tests — validating discrete behaviors

You write small, focused datasets that test one specific behavior: "does the agent always refuse out-of-scope requests?", "does it always include a citation when it makes a factual claim?". These run like software unit tests — fast, focused, pass/fail.

Unit tests are where code/heuristic evaluators shine. The criterion is objective, so the test is deterministic and cheap.
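A sketch of what such a check might look like, assuming citations appear as bracketed numbers like "[1]" and that refusals contain one of a few known phrases. Both conventions, along with `stub_agent` and `run_unit_test`, are assumptions for the example; substitute whatever your agent actually emits.

```python
import re

CITATION = re.compile(r"\[\d+\]")  # assumed citation format, e.g. "[1]"
REFUSAL_MARKERS = ("i can't help with that", "outside what i can help with")


def has_citation(output: str) -> bool:
    return bool(CITATION.search(output))


def is_refusal(output: str) -> bool:
    text = output.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def run_unit_test(agent, inputs, check) -> list[str]:
    """Return the inputs whose output fails the check; an empty list means pass."""
    return [text for text in inputs if not check(agent(text))]


def stub_agent(text: str) -> str:  # stand-in for a real agent
    if "weather" in text:
        return "Sorry, that is outside what I can help with."
    return "Refunds are processed within 5 days [1]."


out_of_scope = ["what's the weather tomorrow?", "weather in Paris?"]
failures = run_unit_test(stub_agent, out_of_scope, is_refusal)
print("PASS" if not failures else f"FAIL: {failures}")
```

Because the checks are plain string and regex operations, the whole suite runs in milliseconds and gives the same answer every time.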

3. Regression tests — catching degradation

This is the most important use case and the one most teams neglect. A regression test runs your full eval suite on every change and fails the build if any score drops beyond a set threshold. It is the safety net that stops a "small prompt tweak" from silently breaking three other behaviors.

The pattern:

On every deploy (or PR):
  1. Run the full eval suite against the current dataset
  2. Compare scores to the last known-good baseline
  3. If any metric dropped beyond threshold → fail the build
  4. Surface WHICH examples regressed, with diffs

Without regression testing, every change to your agent is a gamble. With it, you ship changes knowing they did not break anything you were already measuring. This is the single highest-ROI piece of evaluation infrastructure for a team shipping fast.
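The pattern above can be sketched as a small gate function. The result shape (a `metrics` dict plus a per-example pass/fail map), the 0.02 threshold, and the function names are assumptions for illustration. A real setup would load the baseline from wherever your last known-good run is stored and would also print output diffs for the regressed examples.

```python
def find_regressions(baseline: dict, current: dict, threshold: float = 0.02):
    """Compare two results shaped {"metrics": {name: float}, "examples": {id: bool}}."""
    # Metrics that dropped by more than the threshold: name -> (old, new).
    dropped = {
        name: (baseline["metrics"][name], score)
        for name, score in current["metrics"].items()
        if name in baseline["metrics"]
        and baseline["metrics"][name] - score > threshold
    }
    # Examples that passed in the baseline and fail now.
    regressed = sorted(
        ex_id for ex_id, passed in current["examples"].items()
        if baseline["examples"].get(ex_id) and not passed
    )
    return dropped, regressed


def gate(baseline: dict, current: dict, threshold: float = 0.02) -> int:
    """Return a process exit code: 1 fails the build, 0 lets it through."""
    dropped, regressed = find_regressions(baseline, current, threshold)
    for name, (old, new) in dropped.items():
        print(f"FAIL {name}: {old:.2f} -> {new:.2f}")
    for ex_id in regressed:
        print(f"  regressed example: {ex_id}")
    return 1 if dropped else 0


# Hypothetical stored baseline and freshly computed results.
baseline = {"metrics": {"pass_rate": 0.75},
            "examples": {"a": True, "b": True, "c": True, "d": False}}
current = {"metrics": {"pass_rate": 0.50},
           "examples": {"a": True, "b": False, "c": False, "d": True}}

exit_code = gate(baseline, current)
# In CI you would finish with: sys.exit(exit_code)
```

Note that example "d" improved while "b" and "c" regressed. Reporting regressed examples separately from the aggregate keeps a gain in one place from hiding a loss in another.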

4. Backtesting — evaluating against historical data

You run a new agent version against historical production inputs and see how it would have performed. Backtesting answers "if we had shipped this version last month, would those failures have happened?" It is especially powerful for validating that a fix actually resolves the failures it was meant to, using the real inputs that triggered them.
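A minimal backtest could look like the following. The record schema (`id`, `input`, `failed_in_prod`), the reference-free `answered` check, and `candidate_agent` are hypothetical. Production inputs usually have no reference output, so the sketch assumes a check that needs only the input and the new output.

```python
from typing import Callable


def backtest(agent: Callable[[str], str],
             historical: list[dict],
             check: Callable[[str, str], bool]) -> dict[str, list[str]]:
    """Replay historical inputs through a new agent version and bucket the results."""
    report = {"fixed": [], "still_failing": [], "newly_failing": []}
    for rec in historical:
        passed = check(rec["input"], agent(rec["input"]))
        if rec["failed_in_prod"] and passed:
            report["fixed"].append(rec["id"])
        elif rec["failed_in_prod"]:
            report["still_failing"].append(rec["id"])
        elif not passed:
            report["newly_failing"].append(rec["id"])
    return report


# Hypothetical production records, including two known failures.
historical = [
    {"id": "t1", "input": "cancel my order", "failed_in_prod": True},
    {"id": "t2", "input": "where is my order", "failed_in_prod": False},
    {"id": "t3", "input": "cancel subscription", "failed_in_prod": True},
]


def candidate_agent(text: str) -> str:
    # Stand-in for the new version: handles order cancellation,
    # but still returns nothing for subscription requests.
    if "cancel my order" in text:
        return "Your order has been cancelled."
    if "where is" in text:
        return "Your order ships tomorrow."
    return ""


def answered(_input: str, output: str) -> bool:
    return bool(output.strip())


report = backtest(candidate_agent, historical, answered)
print(report)
```

The `still_failing` bucket is the one to watch when validating a fix: it lists the real inputs the new version was supposed to handle and does not.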

The offline evaluation workflow

Putting it together, the offline loop is four steps:

1. Create the dataset: from production traces, ideally (see the datasets chapter).
2. Define the evaluators: the right type per criterion (see the evaluators chapter).
3. Run the experiment: execute your agent version against the dataset, score with the evaluators.
4. Analyze: benchmark vs other versions, check for regressions, drill into the examples that failed.

The analysis step is where the value is. A single aggregate score ("87% pass") tells you almost nothing. The useful output is: which examples failed, what category they fall into, and whether this version regressed against the last one. That is what tells you what to fix next.
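To make that concrete, here is a small sketch that breaks an aggregate pass rate down by failure category. It assumes each result row carries a `category` label and a `passed` flag; those field names and the category values are invented for the example.

```python
from collections import defaultdict


def analyze(rows: list[dict]) -> dict:
    """Group failing example ids by category, largest group first."""
    failures_by_category: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        if not row["passed"]:
            failures_by_category[row["category"]].append(row["id"])
    ranked = sorted(failures_by_category.items(),
                    key=lambda kv: len(kv[1]), reverse=True)
    pass_rate = sum(r["passed"] for r in rows) / len(rows)
    return {"pass_rate": pass_rate, "failures_by_category": ranked}


# Hypothetical experiment rows.
rows = [
    {"id": "e1", "category": "tool_call", "passed": True},
    {"id": "e2", "category": "tool_call", "passed": False},
    {"id": "e3", "category": "tool_call", "passed": False},
    {"id": "e4", "category": "refusal", "passed": True},
    {"id": "e5", "category": "formatting", "passed": False},
    {"id": "e6", "category": "refusal", "passed": True},
]

summary = analyze(rows)
print(f"pass rate: {summary['pass_rate']:.0%}")
for category, ids in summary["failures_by_category"]:
    print(f"{category}: {len(ids)} failing ({', '.join(ids)})")
```

Here the aggregate says 50%, while the breakdown says most of the failures are tool calls, which is the part that tells you where to look first.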

The trap: optimizing the eval instead of the agent

A warning that comes straight from production audits. Once you have an offline eval with a number attached, there is a powerful pull to optimize that number rather than the agent's real-world behavior. Teams start tuning prompts to pass the eval, and the eval score climbs while production quality stays flat or drops.

This is the same failure UC Berkeley demonstrated for public agent benchmarks in 2026 — all eight major benchmarks were gameable to ~100% without the agents actually getting better. Your private eval is gameable the same way.

The defense: keep the dataset rooted in real production traces, refresh it continuously, and pair offline evals with online monitoring so production reality keeps your offline number honest.

What we do in an audit

We set up regression testing first — it is the highest-leverage piece of infrastructure for most teams and the one most often missing. Then we run a clean benchmark of your current version against your real dataset to establish a true baseline, and we backtest any fixes against the historical inputs that caused your known failures, so we can prove a fix actually works.

