AI Agent Evaluation: The Complete Process We Run on Every Production Agent

Online Evaluation and Production Monitoring for AI Agents

How online evaluation works: scoring live production traffic in real time, reference-free evaluators, anomaly detection, alerting, and the feedback loop that catches silent agent degradation. Part of our evaluation process.

Online evaluation is "monitor in production." Where offline evaluation tests your agent against a known dataset before you ship, online evaluation scores real user interactions on live traffic, in real time. It is how you catch what your offline suite missed — and it is the component most teams skip entirely, which is why their agents degrade silently for weeks before anyone notices.

What changes when you go online

The fundamental difference: there is no reference output.

In offline evaluation, every example has a known-correct answer to compare against. In production, a real user sends a real input and your agent produces a response — and nobody knows the "right" answer. You cannot compare to a reference because there is none.

This single fact reshapes everything. Your evaluators have to score quality without a ground-truth answer. These are called reference-free evaluators, and they are the heart of online evaluation.

The core units: runs and threads

  • A run is a single execution trace — one input, the agent's full reasoning, every tool call, the intermediate steps, and the final output. In production, every user interaction generates a run.
  • A thread is a collection of related runs forming a multi-turn conversation. For conversational agents, the thread is often the right unit to evaluate, because quality emerges across turns, not within a single one.

Online evaluation runs evaluators against these production runs and threads automatically, as they happen.
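One way to picture the two units is as a pair of small data structures. The sketch below is illustrative only: the class names, field names, and the `group_into_threads` helper are assumptions made for this example, not the schema of any particular tracing tool.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


@dataclass
class Run:
    """One execution trace: a single input through to the final output."""
    run_id: str
    thread_id: str
    user_input: str
    final_output: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    retrieved_context: list[str] = field(default_factory=list)
    cost_usd: float = 0.0


@dataclass
class Thread:
    """A multi-turn conversation: related runs in the order they happened."""
    thread_id: str
    runs: list[Run] = field(default_factory=list)

    def total_cost(self) -> float:
        return sum(run.cost_usd for run in self.runs)


def group_into_threads(runs: list[Run]) -> dict[str, Thread]:
    """Group runs by thread_id, preserving arrival order within each thread."""
    threads: dict[str, Thread] = {}
    for run in runs:
        threads.setdefault(run.thread_id, Thread(run.thread_id)).runs.append(run)
    return threads
```

Run-level evaluators would take a `Run`; thread-level evaluators would take a `Thread` and can look across turns.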

Reference-free evaluators

Without a ground-truth answer, you score the things you can assess from the trace alone:

  • Groundedness — does the answer stay within what the retrieved context supports? (No reference needed; you check the answer against the context the agent actually retrieved.)
  • Format / schema validity — is the output well-formed? (Deterministic, no reference needed.)
  • Safety checks — does the output contain PII, policy violations, prompt-injection artifacts, toxic content?
  • Refusal correctness — did the agent correctly refuse an out-of-scope request, or did it answer something it should not have?
  • Tool-call validity — were the tool calls well-formed with valid arguments?
  • Quality heuristics — coherence, relevance to the question, completeness — scored by a calibrated LLM-as-judge in reference-free mode.
  • Trajectory sanity — step count, loop detection, cost per task. (See the metrics chapter.)

You will not catch everything without a reference: with no ground truth, you cannot definitively know that an answer was factually correct. But you can still catch most production failures, because they tend to be ungrounded claims, malformed outputs, safety violations, wrong refusals, and runaway trajectories, all of which are detectable reference-free.
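The deterministic checks in the list above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the function names, the `TOOL_SCHEMAS` registry, and the email-only PII pattern (a real safety check would cover far more). Groundedness and the quality heuristics usually need a calibrated LLM judge and are not shown.

```python
import json
import re

# Deliberately narrow pattern: it only catches things shaped like an email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical tool registry: tool name -> required argument names.
TOOL_SCHEMAS = {"search_orders": {"customer_id"}, "refund": {"order_id", "amount"}}


def is_valid_json_object(output: str, required_keys: set[str]) -> bool:
    """Format check: output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def contains_email(output: str) -> bool:
    """Safety check (narrow): flags anything that looks like an email address."""
    return EMAIL_RE.search(output) is not None


def tool_call_is_valid(name: str, arguments: dict) -> bool:
    """Tool-call validity: the tool is known and every required argument is present."""
    required = TOOL_SCHEMAS.get(name)
    return required is not None and required.issubset(arguments)


def has_loop(tool_calls: list[tuple[str, dict]], max_repeats: int = 3) -> bool:
    """Trajectory sanity: the same call with the same arguments repeated too often."""
    counts: dict[str, int] = {}
    for name, arguments in tool_calls:
        key = name + json.dumps(arguments, sort_keys=True)
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= max_repeats:
            return True
    return False
```

None of these functions needs a reference answer: each one inspects only the output or the trace itself.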

The four-step online workflow

  1. Deploy — your agent runs in production, generating runs and threads with no reference outputs.
  2. Configure evaluators — set your reference-free evaluators to run automatically against production traces (all of them, or a sampled subset if volume is high).
  3. Monitor in real time — track the evaluator scores as time series, with anomaly detection on each metric.
  4. Alert and loop: when a metric crosses a threshold, alert the team where they work, and feed the failing traces back into your offline dataset.
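Steps 2 to 4 can be sketched as a small scoring loop. This is a minimal illustration under stated assumptions: traces are plain dicts with a `run_id` key, evaluators return a score between 0 and 1, and the names `should_sample`, `evaluate_trace`, and `run_online_eval` are invented for this example. Sampling hashes the run ID so the same run always gets the same decision.

```python
import hashlib
from typing import Callable

Evaluator = Callable[[dict], float]  # takes a trace dict, returns a score in [0, 1]


def should_sample(run_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same run_id always gets the same decision."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def evaluate_trace(trace: dict, evaluators: dict[str, Evaluator]) -> dict[str, float]:
    """Run every configured evaluator against one production trace."""
    return {name: evaluator(trace) for name, evaluator in evaluators.items()}


def run_online_eval(
    traces: list[dict],
    evaluators: dict[str, Evaluator],
    sample_rate: float = 1.0,
    threshold: float = 0.5,
) -> tuple[dict[str, dict[str, float]], list[dict]]:
    """Score sampled traces; return (scores keyed by run_id, failing traces)."""
    scores: dict[str, dict[str, float]] = {}
    failing: list[dict] = []
    for trace in traces:
        if not should_sample(trace["run_id"], sample_rate):
            continue
        result = evaluate_trace(trace, evaluators)
        scores[trace["run_id"]] = result
        if any(score < threshold for score in result.values()):
            failing.append(trace)  # candidates for the offline dataset
    return scores, failing
```

In a real deployment the scores would be written to a time-series store and the failing traces queued for labeling. Both are returned in memory here to keep the sketch self-contained.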

Catching silent degradation

The failure that online monitoring exists to prevent is agent decay: the agent worked when it shipped, two months later it is measurably worse, and no dashboard showed the decline.

Decay does not show up in traditional product dashboards (completion rate, response time, CSAT) until it is severe and customers are already complaining. It shows up much earlier in evaluation metrics: hallucination rate, tool-call accuracy, groundedness, cost per task. Online evaluation watches those metrics continuously, so a groundedness drift on course to reach 20% over three weeks triggers an alert in week one, while it is still small, instead of a customer complaint in week four.
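The arithmetic behind an early alert can be made concrete. The sketch below assumes a launch baseline for the metric and a 5% tolerance on the weekly mean. Both numbers, and the function names, are illustrative choices and not recommendations.

```python
from statistics import mean
from typing import Optional


def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop from baseline (0.10 means the metric fell by 10%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


def first_alert_week(
    baseline: float, weekly_scores: list[list[float]], tolerance: float = 0.05
) -> Optional[int]:
    """Return the 1-based week whose mean first drops past tolerance, else None."""
    for week, scores in enumerate(weekly_scores, start=1):
        if relative_drop(baseline, mean(scores)) > tolerance:
            return week
    return None
```

For example, with a groundedness baseline of 0.90 and a steady slide to 0.72 over three weeks (a 20% drop), the week-one mean is about 0.84. That is already a drop of roughly 6.7%, which breaches the 5% tolerance in the first week.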

The metrics to watch as time series:

  • Hallucination / ungrounded-claim rate
  • Tool-call accuracy
  • Retrieval relevance score
  • Cost per resolved task
  • Step count per resolution (rising = the agent is working harder for the same result)
  • Refusal rate (sudden changes in either direction signal a problem)

Wire anomaly detection on each, and route alerts to Slack, PagerDuty, or your incident tool — not a dashboard nobody opens. The whole value of online evaluation evaporates if the alert lands somewhere no one looks.
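A rolling z-score is one simple way to put anomaly detection on a metric. The sketch below is an assumption-laden illustration: the window size, threshold, and class name are arbitrary, and `notify` stands in for whatever sends a message to your incident tool (no real Slack or PagerDuty API is called here).

```python
from collections import deque
from statistics import mean, stdev
from typing import Callable, Optional


class RollingAnomalyDetector:
    """Flags a value that sits far from the mean of the recent window."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0, min_points: int = 10):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_points = min_points

    def observe(self, value: float) -> Optional[float]:
        """Return the z-score if the value is anomalous, else None; then record it."""
        z = None
        if len(self.history) >= self.min_points:
            mu, sigma = mean(self.history), stdev(self.history)
            # A perfectly flat history (sigma == 0) is skipped in this sketch.
            if sigma > 0 and abs(value - mu) / sigma >= self.z_threshold:
                z = (value - mu) / sigma
        self.history.append(value)
        return z


def check_metric(
    name: str,
    value: float,
    detector: RollingAnomalyDetector,
    notify: Callable[[str], None],
) -> bool:
    """Observe one data point and send an alert if it is anomalous."""
    z = detector.observe(value)
    if z is None:
        return False
    notify(f"{name} anomaly: value={value:.3f}, z={z:+.1f}")
    return True
```

One detector instance per metric keeps the baselines separate. A point-anomaly check like this catches sudden jumps, while slow drift is better caught by comparing against a fixed baseline.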

The feedback loop — closing the circle

This is the step that turns monitoring into improvement. When online evaluation catches a failing trace:

  1. The failing trace is captured with its full context.
  2. It is added to your offline dataset as a new example, with the correct output labeled.
  3. The next version of your agent is now evaluated against this exact failure in offline regression testing.
  4. As long as that regression test runs before each release, the agent cannot silently regress on this failure again, because the test will catch it.

This loop is what separates an agent that gets more reliable over time from one that decays. Every production failure becomes a permanent test case. The agent's blind spots shrink with every incident instead of recurring.
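The loop can be sketched with a JSONL file standing in for the offline dataset. The file format, field names, and function names below are assumptions made for illustration. The one design point worth noting is that `expected_output` stays empty until a human labels it, and unlabeled examples are kept out of regression tests.

```python
import json
from pathlib import Path


def add_failure_to_dataset(trace: dict, dataset_path: Path, failed_evaluator: str) -> bool:
    """Append a failing trace as a regression example. Returns False if already present."""
    existing_ids = set()
    if dataset_path.exists():
        with dataset_path.open(encoding="utf-8") as f:
            existing_ids = {json.loads(line)["run_id"] for line in f if line.strip()}
    if trace["run_id"] in existing_ids:
        return False
    example = {
        "run_id": trace["run_id"],
        "input": trace["input"],
        "observed_output": trace["output"],
        "failed_evaluator": failed_evaluator,
        "expected_output": None,  # filled in by a human reviewer before use
    }
    with dataset_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(example) + "\n")
    return True


def load_regression_examples(dataset_path: Path) -> list[dict]:
    """Return only labeled examples, ready for offline regression tests."""
    with dataset_path.open(encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return [row for row in rows if row["expected_output"] is not None]
```

Deduplicating on `run_id` keeps a noisy incident from flooding the dataset with copies of the same trace.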

What we do in an audit

The most common finding: there is no online evaluation at all. The team evaluated once before launch and has been flying blind since. We identify where reference-free evaluators should run, wire trajectory and groundedness monitoring into your observability stack (Langfuse, Braintrust, LangSmith, Helicone, OpenTelemetry, or whatever you use), set anomaly alerts that reach a human, and build the feedback loop so failures become test cases. These are two of the eight pillars in our framework: observability and evals.

Previous chapter: Offline evaluation · Next chapter: Evaluation criteria and metrics that matter.
