AI Agent Evaluation: The Complete Process We Run on Every Production Agent

AI Agent Evaluation Criteria and Metrics That Actually Matter

Which metrics to use when evaluating AI agents: correctness, groundedness, tool-call accuracy, trajectory quality, latency, and cost per task. Why output-only metrics miss the failures that cost the most. Part of our evaluation process.

You can have a perfect dataset, well-chosen evaluators, and clean offline and online pipelines, and still learn nothing if you are measuring the wrong things. This chapter is about choosing criteria that predict production behavior instead of producing a number that makes everyone feel safe.

The fundamental split: output metrics vs trajectory metrics

The most important distinction in agent evaluation:

Output metrics measure the final answer. Was it correct? Grounded? Well-formatted? Appropriately toned?
Trajectory metrics measure the path the agent took to produce the answer. Which tools did it call? In what order? How many steps? Did it loop? How much did it cost?

Most teams measure only outputs. This is the single most common evaluation gap we find in audits. An agent can produce a correct final answer while doing something insane to get there: calling the same tool 14 times, looping for 40 steps, or burning $3 in tokens on a question that should cost $0.02. Output-only evaluation scores that as a pass. It is not a pass. It is a production incident waiting for scale.

Agents are defined by their trajectories. Evaluating only their outputs is like grading a pilot solely on whether the plane landed, ignoring that they nearly stalled twice on the way down.
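To make the split concrete, here is a minimal sketch in Python. The `Step` and `Run` types, the exact-match correctness check, and the step and cost budgets are all assumptions made for illustration, not a prescribed schema. It scores the same run twice, once on its output and once on its trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str        # hypothetical tool name
    cost_usd: float  # dollars spent on this step (token pricing applied upstream)


@dataclass
class Run:
    final_answer: str
    expected_answer: str
    steps: list[Step] = field(default_factory=list)


def output_pass(run: Run) -> bool:
    """Output metric: looks only at the final answer (naive exact match)."""
    return run.final_answer.strip().lower() == run.expected_answer.strip().lower()


def trajectory_pass(run: Run, max_steps: int = 10, max_cost_usd: float = 0.10) -> bool:
    """Trajectory metric: looks only at the path, not at the answer."""
    total_cost = sum(step.cost_usd for step in run.steps)
    return len(run.steps) <= max_steps and total_cost <= max_cost_usd


# A run that reaches the right answer by calling the same tool 14 times.
wasteful = Run(
    final_answer="Paris",
    expected_answer="paris",
    steps=[Step(tool="search", cost_usd=0.20) for _ in range(14)],
)

print(output_pass(wasteful))      # True
print(trajectory_pass(wasteful))  # False
```

An output-only evaluator would stop at the first line and record a pass. The second check is what surfaces the problem.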

The criteria that matter (and which kind of metric each is)

Output criteria

Correctness: is the answer right? (Needs a reference answer or a calibrated judge; it is the hardest criterion to measure without a reference.)
Groundedness: is every claim supported by the retrieved context, with no extrapolation beyond it? (Measurable without a reference; the highest-value online metric.)
Format validity: is the output well-formed and schema-valid? (Deterministic and cheap, so always include it.)
Refusal correctness: did it answer in-scope requests and refuse out-of-scope ones? (Catches the Air Canada / Cursor class of failure.)
Tone / safety: is the response appropriate, with no PII leakage, policy violations, or toxic content?
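The two deterministic criteria in this list are simple enough to sketch directly. The version below assumes the agent emits JSON and that each eval example carries an `in_scope` label. The `REQUIRED_FIELDS` schema and both function names are hypothetical.

```python
import json

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def format_valid(raw_output: str) -> bool:
    """Format validity: output parses as a JSON object with the required typed fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(
        name in parsed and isinstance(parsed[name], expected_type)
        for name, expected_type in REQUIRED_FIELDS.items()
    )


def refusal_correct(in_scope: bool, refused: bool) -> bool:
    """Refusal correctness: answer in-scope requests, refuse out-of-scope ones."""
    return in_scope != refused


print(format_valid('{"answer": "Refunds take 5 days.", "sources": ["policy.md"]}'))  # True
print(format_valid("Refunds take 5 days."))                                          # False
print(refusal_correct(in_scope=False, refused=False))                                # False
```

Correctness, groundedness, and tone usually need a reference or a judge model, so they are left out of this sketch.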

Trajectory criteria

Tool-call accuracy: did it call the right tool with valid arguments? Target ≥95% on your eval set for any agent that takes actions.
Step count per resolution: how many steps it takes to resolve a task. A rising trend is an early warning of degradation.
Loop detection: did it revisit the same state or repeat the same tool call without progress? (See the LangGraph recursion failure mode.)
Cost per task: total tokens (and dollars) spent to resolve one task. This is the metric your CFO cares about, and most teams never track it per task.
Latency per resolution: end-to-end time, including all tool calls and model round-trips.
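Three of these trajectory criteria can be computed from a logged list of tool calls. The sketch below rests on several assumptions: calls are logged as `{"tool": ..., "args": ...}` dicts, accuracy is scored position by position against an expected sequence, and "no progress" is approximated as an identical call repeated more than `max_repeats` times. The tool names and token prices are made up.

```python
import json
from collections import Counter


def tool_call_accuracy(actual: list[dict], expected: list[dict]) -> float:
    """Fraction of positions where the actual call matches the expected call exactly."""
    if not actual and not expected:
        return 1.0
    matches = sum(1 for a, e in zip(actual, expected) if a == e)
    return matches / max(len(actual), len(expected))


def has_loop(calls: list[dict], max_repeats: int = 2) -> bool:
    """Flag a run where the same tool is called with the same arguments too often."""
    counts = Counter(json.dumps(call, sort_keys=True) for call in calls)
    return any(n > max_repeats for n in counts.values())


def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Dollar cost of one resolved task, given per-1k-token prices."""
    return (input_tokens / 1000) * usd_per_1k_input + (output_tokens / 1000) * usd_per_1k_output


expected = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "issue_refund", "args": {"order_id": "A1"}},
]
actual = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "issue_refund", "args": {"order_id": "A2"}},  # right tool, wrong argument
]
looping = [{"tool": "lookup_order", "args": {"order_id": "A1"}}] * 3

print(tool_call_accuracy(actual, expected))  # 0.5
print(has_loop(looping))                     # True
print(cost_per_task(2000, 500, usd_per_1k_input=0.001, usd_per_1k_output=0.002))
```

Step count is `len(calls)` on the same log. Latency needs timestamps on each call, which this sketch does not model.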

The operational metric

Escalation / deflection rate: for customer-facing agents, what fraction of conversations resolve without a human, and is that fraction trending the right way? This is the metric that ties evaluation to business outcomes.

Choosing your criteria: the leverage rule

You cannot measure everything well, so prioritize. The rule: measure the criteria whose failure costs you the most.

For a customer-facing support agent, the costly failures are wrong answers to customers and incorrect refusals — so groundedness, correctness, and refusal correctness lead, with cost per task close behind.

For an internal autonomous pipeline, the costly failures are loops and runaway cost — so trajectory metrics (step count, loop detection, cost per task) lead.

For a legal or financial agent, the costly failure is a confident hallucination — so groundedness and correctness dominate, measured with human review at high stakes.

Map your criteria to your expensive failures. A generic metrics dashboard that measures everything equally measures nothing usefully.
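One way to apply the leverage rule is to rank criteria by expected failure cost, meaning failure rate multiplied by cost per failure. That formula is our simplification for this sketch, not a standard, and every number below is a placeholder to be replaced with your own incident data.

```python
def rank_by_leverage(failure_modes: list[dict]) -> list[dict]:
    """Sort criteria by expected cost of failure (rate x cost per failure), highest first."""
    return sorted(
        failure_modes,
        key=lambda mode: mode["failure_rate"] * mode["cost_per_failure_usd"],
        reverse=True,
    )


# Hypothetical estimates for a customer-facing support agent.
support_agent = [
    {"criterion": "format validity", "failure_rate": 0.05, "cost_per_failure_usd": 1},
    {"criterion": "groundedness", "failure_rate": 0.02, "cost_per_failure_usd": 500},
    {"criterion": "cost per task", "failure_rate": 0.10, "cost_per_failure_usd": 3},
]

for mode in rank_by_leverage(support_agent):
    print(mode["criterion"])
# groundedness
# cost per task
# format validity
```

The ranking only decides where to spend evaluation effort first. It does not replace the cheap deterministic checks, which are worth running regardless.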

The metric that lies: aggregate pass rate

A single "87% pass rate" is the most dangerous number in evaluation. It hides everything that matters:

Which 13% failed? If it is the high-stakes cases, 87% is a disaster. If it is trivial edge cases, 87% is fine.
Did the failures cluster in one category (all the refund questions)? That is a fixable pattern, invisible in the aggregate.
Is 87% up or down from the last version? The number alone is meaningless without the trend.

Always decompose. Report pass rate per category, track it over time, and surface the specific failing examples. The aggregate is a headline, not an answer.
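A minimal sketch of the decomposition, assuming each eval result is a dict with a `category` and a boolean `passed` field. The data is invented to mirror the 87% example above: the aggregate looks fine while one category fails completely.

```python
from collections import defaultdict


def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Return {category: pass rate} from a list of per-example results."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result["category"]] += 1
        passes[result["category"]] += int(result["passed"])
    return {category: passes[category] / totals[category] for category in totals}


# Hypothetical eval run: 100 examples, 87 passing overall.
results = (
    [{"category": "billing", "passed": True}] * 50
    + [{"category": "shipping", "passed": True}] * 37
    + [{"category": "refunds", "passed": False}] * 13
)

aggregate = sum(r["passed"] for r in results) / len(results)
by_category = pass_rate_by_category(results)
failing_examples = [r for r in results if not r["passed"]]

print(aggregate)              # 0.87
print(by_category)            # {'billing': 1.0, 'shipping': 1.0, 'refunds': 0.0}
print(len(failing_examples))  # 13
```

Storing `by_category` for every version gives you the trend, and `failing_examples` is the list to read by hand.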

How metrics tie the whole process together

The five chapters of this guide form one system, and metrics are what flow through it:

Your dataset provides the examples.
Your evaluators score them against these criteria.
Offline evaluation tracks the metrics version-over-version to catch regressions.
Online monitoring tracks the same metrics on live traffic to catch decay.
The metrics that regress in production become new test cases, closing the loop.
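The last two steps of that loop can be sketched as a version-over-version comparison plus a conversion from a failing production trace to a new dataset entry. The metric names, the `tolerance` value, and the trace fields (`input`, `corrected_output`) are assumptions. The comparison also assumes higher is better, so cost and latency would need the sign flipped.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.01) -> dict[str, tuple[float, float]]:
    """Return metrics where the candidate is worse than baseline by more than tolerance.

    Assumes higher is better for every metric passed in.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }


def to_test_case(trace: dict) -> dict:
    """Turn a reviewed production failure into a new eval dataset entry."""
    return {
        "input": trace["input"],
        "expected": trace["corrected_output"],  # assumes a human supplied the fix
        "source": "production_regression",
    }


baseline = {"groundedness": 0.96, "tool_call_accuracy": 0.97}
candidate = {"groundedness": 0.97, "tool_call_accuracy": 0.91}

print(find_regressions(baseline, candidate))
# {'tool_call_accuracy': (0.97, 0.91)}

new_case = to_test_case({
    "input": "Can I get a refund after 60 days?",
    "corrected_output": "No. Refunds are available within 30 days of purchase.",
})
print(new_case["source"])  # production_regression
```

The same `find_regressions` call works offline between two versions and online between two time windows, as long as the metric definitions stay identical.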

Choose the criteria that match your expensive failures, measure both outputs and trajectories, decompose every aggregate, and track everything as a trend. That is evaluation that predicts production.

What we do in an audit

We map your criteria to your actual expensive failure modes — most teams are measuring generic output quality and missing the trajectory metrics where their real risk lives. We instrument tool-call accuracy, step count, loop detection, and cost per task (almost always absent), and we decompose your pass rate by category so you can see the failures the aggregate was hiding. This is the measurement layer underneath the whole 8-Pillar Framework.
