AI Agent Evaluators: LLM-as-Judge, Heuristics, Human Review, and Pairwise

The four types of AI agent evaluators — human review, code/heuristic rules, LLM-as-judge, and pairwise comparison — when to use each, and how to calibrate an LLM-as-judge so it correlates with human judgment. Part of our evaluation process.

An evaluator is a function that scores how well your agent performed on an example. Choosing the right type of evaluator for each thing you want to measure is what separates evaluation that correlates with reality from evaluation that produces comforting but meaningless numbers. There are four types.

The four types of evaluator

1. Code / heuristic evaluators

A deterministic function that checks something objective: does the output contain a valid JSON object? Is the dollar amount within range? Did the agent call the right tool? Does the response match a regex? Is the latency under the SLA?

Use when: the criterion is objective and machine-checkable. These checks are fast, essentially free, and deterministic, and they should always be your first line of evaluation. If something can be checked with code, check it with code; never spend an LLM-as-judge call on something a regex can verify.

Examples: format validation, tool-call accuracy, presence of required citations, refusal detection, cost-per-task thresholds, schema conformance.
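A few of the checks named above can be sketched as plain Python functions. This is a minimal illustration, not a prescribed implementation: the function names, the dollar-amount regex, and the idea that tool calls arrive as a list of tool names are all assumptions made for the example.

```python
import json
import re


def is_valid_json_object(output: str) -> bool:
    """Pass if the whole output parses as a JSON object (not a list or scalar)."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def amount_in_range(output: str, low: float, high: float) -> bool:
    """Pass if the first dollar amount found in the output lies within [low, high]."""
    match = re.search(r"\$(\d[\d,]*(?:\.\d+)?)", output)
    if match is None:
        return False
    amount = float(match.group(1).replace(",", ""))
    return low <= amount <= high


def called_expected_tool(tool_calls: list[str], expected: str) -> bool:
    """Pass if the agent invoked the expected tool at least once.

    Assumes tool calls were logged as a list of tool names (hypothetical format).
    """
    return expected in tool_calls


def within_latency_sla(latency_ms: float, sla_ms: float) -> bool:
    """Pass if the measured latency is at or under the SLA."""
    return latency_ms <= sla_ms
```

Each function returns a plain boolean, so the results can be aggregated into a pass rate without any model call.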

2. LLM-as-judge evaluators

You use a language model to score an output against a rubric. The judge model reads the input, the output (and optionally a reference), and returns a score plus a rationale: "Is this answer grounded in the provided context? Score 1-5 with reasoning."

Use when: the criterion is subjective or semantic — correctness, helpfulness, tone, groundedness, coherence — and cannot be reduced to a deterministic rule.

LLM-as-judge is the most powerful and the most dangerous evaluator type. Powerful because it can score things no regex can. Dangerous because an uncalibrated judge produces confident, wrong, expensive scores. Calibration is covered below — it is the part most teams skip and the part that matters most.
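As a rough sketch, a judge evaluator is a prompt builder, a model call, and a strict parser. The example below assumes the model is reached through a caller-supplied `call_model` function (a hypothetical stand-in for whatever client you use) and that the judge is asked to reply in JSON; neither detail comes from a specific library.

```python
import json
from typing import Callable

# Rubric text taken from the example question above; the JSON reply
# format is an assumption made for this sketch.
RUBRIC = (
    "Is this answer grounded in the provided context? "
    "Score 1-5 with reasoning. "
    'Reply with JSON: {"score": <1-5>, "reasoning": "<text>"}'
)


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Assemble the text the judge model will read."""
    return (
        f"{RUBRIC}\n\n"
        f"Question:\n{question}\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n"
    )


def judge(
    question: str,
    context: str,
    answer: str,
    call_model: Callable[[str], str],
) -> dict:
    """Score one answer. Returns score=None when the reply cannot be trusted."""
    raw = call_model(build_judge_prompt(question, context, answer))
    try:
        parsed = json.loads(raw)
        score = int(parsed["score"])
    except (ValueError, KeyError, TypeError):
        return {"score": None, "reasoning": "", "error": "unparseable judge output"}
    if not 1 <= score <= 5:
        return {"score": None, "reasoning": "", "error": "score out of range"}
    return {
        "score": score,
        "reasoning": str(parsed.get("reasoning", "")),
        "error": None,
    }
```

Rejecting malformed or out-of-range replies, rather than coercing them into a number, keeps parsing failures visible instead of letting them leak into the averages.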

3. Human review evaluators

A human scores the output. The ground truth. Slow, expensive, does not scale — and irreplaceable, because it is the only thing your LLM-as-judge can be calibrated against.

Use when: establishing ground truth for calibration, scoring high-stakes outputs (legal, medical, financial), or auditing your automated evaluators to confirm they still correlate with human judgment.

The mistake teams make is treating human review as a permanent bottleneck instead of a calibration tool. You do not human-review every output forever. You human-review enough to calibrate an LLM-as-judge, then let the judge scale while you periodically re-audit it against fresh human labels.

4. Pairwise comparison evaluators

Instead of scoring one output in isolation, you compare two outputs (from two agent versions, or two prompts) and pick the better one. "Version A vs Version B — which answer is better, and why?"

Use when: absolute scores are noisy but relative judgments are reliable, which is the case for most subjective criteria. Humans (and LLM judges) are far more consistent at "A is better than B" than at "A is a 7 out of 10." Pairwise is the most reliable way to know whether a change actually improved your agent.
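A pairwise run can be sketched as a loop that tallies wins per version. In the example below, `pick_better` is a hypothetical callable (a human or an LLM judge behind it) that returns "first", "second", or "tie". Showing each pair in both orders and counting a win only when the two verdicts agree is an extra safeguard against order effects that this sketch adds; it is not something the text above prescribes.

```python
from collections import Counter
from typing import Callable


def compare_versions(
    inputs: list[str],
    outputs_a: list[str],
    outputs_b: list[str],
    pick_better: Callable[[str, str, str], str],
) -> dict:
    """Tally pairwise wins for version A and version B over the same inputs."""
    tally = Counter()
    for prompt, a, b in zip(inputs, outputs_a, outputs_b):
        a_first = pick_better(prompt, a, b)  # A shown in the first slot
        b_first = pick_better(prompt, b, a)  # B shown in the first slot
        if a_first == "first" and b_first == "second":
            tally["a"] += 1  # A preferred in both orders
        elif a_first == "second" and b_first == "first":
            tally["b"] += 1  # B preferred in both orders
        else:
            tally["tie"] += 1  # explicit tie, or verdict flipped with order
    decided = tally["a"] + tally["b"]
    return {
        "a_wins": tally["a"],
        "b_wins": tally["b"],
        "ties": tally["tie"],
        # Share of decided comparisons won by B; None if nothing was decided.
        "b_win_rate": tally["b"] / decided if decided else None,
    }
```

A `b_win_rate` well above 0.5 over enough examples suggests version B is an improvement; a rate near 0.5 suggests the change made no measurable difference.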

How to calibrate an LLM-as-judge (the part everyone skips)

An LLM-as-judge that has not been calibrated against human judgment is not an evaluator. It is a random number generator that happens to output numbers in the right range. Here is the calibration process:

Step 1: Collect human labels

Take 50-100 examples. Have a human (ideally a domain expert) score each one against the exact rubric you will give the judge. These human scores are your ground truth.

Step 2: Run the judge on the same examples

Give the judge the identical rubric. Run it against the same 50-100 examples. Now you have two scores per example: human and judge.

Step 3: Measure agreement

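Quantify how closely the judge tracks the human labels. For categorical judgments (pass/fail), use the raw agreement rate or Cohen's kappa, which corrects for the agreement two raters would reach by chance. For numeric scores, use a correlation coefficient; a rank correlation such as Spearman's suits ordinal 1-5 rubric scores. If the judge agrees with humans less than ~80% of the time, the judge is not usable yet.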

Step 4: Fix the rubric, not the model

When the judge disagrees with humans, the cause is almost always a vague rubric, not a weak model. "Score the helpfulness 1-5" is unusable — what does a 3 mean? Rewrite the rubric with explicit anchors:

Score the answer's GROUNDEDNESS against the provided context:

5 = Every claim is directly supported by the context. No extrapolation.
3 = Main claims supported, but one minor detail is extrapolated beyond
    the context.
1 = One or more main claims are not present in the context (hallucination).
Use 4 or 2 when the answer falls between two anchors.

Return: {"score": <1-5>, "reasoning": "<which claims are/aren't grounded>"}

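A judge with explicit anchors and a forced rationale agrees with humans far more often than one given a vague instruction, because each score now maps to an observable property of the answer rather than to the judge's own interpretation of the scale.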

Step 5: Re-audit periodically

Models change (see what happened with Claude Opus 4.7). A judge calibrated three months ago may have drifted. Re-run the human-agreement check quarterly, and any time you change the judge model.
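The two re-audit triggers can be encoded as a small check that runs alongside your eval suite. This is only a sketch: the function name, the model identifier strings, and the approximation of "quarterly" as 91 days are assumptions for illustration.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 91 days.
QUARTER = timedelta(days=91)


def reaudit_due(
    last_audit: date,
    today: date,
    calibrated_model: str,
    current_model: str,
) -> bool:
    """True if the judge model changed or a quarter has passed since the last audit."""
    model_changed = current_model != calibrated_model
    audit_stale = today - last_audit >= QUARTER
    return model_changed or audit_stale
```

When the check returns True, re-run the human-agreement measurement from Step 3 on fresh human labels before trusting the judge's scores again.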

The evaluator selection rule

For each thing you want to measure, walk down this list and stop at the first that applies:

Can a deterministic rule check it? → code/heuristic evaluator. (Fastest, free, exact.)
Is it subjective but you need absolute scores? → calibrated LLM-as-judge.
Is it subjective and you are comparing versions? → pairwise comparison.
Is it high-stakes or needed for calibration? → human review.

Most production eval suites use all four: code rules for the objective stuff, a calibrated judge for semantic quality, pairwise for version comparison, and periodic human review to keep the judge honest.
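The selection rule translates directly into a first-match function. The `Criterion` fields below are hypothetical names for the four questions in the list, and raising an error for a criterion that matches none of them is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One thing you want to measure, described by the four questions in the rule."""
    name: str
    machine_checkable: bool = False
    needs_absolute_score: bool = False
    comparing_versions: bool = False
    high_stakes_or_calibration: bool = False


def choose_evaluator(criterion: Criterion) -> str:
    """Walk the list in order and stop at the first question that applies."""
    if criterion.machine_checkable:
        return "code/heuristic"
    if criterion.needs_absolute_score:
        return "calibrated LLM-as-judge"
    if criterion.comparing_versions:
        return "pairwise comparison"
    if criterion.high_stakes_or_calibration:
        return "human review"
    raise ValueError(f"no evaluator type matches criterion {criterion.name!r}")
```

For example, `choose_evaluator(Criterion("schema conformance", machine_checkable=True))` returns "code/heuristic". Because the walk stops at the first match, a criterion that is both subjective and high-stakes resolves to the judge here; the periodic human review that keeps the judge honest is layered on separately.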

What we do in an audit

We check whether your LLM-as-judge has ever been calibrated against human labels (usually it has not), measure its actual agreement rate, and rewrite the rubrics that are producing noise. We also catch the common waste — teams using an expensive LLM-as-judge for things a one-line code check would verify deterministically.
