Guides

The processes we run, published openly.

In-depth, vendor-neutral guides to how we audit and harden production AI agents. These are the same processes we run on every engagement, written down so you can learn them yourself or hire us to run them on your agent.

AI Agent Evaluation: The Complete Process We Run on Every Production Agent

01 Building Evaluation Datasets for AI Agents
02 AI Agent Evaluators: LLM-as-Judge, Heuristics, Human Review, and Pairwise
03 Offline Evaluation for AI Agents: Experiments, Regression Tests, Backtesting
04 Online Evaluation and Production Monitoring for AI Agents
05 AI Agent Evaluation Criteria and Metrics That Actually Matter