Why 74% of AI customer agents get rolled back — and what the 26% do differently

Sinch's May 2026 'AI Production Paradox' report found 74% of enterprises have rolled back a live AI customer service agent. Here are the four reasons it happens, drawn from production audits — and the practices the remaining 26% have in common.

May 19, 2026 · Production AI agents · Customer-facing

Sinch published the AI Production Paradox report on May 13, 2026. The headline number is the one everyone is quoting: 74% of enterprises rolled back a live AI customer service agent at some point in the last 18 months. Sixty-two percent currently have one in production. Separately, Gartner predicted in June 2025 that over 40% of agentic AI projects will be cancelled by the end of 2027.

These numbers are bad. They are also misleading on their own. The interesting question is not why so many fail — it is what the 26% that ship and stay in production are doing that the other 74% are not.

We audit production AI agents for a living. The patterns are remarkably consistent across the rollbacks we have diagnosed. Four causes account for almost all of them. None of them is a model-quality problem. All four are fixable.

1. The agent was evaluated on the wrong thing

The most common failure is also the least visible. The team builds an eval set, the eval set passes, and the agent ships. Real users break it within a week. Support tickets double instead of halving.

What the eval set actually measured: whether the agent produces a plausible-looking answer to a well-formed question. What real customers do: ask malformed, multi-part, emotionally loaded, off-topic questions. The eval set never had those.

The 26% rebuild their eval set from production traces, not from imagination. They sample real conversations, label outcomes, and run the eval on the ugly stuff. The agent that passes that eval is the one that ships.

Diagnostic: pull 50 random transcripts from your last week. Have a human rate each outcome (resolved / wrong / harmful / escalated). If your offline eval scores do not predict that distribution to within ±10 percentage points per category, your eval is not measuring the right thing.
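A minimal sketch of that comparison, assuming the human ratings and the offline eval's predicted outcomes are both available as lists of the four labels above. The function names are hypothetical, and reading "±10%" as ten percentage points per category is an assumption rather than something the diagnostic spells out.

```python
from collections import Counter

# The four outcome labels used in the diagnostic above.
OUTCOMES = ("resolved", "wrong", "harmful", "escalated")


def outcome_distribution(labels):
    """Return the share of each outcome category in a list of labels."""
    if not labels:
        raise ValueError("need at least one label")
    unknown = set(labels) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {outcome: counts[outcome] / len(labels) for outcome in OUTCOMES}


def eval_matches_production(human_labels, eval_labels, tolerance=0.10):
    """Compare the two distributions category by category.

    Assumption: 'within ±10%' means each category's share may differ by at
    most `tolerance` (0.10 = ten percentage points).
    Returns (ok, gaps), where gaps maps each outcome to its absolute gap.
    """
    human = outcome_distribution(human_labels)
    predicted = outcome_distribution(eval_labels)
    gaps = {o: abs(human[o] - predicted[o]) for o in OUTCOMES}
    return all(gap <= tolerance for gap in gaps.values()), gaps
```

If the check fails, the per-category gaps show where the offline eval is too optimistic. That is usually a more useful signal than a single pass/fail.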

2. The agent has no idea when to refuse

The rollbacks that make the news — Air Canada's chatbot inventing a refund policy, Cursor's support agent fabricating product limits — all share one cause. The agent did not know it was supposed to say no.

By default, a language model trained on the internet is a "helpful answer machine." It will produce an answer to almost any question, including questions it has no business answering. Production-grade agents are not built on "be helpful." They are built on strict scope plus refusal trajectories.

The 26% encode this explicitly:

Hard scope boundaries in the system prompt (Do NOT answer questions about: refunds, pricing tiers, scheduling outside business hours, …)
Few-shot examples of correct refusals, not just correct answers
A fallback route to a human, instrumented and measured

Diagnostic: ask your agent ten questions outside its scope. If it answers more than two without escalating, your scope discipline is broken.
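The ten-question probe can be scripted so it runs on every release. The sketch below assumes the agent can be wrapped as a function that takes a question and reports whether it escalated or refused. `AgentReply`, `scope_probe`, and the `escalated` flag are hypothetical names, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentReply:
    text: str
    escalated: bool  # True if the agent refused or handed off to a human


def scope_probe(
    agent: Callable[[str], AgentReply],
    out_of_scope_questions: List[str],
    max_answered: int = 2,
) -> Dict[str, object]:
    """Send out-of-scope questions and record which ones the agent answered anyway.

    The probe passes when at most `max_answered` questions were answered
    without escalation, mirroring the 'more than two' threshold above.
    """
    answered = [q for q in out_of_scope_questions if not agent(q).escalated]
    return {"answered": answered, "passed": len(answered) <= max_answered}
```

The `answered` list matters more than the pass/fail flag: each entry is a candidate for a new few-shot refusal example.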

3. The agent extrapolates beyond its retrieval

This is the failure mode most teams diagnose as "hallucination" and try to fix with a better prompt. The prompt is not the problem.

What is happening: the agent's retrieval step returns the right documents, but the answering step confidently extrapolates beyond them. The customer asks about Tier 3 pricing. The retrieval correctly returns the Tier 1 and Tier 2 pages. The model invents Tier 3 by extrapolating linearly.

The 26% solve this with two patterns:

Ground every factual claim against retrieval, not against the model's prior. Any answer that cannot be cited to a retrieved chunk gets refused.
Wire the agent to live entitlement / pricing services, not to a static documentation snapshot. The agent should not know tier rules. It should call for them.
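One way to sketch the first pattern is a gate between the answering step and the customer. It assumes the answering step emits its response as a list of claims, each tagged with the ID of the retrieved chunk it relies on. `Claim`, `grounded_answer`, and the refusal wording are hypothetical, and the gate only verifies that a citation points at a retrieved chunk, not that the chunk actually supports the claim.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical refusal message; in practice this would route to a human.
REFUSAL = "I can't confirm that from our documentation. Let me connect you with a person."


@dataclass
class Claim:
    text: str
    chunk_id: Optional[str]  # ID of the retrieved chunk cited for this claim, if any


def grounded_answer(claims: List[Claim], retrieved_ids: Set[str]) -> str:
    """Return the joined claims only if every claim cites a retrieved chunk.

    A single uncited claim, or a citation to a chunk that was never
    retrieved, turns the whole answer into a refusal.
    """
    if not claims:
        return REFUSAL
    for claim in claims:
        if claim.chunk_id is None or claim.chunk_id not in retrieved_ids:
            return REFUSAL
    return " ".join(claim.text for claim in claims)
```

In the Tier 3 example, a claim about Tier 3 pricing would have no retrieved chunk to cite, so the gate would refuse instead of letting the extrapolated price through. Checking that the cited chunk really entails the claim is a separate, harder step that this sketch leaves out.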

Diagnostic: take your last 20 wrong answers. For each, ask: was the right answer present in retrieval? If yes, the failure was extrapolation, not retrieval — and that is a different fix than your team is probably applying.
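The triage itself is a simple tally once a reviewer has answered the question for each case. A rough sketch, assuming each wrong answer is recorded as a dict with a boolean `answer_in_retrieval` field (a hypothetical name) filled in by hand:

```python
from collections import Counter
from typing import Dict, List


def triage_wrong_answers(cases: List[Dict[str, bool]]) -> Counter:
    """Count wrong answers by where the failure happened.

    'answer_in_retrieval' is a hypothetical, human-labelled field: True when
    the right answer was present in the retrieved documents.
    """
    tally = Counter()
    for case in cases:
        if case["answer_in_retrieval"]:
            # The facts were retrieved; the answering step went beyond them.
            tally["extrapolation"] += 1
        else:
            # The facts never reached the model.
            tally["retrieval"] += 1
    return tally
```

If most of the tally lands under "extrapolation", work on the retriever or the index is unlikely to move the error rate.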

4. Silent degradation. Nobody is watching the trajectory.

The agent shipped working. Two months later, it is worse — and the dashboards do not show it.

Why: traditional dashboards measure outputs (completion rate, response time, satisfaction surveys). Agent decay does not show up there until it is too late. It shows up in trajectories (the sequence of steps the agent takes and the tools it calls along the way) long before the final output looks wrong.

The 26% instrument trajectory monitoring as a first-class discipline:

Trajectory traces logged for 100% of production traffic
Per-step accuracy measured (planner selects the right next action; tool calls carry valid arguments; retrieval returns relevant chunks)
Drift alerts wired into the on-call channel (Slack, PagerDuty, Datadog) — not buried in a monthly report
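A minimal sketch of the per-step measurement and the drift check, assuming each logged trajectory is a list of `(step_type, ok)` pairs where `ok` records whether that step was judged correct. The trace shape, the function names, and the five-point `max_drop` threshold are all assumptions. Delivering the alert strings to Slack or PagerDuty is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed trace shape: one trajectory is a list of (step_type, ok) pairs,
# e.g. [("planner", True), ("tool_call", False), ("retrieval", True)].
Trajectory = List[Tuple[str, bool]]


def per_step_accuracy(traces: Iterable[Trajectory]) -> Dict[str, float]:
    """Return the fraction of correct steps for each step type."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for trace in traces:
        for step_type, ok in trace:
            totals[step_type] += 1
            passed[step_type] += int(ok)
    return {step: passed[step] / totals[step] for step in totals}


def drift_alerts(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.05,
) -> List[str]:
    """List step types whose accuracy fell more than `max_drop` below baseline."""
    alerts = []
    for step, base in baseline.items():
        now = current.get(step)
        if now is None:
            alerts.append(f"{step}: no data in current window")
        elif base - now > max_drop:
            alerts.append(f"{step}: accuracy fell from {base:.0%} to {now:.0%}")
    return alerts
```

A typical setup would compute the baseline from a period when the agent was known to be healthy and rerun the comparison on a rolling window. A step that disappears from the traces entirely is reported too, since a missing step is itself a trajectory change.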

Diagnostic: if your team would learn about a 20% degradation in agent quality from a customer complaint or invoice, not from a dashboard or alert, you are operating without trajectory monitoring. That is the single biggest gap between the 74% and the 26%.

The pattern under all four

Read the four failures back. They share a structure:

The 74% measure outputs. The 26% measure trajectories.
The 74% rely on prompt engineering to fix things post-hoc. The 26% rely on evals + grounding + scope discipline to prevent them.
The 74% find out about problems from customers. The 26% find out from instrumentation.

This is not a model problem. Switching from GPT-4 to Claude 4.7 to Gemini 3 Pro does not fix any of these. They are engineering problems. They have known solutions. The 26% have done that engineering. The 74% have not.

What to do if you are in the 74%

Three actions, in order of leverage:

Pull last week's production traces and rebuild your eval set from them. This is the single highest-ROI move. If you only do one thing, do this.
Audit your refusal trajectories. Test ten out-of-scope questions and one adversarial prompt. Count the failures.
Add trajectory tracing to your observability stack. Langfuse, Braintrust, LangSmith, Helicone — they all do this, free or low-cost. Wire one in this week.

These three steps will not solve every reliability problem. They will solve enough that you stop being on the wrong side of the 26/74 split.

A note on what the Sinch report actually says

The report is worth reading in full — it includes survey data on why enterprises pulled their agents back, not just how many did. The top three reasons enterprises gave for rollbacks were "lack of trust in outputs" (61%), "unclear ROI" (54%), and "escalation rate too high" (47%). All three are downstream of the four failures above. None of them are model-quality issues.

The companies still running their agents in production after 18 months are not luckier. They are not on better models. They built the engineering layer the 74% skipped.

That engineering layer is what we do.

— Moazzam Qureshi, Founder, fixmyagent.agency
