Failure mode library

Claude Opus 4.7 quietly raised your token bill ~35% — the migration audit nobody is doing

Anthropic shipped a tokenizer change in Opus 4.7 (April 16, 2026) that makes the same text consume ~35% more billable tokens. Most production agents migrated automatically. Here is how to measure your actual cost delta, the 4 known regressions, and the rollback-or-stay decision framework.

·Anthropic · Claude Agent SDK · Claude API

The symptom

You upgraded your production agent from Claude Opus 4.6 to Opus 4.7 sometime after April 16, 2026. The model is slightly slower, your token bill is noticeably higher, and at least one of these is true:

  • Long-context tasks (research, document synthesis, codebase reasoning) feel worse than they used to
  • Your eval pass rate dipped 2-5 points and your team blamed prompt drift
  • Your token spend went up but you cannot explain why

You did not change your prompts. You did not change your retrieval. You did not change your agent loop. The cost went up anyway.

What is actually happening

Anthropic shipped Claude Opus 4.7 on April 16, 2026 with a tokenizer change that affects how text is split into billable units. The same English prose now consumes roughly 35% more tokens on average. The exact ratio varies by content type — code is closer to flat, structured English text is closer to +35%, and certain non-English content is worse.

This is not a price-list change. The per-million-token rate is identical. The number of tokens consumed per request changed. Your bill went up because Anthropic re-counted what a token is.

If your agent was migrated to 4.7 by default (which most were — the API alias claude-opus-4 resolves to the latest), the cost increase happened the day Anthropic flipped the alias. No deployment on your end was required for your bill to rise.

The 3 things people try first that do not fix it

  1. Blame your team for prompt bloat. The prompts did not get longer. The token counter changed.
  2. Switch to Sonnet to save money. This works for some workloads but the capability drop is real. You are solving the wrong problem.
  3. Crank the temperature down to "save tokens." Temperature does not affect token count. This does nothing.

The 4 actual issues

Issue 1: Tokenizer rewrite — confirmed +~35% on average English

The tokenizer change is documented in Anthropic's official changelog. The practical effect: the average billable token count per request rose by about 35% for English prose, with the exact number depending on content shape. Reviews of Opus 4.7 from independent sources (including mindstudio.ai's detailed comparison) confirm the same range.

How to measure your specific delta:

  1. Pick 20 representative production prompts from your traces (last 7 days, sampled across user types)
  2. For each, count input tokens using the 4.6 tokenizer endpoint (or the local anthropic Python SDK with model="claude-opus-4-6" and the count_tokens method)
  3. Count input tokens using the 4.7 tokenizer endpoint (same call, model="claude-opus-4-7")
  4. Compute the ratio tokens_4_7 / tokens_4_6 for each prompt
  5. Average across the 20

A ratio above 1.3 means you are paying meaningfully more for the same prompts. A ratio above 1.4 means something specific to your prompt shape is amplifying the tokenizer change — typically lots of structured English narrative, JSON-with-prose, or markdown.

import anthropic

client = anthropic.Anthropic()

def cost_delta(prompts: list[str]) -> float:
    deltas = []
    for p in prompts:
        old = client.messages.count_tokens(
            model="claude-opus-4-6",
            messages=[{"role": "user", "content": p}],
        ).input_tokens
        new = client.messages.count_tokens(
            model="claude-opus-4-7",
            messages=[{"role": "user", "content": p}],
        ).input_tokens
        deltas.append(new / old)
    return sum(deltas) / len(deltas)

If cost_delta returns 1.35 across 20 prompts, your token bill is up ~35% on the input side alone. Output tokens are usually unchanged for the same task.

Issue 2: Instruction-following regression on multi-step prompts

Multiple independent reviews report a measurable decline in 4.7's ability to follow complex multi-step instructions versus 4.6. The pattern: a 6-step instruction list in the system prompt that 4.6 followed faithfully is now followed for steps 1-4 and skipped or reordered for steps 5-6.

Diagnostic: if your agent uses a system prompt with numbered instructions or a structured planner, run your eval suite against both 4.6 and 4.7 with no other changes. If pass rate drops on the multi-step tasks specifically, you have this regression.

Workaround: break the multi-step instruction into separate calls in your agent loop, with one instruction per call. This is more API calls and therefore more cost, but each call is small enough for 4.7 to handle reliably.

Issue 3: MRCR (multi-round context retrieval) drop on long contexts

Long-context retrieval — the ability to pull a specific fact from a 100k-token context window — is documented as weaker in 4.7 than in 4.6. The decline is small (single-digit percentage on most reviewer benchmarks) but consistent.

Why it matters for agents: if your agent stuffs retrieval chunks into the context window and asks Claude to ground its answer in those chunks, the grounding is less reliable in 4.7. You will see more answers that ignore the retrieved chunk and fall back to the model's prior. Your "hallucination rate" goes up even though retrieval is unchanged.

Diagnostic: sample 50 production responses. For each, ask: was the answer grounded in the retrieved chunk, or did the model extrapolate? If extrapolation rate is up post-4.7, you have this regression.

Workaround: move grounded retrieval to a more aggressive structure — explicit citation requirements in the system prompt, refusal on un-citable claims, or a separate verification step that checks each claim against retrieval.

Issue 4: Web research / search-mode quality decline

Reviewers report 4.7's web-research mode (when invoked via the search tool) produces shallower and sometimes less accurate summaries than 4.6 did. The effect is most visible on niche or technical queries.

If your agent uses Anthropic's built-in web search tool: test 4.6 vs 4.7 on 10 of your typical search-requiring queries. If 4.7 produces shorter or less specific results, you have this regression.

Workaround: if your workload depends on high-quality search results, consider routing search-mode calls to Opus 4.6 (still available via explicit model version pin) while keeping 4.7 for non-search tasks.

The migration audit — full procedure

Run this once on your production traffic to know your exposure:

Step 1: Pin the model version explicitly

Stop using the floating alias. In every Anthropic SDK call across your codebase, replace model="claude-opus-4" with an explicit version:

# Before — floats automatically to whatever is current
model="claude-opus-4"

# After — pins to the version you tested with
model="claude-opus-4-7-20260416"

This is the single most important change. Floating aliases mean Anthropic can change your cost basis and behavior without your deployment touching anything. Pinning gives you control over when you migrate.

Step 2: Measure your cost delta

Run the cost_delta script above against 20-50 production prompts. If the average is above 1.25, your migration was non-trivial. Document the number.

Step 3: Run your eval suite against both versions

Re-run your eval set against 4.6 and 4.7 in parallel. Compare:

  • Pass rate
  • Cost per task (which now includes the tokenizer change)
  • Latency

You now have a defensible decision matrix.

Step 4: Decide: roll back, stay, or split

The three legitimate decisions:

Roll back to 4.6 if:

  • Cost delta is above 1.3 AND
  • Eval pass rate dropped 2+ points AND
  • You do not need any 4.7-specific capability your eval set has not tested

Stay on 4.7 if:

  • Cost delta is below 1.2 OR
  • Eval pass rate is flat or improved AND
  • The 4.7 capabilities (reasoning effort tuning, new tools) justify the cost

Split — route by workload if:

  • Cost delta is large but uneven by task type
  • Some workloads improved on 4.7 and some regressed

Example split routing:

def pick_model(task_type: str) -> str:
    if task_type in ("long_context_retrieval", "multi_step_planning"):
        return "claude-opus-4-6-20251015"  # known better on these
    if task_type in ("code_generation", "structured_output"):
        return "claude-opus-4-7-20260416"  # roughly flat or better
    return "claude-opus-4-6-20251015"  # default to lower cost

What "good" looks like after the audit

A well-handled Claude migration ends with:

  • Every API call pins to an explicit model version (no floating aliases)
  • A documented before/after cost delta from real production traces
  • A decision matrix mapping workload types to model versions
  • Eval suite runs against both versions automatically on every deploy
  • Spend alerts wired to catch future silent migrations

If your team migrated to 4.7 automatically and you have not done this work, you are now paying ~35% more for an agent that is measurably worse at some of the things 4.6 was good at. That is not a hypothetical — it is the default outcome for anyone using floating aliases.

When to get help

If your team does not have time to run this audit, or you discover the cost delta is severe and you need help routing workloads correctly, this is exactly what we do. Free 30-min audit + written report in 24 hours — we measure your actual delta, run the regression evals, and tell you what to do.

The migration audit takes us about 4 hours. Most teams discover their bill went up by an amount that justifies the engagement many times over.

Stuck on this exact failure?

Hitting this exact failure? Skip the debugging.

Book a free 30-minute call. We scope where your agent is breaking and map the fix. Then we commit to the result and work until we hit it. No pitch deck, no obligation.