
Anthropic's Dreaming is brilliant. Putting your evals inside your model provider is not.

Anthropic shipped Dreaming, Outcomes, and managed multi-agent orchestration on May 6, 2026. The bet is that you let Anthropic own your agent's memory, evals, and orchestration. Here is why that is the wrong architecture — and the 5 reliability functions that should never live inside the vendor stack.


On May 6, 2026, Anthropic shipped three new capabilities into the Managed Agents platform:

  • Dreaming — background reasoning loops that run between user turns, building model-internal "scratchpads" the agent can recall later
  • Outcomes — Anthropic's managed eval harness that scores agent runs against developer-defined success criteria, with results visible inside the Anthropic console
  • Multi-agent orchestration — first-class support for spawning child agents from parent agents, all running inside Anthropic's infrastructure

Each of these is technically excellent. Anthropic shipped the cleanest implementation of each pattern that currently exists. The capabilities themselves are not the problem.

The problem is the architecture they implicitly recommend.

What Anthropic is asking you to adopt

The pitch in the Anthropic announcement and the follow-up coverage is straightforward: give Anthropic ownership of your agent's memory layer, your eval pipeline, and your multi-agent orchestration. In exchange, you get capabilities that are hard to build yourself, optimized for Claude models, with a single integration surface.

For a small team shipping its first agent, that bet looks good. For a production AI system that you need to operate for the next five years, it is the same bet that broke a generation of companies that built their entire stack on the Twitter API in 2012, on Facebook's graph in 2014, on Heroku in 2017, or on iOS push notifications in 2021.

The pattern is the same every time: a platform offers capabilities so good you adopt them, then changes the rules after you cannot leave.

This is not paranoia. It is operational realism.

Three specific concerns about putting evals inside your model vendor

1. Conflict of interest

When your eval harness is run by the same company that sells you the model, the eval results are no longer independent measurements. They are marketing.

When Anthropic ships Claude Opus 4.8 and its internal Outcomes evals show 4.8 beating 4.7 by 6 points on your workload, you cannot tell whether 4.8 is actually better or whether the eval suite was tuned to favor 4.8. Both can be true. The vendor has no incentive to surface the case where their newer model is worse for your specific workload — that case quietly disappears from the leaderboard.

The whole point of evals is to give you ground truth about your agent's behavior. Ground truth requires independence. Hosting evals inside your model vendor is the same category error as letting OpenAI's gpt-5-judge benchmark gpt-5.

2. Migration cost ratchet

The more of your reliability stack lives inside Anthropic's infrastructure, the more expensive it becomes to evaluate switching models, providers, or architectures. Once your golden dataset of 500 evaluated trajectories lives in Anthropic's Outcomes platform, moving it to Braintrust or Langfuse takes engineering time you do not have.

Lock-in is not a contract clause. It is the cumulative weight of switching costs. Every additional capability you delegate to Anthropic raises the wall.

The right question is not "will Anthropic ever turn evil?" The right question is "if Anthropic's pricing doubles in 2027 or they discontinue Outcomes the way Google discontinues APIs, can I be on another provider within 30 days?" If the honest answer is no, you have already lost the negotiation that has not happened yet.

3. Dreaming makes audit harder, not easier

Background reasoning loops that build internal scratchpads are powerful. They are also opaque. When your agent does something unexpected and you trace back through the logs, the relevant context is no longer in your transcript — it is in a Dreaming buffer maintained inside Anthropic's infrastructure that you cannot fully inspect.

This is the same problem long-term memory systems have always had: the more state lives outside your observability stack, the harder it is to audit. Dreaming is a sophisticated, well-implemented version of that problem.

For a production agent in a regulated industry (legal, healthcare, fintech), "I do not have full visibility into what my agent was reasoning about when it made this decision" is not an acceptable answer to a compliance audit.

The 5 reliability functions that should never live inside the vendor stack

These are the functions where vendor-neutrality matters most. Build or contract them outside Anthropic, OpenAI, or any single model provider:

1. Your golden dataset

The set of input/output pairs you use to score your agent. Versioned in your repo or your S3 bucket. The format must be vendor-neutral (JSONL, Parquet, or similar). When you switch eval tools or model providers, the dataset comes with you.

Anthropic Outcomes is fine for running evals against this dataset. It is not fine for owning the dataset.
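As a rough illustration of what "vendor-neutral" can mean in practice, here is a minimal sketch of a golden dataset stored as JSONL. The GoldenCase fields (case_id, input, expected) and the helper names are assumptions made for this example, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass(frozen=True)
class GoldenCase:
    """One golden example. Field names are illustrative, not a standard."""
    case_id: str
    input: str
    expected: str


def save_cases(path: Path, cases: list[GoldenCase]) -> None:
    # One JSON object per line keeps diffs reviewable in version control.
    with path.open("w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(asdict(case), sort_keys=True) + "\n")


def load_cases(path: Path) -> list[GoldenCase]:
    # Blank lines are skipped so hand-edited files still load.
    with path.open(encoding="utf-8") as f:
        return [GoldenCase(**json.loads(line)) for line in f if line.strip()]
```

Because the file is plain JSONL in your own repo or bucket, any eval tool that can read lines of JSON can consume it, including a managed one.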

2. Your eval scoring logic

The functions that turn an agent run into a pass/fail score. These are code in your repo. They might call Anthropic to use Claude as a judge, but the orchestration of the judging — the prompt, the rubric, the aggregation — lives outside the vendor.

If Anthropic ships a "smart eval mode" that promises to auto-score everything, that is the warning sign, not the feature. Hand-rolled scoring rubrics are part of your moat, not technical debt.
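One way to keep the judging orchestration in your own code is to treat the judge model as an injected function. The sketch below assumes a hypothetical rubric and a judge callable that takes a prompt and returns raw text; a real version would wrap whichever model API you use inside that callable.

```python
from typing import Callable

# Hypothetical rubric: the wording and criteria are placeholders.
RUBRIC = (
    "Answer PASS if the response resolves the request and matches the "
    "expected outcome; otherwise answer FAIL."
)

# A judge takes a prompt and returns the model's raw text verdict.
Judge = Callable[[str], str]


def build_judge_prompt(user_input: str, expected: str, actual: str) -> str:
    return (
        f"{RUBRIC}\n\nRequest: {user_input}\n"
        f"Expected: {expected}\nActual: {actual}\nVerdict:"
    )


def score_case(judge: Judge, user_input: str, expected: str, actual: str) -> bool:
    verdict = judge(build_judge_prompt(user_input, expected, actual))
    # Parse strictly: anything that is not an explicit PASS counts as a fail.
    return verdict.strip().upper().startswith("PASS")


def pass_rate(results: list[bool]) -> float:
    """Aggregate per-case verdicts into a percentage between 0 and 100."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

Swapping the judge model then means passing a different callable; the prompt, rubric, and aggregation stay untouched.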

3. Your trajectory observability

The traces of what your agent did, step by step. Every tool call, every retrieval, every model response. Stored in a vendor-neutral observability tool: Langfuse, Braintrust, LangSmith, Helicone, or an OpenTelemetry-based stack. Not in the model vendor's console.

Anthropic's own console is great for debugging individual Claude calls. It is not great for tracing a multi-step agent run that crosses tools, retrieval, and a model — because the model vendor only sees their own slice of that trace.
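To make "every tool call, every retrieval, every model response" concrete, here is a minimal sketch of a trajectory recorder that you own. The Step and Trajectory classes and their fields are hypothetical; they do not reproduce the API of any of the tools named above, and a real setup would forward these records to the backend you pick.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    trace_id: str
    index: int
    kind: str          # e.g. "tool_call", "retrieval", "model_response"
    name: str
    payload: dict
    started_at: float


@dataclass
class Trajectory:
    """Collects every step of one agent run under a single trace id."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list[Step] = field(default_factory=list)

    def record(self, kind: str, name: str, payload: dict) -> Step:
        step = Step(self.trace_id, len(self.steps), kind, name, payload, time.time())
        self.steps.append(step)
        return step

    def to_jsonl(self) -> str:
        # Plain JSON lines can be shipped to whichever backend you choose.
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self.steps)
```

The point of the shared trace_id is that tool calls, retrievals, and model responses from different systems end up in one ordered record.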

4. Your regression testing

The CI step that fails your build if a code change drops the eval pass rate by 2 or more points. This must live in your CI system (GitHub Actions, CircleCI, Buildkite) and use your eval scoring logic from #2 against your golden dataset from #1.

If your regression testing is a feature inside Anthropic Outcomes, you cannot independently verify that a model upgrade did not regress your agent. The vendor is auditing themselves.
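The gate itself can be very small. This sketch assumes you already have two pass rates in percentage points, one from the baseline and one from the candidate change, both produced by your own scoring code; the function names are illustrative, and the 2-point threshold is the one mentioned above.

```python
MAX_DROP_POINTS = 2.0  # fail the build on a drop of 2 or more points


def regression_gate(baseline: float, current: float,
                    max_drop: float = MAX_DROP_POINTS) -> bool:
    """Return True when the build may pass, False when it must fail."""
    return (baseline - current) < max_drop


def exit_code(baseline: float, current: float) -> int:
    """Map the gate result to a process exit code for the CI step."""
    if regression_gate(baseline, current):
        print(f"OK: pass rate {current:.1f} (baseline {baseline:.1f})")
        return 0
    print(f"FAIL: pass rate fell from {baseline:.1f} to {current:.1f}")
    return 1  # a non-zero exit code fails the CI step
```

A CI job would call exit_code with the stored baseline and the freshly computed rate, then pass the result to sys.exit. Improvements always pass because the drop is negative.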

5. Your model selection logic

The code path that decides which model handles which request. Today this might always be Claude. Tomorrow it might route based on cost, latency, capability, or compliance requirements. This selection logic must be your code, not a configuration option inside a single vendor's platform.

The team that hardcoded model = "claude-opus-4-7" in 50 places is the team that cannot evaluate moving 30% of low-priority traffic to a cheaper model when the bill comes in. Keep the routing layer thin, explicit, and yours.
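A thin routing layer can be a single function that every call site goes through. In this sketch the Request fields and the two placeholder model names are assumptions for illustration; only "claude-opus-4-7" comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    priority: str                  # "high" or "low"; labels are illustrative
    needs_compliance: bool = False


DEFAULT_MODEL = "claude-opus-4-7"
# Placeholder identifiers, not real model names.
CHEAP_MODEL = "cheaper-model-placeholder"
COMPLIANT_MODEL = "compliance-approved-placeholder"


def select_model(req: Request) -> str:
    """The single place where the model choice is made."""
    if req.needs_compliance:
        return COMPLIANT_MODEL
    if req.priority == "low":
        return CHEAP_MODEL
    return DEFAULT_MODEL
```

With this in place, moving low-priority traffic to another model is a one-line change that you can evaluate against your golden dataset before shipping.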

What "good vendor-neutral architecture" looks like

A production agent set up for long-term operational health has:

  • Model calls behind a thin abstraction layer that supports at least 2 vendors (e.g., LiteLLM, Vercel AI SDK, or your own provider interface)
  • Evals running on your CI, against a golden dataset in your repo, using scoring logic in your code
  • Traces flowing into a vendor-neutral observability stack (Langfuse / Braintrust / LangSmith — pick one, but pick one that is not your model vendor)
  • No business logic dependent on vendor-specific managed features like Dreaming or Outcomes. Use them as accelerators if helpful, but not as load-bearing structure.
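The first bullet above, a thin abstraction over at least two vendors, might look roughly like the following if you write your own provider interface. ChatProvider, EchoProvider, and ModelClient are hypothetical names; EchoProvider is a stand-in so the sketch runs without any vendor SDK, and real adapters would wrap the actual SDK calls.

```python
from typing import Protocol


class ChatProvider(Protocol):
    """Hypothetical interface; real SDK calls would live inside each adapter."""
    name: str

    def complete(self, model: str, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in adapter used here instead of a real vendor SDK."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, model: str, prompt: str) -> str:
        return f"[{self.name}:{model}] {prompt}"


class ModelClient:
    """Application code talks to this class, never to a vendor SDK directly."""

    def __init__(self, providers: dict[str, ChatProvider]) -> None:
        self._providers = providers

    def complete(self, provider: str, model: str, prompt: str) -> str:
        if provider not in self._providers:
            raise KeyError(f"unknown provider: {provider}")
        return self._providers[provider].complete(model, prompt)
```

Adding a second vendor then means writing one more adapter class and registering it, with no changes to the code that calls ModelClient.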

This setup costs more in week 1 and saves you in year 2.

When managed capabilities are still the right call

Vendor lock-in is not always bad. It is a trade-off. Pick managed capabilities when:

  • You are pre-product-market-fit and need to ship fast
  • The capability is genuinely hard to build (e.g., low-level model serving, GPU orchestration)
  • The lock-in cost is bounded and reversible
  • You have a budget line item for "rebuild on new platform" within 18 months if needed

The Anthropic announcement is genuinely compelling for early-stage teams that need to ship a multi-agent system this quarter. It is not compelling for an enterprise running a production agent that needs to last five years through three model vendors, two compliance regimes, and one inevitable pricing change.

The honest summary

Anthropic shipped excellent capabilities on May 6. Use them as tools, not as foundation. Keep your evals, your trajectory observability, your golden dataset, your regression testing, and your model routing logic outside any single vendor's stack.

The reliability work that fixmyagent.agency does is built around this principle. Every engagement we deliver leaves you with eval suites, monitoring, and routing logic that survive a vendor switch — because we have seen too many teams discover too late that the platform they bet on has changed the rules.

Free 30-min audit + written report in 24 hours — we will tell you which parts of your current stack are vendor-locked and what to do about it. Book at fixmyagent.agency.

— Moazzam Qureshi, Founder, fixmyagent.agency
