Most AI Agents Aren't Production Ready

52% of enterprises say they have AI agents in production. In a separate survey of engineering leaders who already run agents in production, only 5% worry about whether those agents call the right tools. That gap tells you most of what you need to know about where the industry actually is.

The numbers come from two credible sources. Google Cloud's September 2025 ROI study surveyed 3,466 senior leaders across 24 countries and found that over half have deployed agents. Cleanlab's August 2025 survey of 95 engineering leaders with agents live in production found that only 5% are focused on tool-calling accuracy, the single capability that separates a useful agent from a chatbot with extra steps.

Key takeaways

52% of enterprises report AI agents in production, but only 5% of the engineering leaders running agents focus on tool-calling accuracy (Google Cloud 2025; Cleanlab 2025).
Only 16% of organisations run agents spanning multiple teams or systems; most are single-function assistants with an "agent" label.
Integration with existing systems, not model quality, is the top challenge (46%); observability is the top planned fix (63%).
70% of regulated enterprises rebuild their agent stack every three months or faster, which makes stability the exception.
Klarna's agent saved a reported 60 million dollars, then the company walked back AI-only support in 2025: agents without a human escalation path fail in public.

Why is there a gap between "deployed" and "reliable"?

Because "deployed" counts everything. When Google Cloud reports that 52% of enterprises have agents deployed, that number includes a support bot answering FAQs and a multi-step workflow processing financial transactions. Those are not the same thing.

The 2026 State of AI Agents Report breaks it down: 57% of organisations run multi-step agent workflows, but only 16% have agents spanning multiple teams or systems. The rest are single-function assistants with an "agent" label. And while 80% report measurable ROI, most of the gains come from automating repetitive tasks a well-configured webhook could handle. The hard problems, the ones requiring an agent to reason about context, pick the right tool, and act safely inside regulated constraints, remain mostly unsolved.

Why do most agents break in production?

Three patterns show up across every enterprise AI survey published in the last year.

The stack will not stop moving. 70% of regulated enterprises rebuild their AI agent stack every three months or faster, according to Cleanlab. That is not iteration; it is instability. One respondent described migrating from LangChain to Azure in two months, then considering reverting.
Integration is the actual hard problem. 46% of respondents cite integration with existing systems as their primary challenge. Not model quality, not prompts. Connecting an LLM to a CRM, an internal API, or a compliance system in a way that is secure, auditable, and reliable is the bottleneck.
Nobody can see what is happening. Fewer than one in three teams are satisfied with their observability and guardrails, and 63% plan to fix it next. If you cannot trace what your agent did, why, and whether it was correct, you have a prototype with users, not a production system.
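Tracing does not have to start with a platform purchase. As a minimal sketch, assuming tools are plain Python functions, a decorator can record what was called, with which arguments, how long it took, and whether it failed. The names traced, TRACE_LOG, and lookup_order are hypothetical and exist only for illustration; a real system would ship these records to whatever observability backend the team already uses.

```python
import functools
import time
import uuid

# Hypothetical in-memory sink; a real deployment would export these records.
TRACE_LOG = []


def traced(tool_name):
    """Wrap a tool function so every call leaves a structured trace record."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "trace_id": str(uuid.uuid4()),
                "tool": tool_name,
                "arguments": {"args": list(args), "kwargs": kwargs},
                "status": "ok",
            }
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                record["result"] = result
                return result
            except Exception as exc:
                # Failures are recorded, then re-raised so callers still see them.
                record["status"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                record["duration_ms"] = round(elapsed_ms, 3)
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@traced("lookup_order")
def lookup_order(order_id):
    # Hypothetical tool: stands in for a call to an internal order API.
    if not order_id.startswith("ORD-"):
        raise ValueError(f"malformed order id: {order_id}")
    return {"order_id": order_id, "state": "shipped"}
```

The record captures what the agent did and whether the call succeeded. Whether the call was the correct one to make is a separate question that needs labelled expectations, not just logs.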

What does a production-grade agent look like?

It looks like an engineering problem solved with engineering, not prompts. At Systemartis we build agents for environments where getting it wrong has real consequences, across regulated banking, telecom at scale, and high-volume e-commerce. When we build MCP servers for agentic architectures, four requirements are non-negotiable:

Validated tool schemas. Every action the agent can take is defined by a strict API contract. If the schema does not permit it, it does not happen.
Data boundaries. The agent reaches sensitive data only through controlled MCP endpoints with proper authentication propagation. No raw database access, no ambient authority.
Human-in-the-loop gates. Actions that modify state require explicit confirmation. Data gathering is automated; decisions are not.
Audit trails. Every tool call, every response, every decision point is logged. In regulated industries, "the AI did it" is not an acceptable explanation.
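Three of those four requirements fit in a short Python sketch. This is an illustration under stated assumptions, not our production code and not the MCP SDK: Tool, dispatch, AUDIT_LOG, and the two ticket tools are hypothetical names, and the schema check is a deliberately simple type map where a real server would more likely use JSON Schema. Data boundaries and authentication propagation are left out because they depend on the deployment.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool contract: name, strict argument schema, handler."""
    name: str
    schema: Dict[str, type]          # argument name -> required type
    handler: Callable[..., Any]
    mutates_state: bool = False      # True means a human must confirm


AUDIT_LOG: List[Dict[str, Any]] = []  # stand-in for an append-only audit store


def validate(tool: Tool, args: Dict[str, Any]) -> None:
    """Raise ValueError unless args match the tool's schema exactly."""
    if set(args) != set(tool.schema):
        raise ValueError(
            f"{tool.name}: expected {sorted(tool.schema)}, got {sorted(args)}"
        )
    for key, expected in tool.schema.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{tool.name}: '{key}' must be {expected.__name__}")


def dispatch(
    registry: Dict[str, Tool],
    name: str,
    args: Dict[str, Any],
    confirm: Callable[[str, Dict[str, Any]], bool],
) -> Optional[Any]:
    """Run one tool call through the schema check, human gate, and audit log."""
    entry: Dict[str, Any] = {"tool": name, "args": args, "outcome": "rejected"}
    AUDIT_LOG.append(entry)          # logged before anything can fail
    tool = registry.get(name)
    if tool is None:
        raise PermissionError(f"unknown tool: {name}")
    validate(tool, args)
    if tool.mutates_state and not confirm(name, args):
        entry["outcome"] = "declined"
        return None
    try:
        result = tool.handler(**args)
    except Exception:
        entry["outcome"] = "error"
        raise
    entry["outcome"] = "executed"
    entry["result"] = result
    return result


# Hypothetical registry: one read-only tool, one state-changing tool.
REGISTRY = {
    "read_ticket": Tool(
        "read_ticket",
        {"ticket_id": str},
        lambda ticket_id: {"ticket_id": ticket_id, "state": "open"},
    ),
    "close_ticket": Tool(
        "close_ticket",
        {"ticket_id": str},
        lambda ticket_id: {"ticket_id": ticket_id, "state": "closed"},
        mutates_state=True,
    ),
}
```

A call such as dispatch(REGISTRY, "close_ticket", {"ticket_id": "T-1"}, confirm=ask_operator) runs only if the hypothetical ask_operator callback returns True. The design choice that matters is the order: the audit entry is written first, so rejected and declined calls leave the same trail as executed ones.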

We applied these when building a logistics agent for an e-commerce client. It queries courier APIs and updates customers proactively over WhatsApp and Telegram, handling roughly 40% of "where is my order?" tickets that used to need a human. It works because its authority is scoped precisely: it can read shipment status and send templated updates. It cannot modify orders, issue refunds, or touch anything outside that scope. The difference between this and a demo is not sophistication. It is discipline.

What is the lesson from Klarna?

That an agent without boundaries will eventually hit a case it cannot handle, and the failure will be public. Klarna's AI handled two-thirds of customer inquiries across 150 million users, equivalent to 853 full-time staff, and dropped resolution time from 11 minutes to under 2 while reportedly saving 60 million dollars.

Then in May 2025 Klarna's CEO admitted the company had gone too far with AI-only service and announced a return to hybrid human and AI support. The lesson is not that agents fail. It is that an agent with no human escalation path turns an edge case into a headline. Klarna could absorb it. Most companies cannot.

How do you build agents that survive contact with reality?

Start with the failure mode, not the happy path. The organisations getting real value share a few traits that have nothing to do with model selection:

They ask what happens when the agent calls the wrong tool or exceeds its authority, before they deploy.
They treat MCP as an architectural pattern, not a buzzword: a server that enforces what the agent can ask and do, so the agent never talks to the database directly.
They measure reliability, not just capability. Capability is "can the agent do this task?" Reliability is "does it do this correctly almost every time, and what happens the rest of the time?"
They build for quarterly churn, because tight coupling to any single framework or provider is a liability when 70% of regulated enterprises rebuild their stack every three months or faster.
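The capability versus reliability distinction is easy to turn into a number. The sketch below, in Python, replays a small labelled set of prompts several times and reports how often the agent picked the expected tool, plus which prompts failed at least once. Everything named here is an assumption made for illustration: Case, measure_reliability, and keyword_router are hypothetical, and keyword_router is a deterministic stand-in for a real agent's tool selection, which would normally vary between runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Case:
    prompt: str
    expected_tool: str


def measure_reliability(
    cases: Sequence[Case],
    choose_tool: Callable[[str], str],
    runs_per_case: int = 5,
) -> Tuple[float, List[str]]:
    """Return (share of correct tool choices, prompts that failed at least once)."""
    if not cases or runs_per_case < 1:
        raise ValueError("need at least one case and one run per case")
    correct = 0
    unreliable: List[str] = []
    for case in cases:
        hits = sum(
            choose_tool(case.prompt) == case.expected_tool
            for _ in range(runs_per_case)
        )
        correct += hits
        if hits < runs_per_case:
            unreliable.append(case.prompt)
    return correct / (len(cases) * runs_per_case), unreliable


def keyword_router(prompt: str) -> str:
    # Hypothetical stand-in for the agent's tool selection step.
    text = prompt.lower()
    if "refund" in text:
        return "escalate_to_human"
    if "where" in text:
        return "get_shipment_status"
    return "answer_faq"


CASES = [
    Case("Where is my order?", "get_shipment_status"),
    Case("I want a refund", "escalate_to_human"),
    Case("What are your opening hours?", "answer_faq"),
    Case("My parcel has not arrived", "get_shipment_status"),
]

accuracy, unreliable = measure_reliability(CASES, keyword_router)
print(f"tool-call accuracy: {accuracy:.2f}")
print("needs attention:", unreliable)
```

The list of failing prompts is the more useful output. It answers the second half of the reliability question: what happens the rest of the time, and on which inputs.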

The companies that lead will not be the ones deploying the most agents. They will be the ones deploying agents that work when it matters.
