Most AI Agents Aren't Production Ready

52% of enterprises say they have AI agents deployed. Only 5% worry about tool-calling accuracy. The gap between "deployed" and "reliable" is where the real work begins.

AUTHOR

Vlad

52% of enterprises say they have AI agents in production. Only 5% of engineering leaders running those agents worry about whether they call the right tools. That gap tells you everything about where the industry actually is.

The numbers come from two credible sources. Google Cloud's September 2025 ROI study surveyed 3,466 senior leaders across 24 countries and found that over half have deployed agents. Cleanlab's August 2025 survey of 95 engineering leaders with agents live in production found that almost none are focused on tool-calling accuracy, the single capability that separates a useful agent from a chatbot with extra steps.

The conclusion is uncomfortable: most "production" AI agents are deployed in name only.

The 52% Illusion
When Google Cloud reports that 52% of enterprises have agents deployed, that number includes everything from a customer support bot answering FAQs to a multi-step workflow processing financial transactions. These are not the same thing.


The 2026 State of AI Agents Report breaks this down further. 57% of organizations run multi-step agent workflows. But only 16% have agents spanning multiple teams or systems. The rest are single-function assistants with an "agent" label.


Meanwhile, 80% of respondents report measurable ROI from their agents. That sounds good until you look at what "measurable" means in practice. Most of these gains come from automating repetitive tasks that a well-configured webhook could handle. The hard problems, the ones requiring an agent to reason about context, select the right tool, and act safely within regulated constraints, remain mostly unsolved.


Why Most Agents Break in Production
Three patterns keep showing up across every enterprise AI survey published in the last year.


The stack won't stop moving. 70% of regulated enterprises rebuild their AI agent stack every three months or faster, according to Cleanlab's research. That is not iteration. That is instability. When your foundation shifts quarterly, nothing built on top of it stays reliable for long. One respondent described migrating from LangChain to Azure in two months, only to consider reverting. This is common.


Integration is the actual hard problem. 46% of respondents in the 2026 State of AI Agents Report cite integration with existing systems as their primary challenge. Not model quality. Not prompt engineering. Connecting an LLM to a CRM, an internal API, or a compliance system in a way that is secure, auditable, and reliable is the bottleneck. The model is the easy part.


Nobody can see what's happening. Fewer than one in three teams are satisfied with their observability and guardrails. 63% plan to fix this in the next year, making visibility the top investment priority. If you cannot trace what your agent did, why it did it, and whether it was correct, you do not have a production system. You have a prototype with users.

What Production-Grade Actually Looks Like
At Systemartis, we build AI agents for environments where getting it wrong has real consequences. Our team's experience spans regulated banking, telecom at scale, and high-volume e-commerce, and those domains have taught us that agent reliability is an engineering problem, not a prompt engineering problem.

When we build MCP (Model Context Protocol) servers for agentic architectures, the requirements are non-negotiable:

  • Validated tool schemas. Every action the agent can take is defined by a strict API contract. The agent cannot improvise. If the schema does not permit an action, the action does not happen.


  • Data boundaries. The agent accesses sensitive data through controlled MCP endpoints with proper authentication propagation. No raw database access. No ambient authority.


  • Human-in-the-loop gates. Actions that modify state require explicit confirmation. Data gathering is automated. Decisions are not.


  • Audit trails. Every tool call, every response, every decision point is logged. In regulated industries, "the AI did it" is not an acceptable explanation.
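The four requirements above can be sketched in a few lines. This is an illustrative Python sketch, not our production code: the tool names, handlers, and the simple type-based schema check are all hypothetical stand-ins for a real contract validator.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("agent.audit")  # audit trail: every call is logged

@dataclass
class Tool:
    name: str
    schema: dict            # allowed parameters and their required types
    handler: Callable
    mutates_state: bool     # True -> requires a human-in-the-loop gate

def call_tool(tool: Tool, params: dict, human_approved: bool = False) -> dict:
    # Validated schema: reject any parameter the contract does not define.
    unknown = set(params) - set(tool.schema)
    if unknown:
        raise ValueError(f"schema violation: unexpected params {unknown}")
    for key, expected_type in tool.schema.items():
        if key not in params or not isinstance(params[key], expected_type):
            raise ValueError(f"schema violation: {key} must be {expected_type.__name__}")
    # Human-in-the-loop gate: state-modifying actions need explicit sign-off.
    if tool.mutates_state and not human_approved:
        raise PermissionError(f"{tool.name} modifies state and requires confirmation")
    # Audit trail: record the call, its arguments, and its outcome.
    result = tool.handler(**params)
    audit.info("tool=%s params=%s result=%s", tool.name, json.dumps(params), result)
    return result

# Hypothetical read-only tool: the agent can check shipment status freely...
get_status = Tool("get_shipment_status", {"order_id": str},
                  lambda order_id: {"status": "in_transit"}, mutates_state=False)

# ...but a refund tool is gated behind explicit human approval.
issue_refund = Tool("issue_refund", {"order_id": str, "amount": float},
                    lambda order_id, amount: {"refunded": amount}, mutates_state=True)
```

The point of the sketch is the shape, not the details: the agent never reaches a handler except through a contract that defines what it may ask, and every path through that contract leaves a log entry behind.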


We applied these principles when building an AI logistics agent for an e-commerce client. The agent queries courier APIs and updates customers proactively via WhatsApp and Telegram, handling roughly 40% of "Where is my order?" support tickets that previously required human attention. It works because the agent's authority is scoped precisely: it can read shipment status and send templated updates. It cannot modify orders, issue refunds, or access data outside its defined scope.


The difference between this and a demo agent is not sophistication. It is discipline.


The Klarna Warning
Klarna's experience is worth studying. Their AI agent handles two-thirds of all customer inquiries across 150 million users, equivalent to 853 full-time employees. Resolution time dropped from 11 minutes to under 2. They saved $60 million.


Then in May 2025, Klarna's CEO admitted the company had "gone too far" with AI-only service and announced a return to hybrid human-AI support.


The lesson is not that agents fail. The lesson is that agents without proper boundaries will eventually hit a case they cannot handle, and if there is no human escalation path, the failure is visible, public, and expensive. Klarna could absorb it. Most companies cannot.


Building Agents That Survive Contact with Reality
The organizations getting real value from AI agents share a few traits that have nothing to do with model selection.


They start with the failure mode, not the happy path. Before building, they ask: what happens when the agent calls the wrong tool? Returns confidently wrong data? Exceeds its authority? If you cannot answer these questions before deployment, you are not ready to deploy.


They treat MCP as an architectural pattern, not a buzzword. Model Context Protocol gives agents controlled access to external systems through defined interfaces. At Systemartis, we build custom MCP servers that act as a translation layer between natural language requests and validated API calls. The agent does not talk to your database. It talks to a server that enforces rules about what it can ask and what it can do.
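A minimal sketch of that translation-layer idea, with all names (`lookup_order`, the endpoint URL, the operation table) invented for illustration: the agent selects from a fixed set of named operations, and the server maps each one to a pre-built, parameterized call. There is no path from the agent to a raw query.

```python
# Hypothetical internal endpoint; stands in for whatever system sits behind the MCP server.
ORDERS_API_URL = "https://internal.example.com/orders"

# The entire surface the agent can touch: operation name -> request template.
ALLOWED_OPERATIONS = {
    "lookup_order":   {"path": "/{order_id}",           "params": ("order_id",)},
    "list_shipments": {"path": "/{order_id}/shipments", "params": ("order_id",)},
}

def translate(operation: str, args: dict) -> str:
    """Turn an agent's structured request into a concrete, validated URL."""
    if operation not in ALLOWED_OPERATIONS:
        raise PermissionError(f"operation not exposed: {operation}")
    spec = ALLOWED_OPERATIONS[operation]
    missing = set(spec["params"]) - set(args)
    extra = set(args) - set(spec["params"])
    if missing or extra:
        raise ValueError(f"bad arguments: missing={missing}, extra={extra}")
    return ORDERS_API_URL + spec["path"].format(**args)
```

Asking for `translate("lookup_order", {"order_id": "A123"})` yields a concrete, auditable request; asking for anything outside the table is refused outright, no matter how the model phrases it.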


They measure reliability, not just capability. Capability is "can the agent do this task?" Reliability is "does the agent do this task correctly 99.7% of the time, and what happens in the other 0.3%?" Most teams optimize for the first question and ignore the second.
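Measuring that second question can be as simple as replaying a labeled suite of requests through the agent's tool-selection step and reporting not just the accuracy rate but what the failures looked like. A toy sketch, where `choose_tool` is a stand-in for whatever routing the real agent performs:

```python
from collections import Counter

def choose_tool(request: str) -> str:
    # Stand-in router for illustration; a real agent would call the model here.
    if "where is my order" in request.lower():
        return "get_shipment_status"
    return "escalate_to_human"

# Each case: (user request, tool the agent should call).
TEST_SUITE = [
    ("Where is my order #123?",  "get_shipment_status"),
    ("WHERE IS MY ORDER",        "get_shipment_status"),
    ("I want a refund",          "escalate_to_human"),
    ("Cancel my subscription",   "escalate_to_human"),
]

def measure(suite):
    failures = Counter()
    correct = 0
    for request, expected in suite:
        actual = choose_tool(request)
        if actual == expected:
            correct += 1
        else:
            failures[(expected, actual)] += 1  # which wrong tool, and how often
    return correct / len(suite), failures

accuracy, failures = measure(TEST_SUITE)
```

The failure counter matters as much as the rate: knowing that the agent confuses refunds with status checks, specifically, is what turns a percentage into an engineering task.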


They build for quarterly stack churn. If 70% of regulated enterprises rebuild their agent stack every three months, the architecture must be modular enough to survive it. Tight coupling to any single framework or provider is a liability.

The market is moving fast. 81% of organizations plan to expand into more complex agent use cases in 2026. But speed without reliability is just a faster way to break things.


The companies that will lead are not the ones deploying the most agents. They are the ones deploying agents that work when it matters.