Why Your AI Agent Works in Dev and Silently Fails in Production
You shipped the agent. Tests pass. The demo looked great. Three days in, a user reports that it's been returning nonsense for two days and nobody noticed.
No exception was raised. No alert fired. The HTTP response was 200 the whole time.
This is the production AI agent problem that nobody talks about enough: silent failure. It's different from the failures we've spent decades learning to handle. It doesn't crash. It doesn't timeout. It just... gets it wrong, quietly, at scale.
The Dev/Prod Gap Is Bigger Than You Think
When you test an AI agent in development, you're testing a narrow slice:
- One or two models, probably the ones you have API keys for
- Stable API behavior. No degradation, no rate limiting pressure
- Hand-picked test inputs that you wrote yourself, which naturally cover only the cases you thought of
- Your own judgment about whether the output is "good"
Production is different in ways that are hard to reproduce locally:
Model behavior drifts. OpenAI, Anthropic, and Google update their models without notice. The same prompt that returned clean JSON last week might return markdown-wrapped JSON today. Your parser breaks. The agent "succeeds" from the infrastructure perspective and fails from the business perspective.
Latency patterns change. Under load, p99 latency spikes. Models that respond in 2 seconds under light traffic start blowing past 30-second timeouts. Your fallback either doesn't exist or is a hardcoded second model whose output format your code doesn't account for.
Input distribution shifts. Real users are not you. They ask things you didn't anticipate. The prompt you wrote handles your test cases; it handles 60% of real inputs well.
Tool reliability varies. External APIs your agent calls have their own degradation patterns. Search APIs go down. Document parsers return unexpected formats. RAG retrieval quality degrades as your index grows stale.
None of these show up in your test suite. All of them hit production.
What Silent Failure Actually Looks Like
Here's a concrete example. You have an agent that extracts structured data from user-submitted text:
```python
import json

import openai

client = openai.OpenAI()

def extract_order_data(raw_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract order data as JSON with fields: item, quantity, address"},
            {"role": "user", "content": raw_text},
        ],
    )
    content = response.choices[0].message.content
    return json.loads(content)  # This will raise if content isn't valid JSON
```
You add a try/except:
```python
import logging

logger = logging.getLogger(__name__)

def extract_order_data(raw_text: str) -> dict:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract order data as JSON with fields: item, quantity, address"},
                {"role": "user", "content": raw_text},
            ],
        )
        content = response.choices[0].message.content
        return json.loads(content)
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        return {}
```
This looks like error handling. It's not. Here's what happens in production after a model update:
- The model starts wrapping its JSON in markdown code fences: `` ```json\n{...}\n``` ``
- `json.loads()` raises a `JSONDecodeError`
- The exception is caught
- An empty dict `{}` is returned
- Downstream code receives `{}`, treats it as an "empty order," and does... something. Maybe it logs it. Maybe it passes it along. Maybe it sends the user a confirmation for an order with no items.
- Your metrics show 200 OK. No errors in your error tracking. Request latency is normal.
You find out two days later when a user complains.
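To see just how quietly this fails, here's the failure path in isolation (a minimal repro; the JSON payload is made up for illustration):

```python
import json

# What the model returns after it starts wrapping JSON in markdown fences
drifted_output = '```json\n{"item": "desk", "quantity": 2, "address": "14 Elm St"}\n```'

try:
    result = json.loads(drifted_output)  # raises json.JSONDecodeError on the fence
except Exception:
    result = {}  # the catch-all swallows it

print(result)  # {} -- no exception escapes, and the request still returns 200
```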
The fix most teams reach for is to add JSON-parsing cleanup:
```python
import re

def clean_json_response(content: str) -> str:
    # Strip markdown code fences
    content = re.sub(r'^```json\n?', '', content, flags=re.MULTILINE)
    content = re.sub(r'\n?```$', '', content, flags=re.MULTILINE)
    return content.strip()
```
That fixes this specific breakage. Then the model changes something else. You add another patch. And another. This is the whack-a-mole treadmill.
The Human-in-Loop Problem
The typical response to AI reliability issues is "add human review." Put a human in the loop to catch the bad outputs.
This fails at scale for an obvious reason: if humans are reviewing every output, you've eliminated the efficiency gain of the agent. And if they're reviewing a sample, you're still missing most of the failures.
More subtly: humans reviewing AI outputs develop review fatigue. After seeing 500 good outputs, reviewers start rubber-stamping. The 501st bad one slips through.
The observability tools (Langfuse, LangSmith, Helicone) try to address this by giving you dashboards. You can see traces, compare outputs, track cost. But they're still fundamentally human-driven. Somebody has to look at the dashboard, notice the anomaly, diagnose it, and push a fix. That loop takes hours to days. Meanwhile your agent is failing.
Observability is not reliability. Knowing your house is on fire is not the same as having a fire suppression system.
Outcome-Aware Routing: Define What "Good" Means, Let Production Tell You What Works
The core insight: if you can define what a successful outcome looks like, even loosely, you can measure it automatically and route away from paths that are failing.
This is what Kalibr does. You define a success function. Kalibr uses Thompson Sampling to route between model+prompt+tool combinations based on real outcome signals, shifting traffic toward what's working.
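Thompson Sampling itself is worth a quick sketch. The idea: keep a Beta posterior over each path's success rate, sample from every posterior on each request, and send the request down the path with the highest sample. Paths that keep failing get sampled low and lose traffic; paths that keep succeeding win it. Here's a minimal, generic illustration of that loop (this is the textbook bandit algorithm, not Kalibr's internals; the class and names are made up for illustration):

```python
import random

class ThompsonRouter:
    """Illustrative Thompson Sampling over N paths (not Kalibr's implementation)."""

    def __init__(self, paths, success_fn):
        self.paths = paths
        self.success_fn = success_fn
        # One success/failure count per path -> Beta(success+1, failure+1) posterior
        self.stats = [{"success": 0, "failure": 0} for _ in paths]

    def run(self, *args, **kwargs):
        # Sample a plausible success rate for each path from its posterior
        samples = [
            random.betavariate(s["success"] + 1, s["failure"] + 1)
            for s in self.stats
        ]
        idx = samples.index(max(samples))  # pick the path with the best sample
        try:
            result = self.paths[idx](*args, **kwargs)
            ok = self.success_fn(result)
        except Exception:
            result, ok = None, False
        # Feed the real outcome back into that path's posterior
        self.stats[idx]["success" if ok else "failure"] += 1
        return result
```

The point of the sketch is the feedback loop: the same success function that defines "good" is what updates the posterior, so routing tracks real outcomes instead of a hardcoded priority order.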
Here's the same extraction agent, rewritten with outcome-aware routing:
```python
import kalibr  # Must be first, before OpenAI import
import openai
import json
import re

client = openai.OpenAI()

def is_valid_order(result: dict) -> bool:
    """Define what success looks like."""
    return (
        isinstance(result, dict)
        and "item" in result
        and "quantity" in result
        and "address" in result
        and bool(result["item"])  # non-empty
    )

def extract_with_gpt4o(raw_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract order data as JSON: {item, quantity, address}"},
            {"role": "user", "content": raw_text},
        ],
    )
    content = response.choices[0].message.content
    # Strip markdown fences if present
    content = re.sub(r'^```json\n?|\n?```$', '', content.strip())
    return json.loads(content)

def extract_with_claude(raw_text: str) -> dict:
    import anthropic

    anthropic_client = anthropic.Anthropic()  # local client; separate from the OpenAI one
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[
            {"role": "user", "content": f"Extract as JSON (item, quantity, address): {raw_text}"}
        ],
    )
    content = response.content[0].text
    content = re.sub(r'^```json\n?|\n?```$', '', content.strip())
    return json.loads(content)

router = kalibr.Router(
    paths=[extract_with_gpt4o, extract_with_claude],
    success_fn=is_valid_order,
    task="order-extraction",
)

def extract_order_data(raw_text: str) -> dict:
    return router.run(raw_text)
```
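Calling it looks the same as before; the routing happens inside `router.run`. A quick usage sketch (the input string and the extracted dict are illustrative; the actual output depends on what the chosen model returns):

```python
# Hypothetical input, for illustration only
order = extract_order_data("Hi, I'd like 2 standing desks shipped to 14 Elm St, Portland")
print(order)  # e.g. {"item": "standing desk", "quantity": 2, "address": "14 Elm St, Portland"}

if not is_valid_order(order):
    # Even with routing, keep a last-resort handler for requests where every path fails
    raise ValueError("Could not extract a valid order")
```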
What changed:
- You defined success. `is_valid_order` is your success criterion. It's explicit, testable, and simple.
- You gave it alternatives. Two extraction paths. Both are valid; the router will learn which works better right now.
- Production outcome signals drive routing. If gpt-4o starts failing `is_valid_order` on 30% of requests, Kalibr shifts traffic to the Claude path automatically, without a human noticing and deploying a fix.
The benchmark numbers: during model degradation events, hardcoded systems (even ones with fallback logic) hit 16-36% success rates. Outcome-aware routing maintains 88-100%. See the full benchmark.
The Dev Testing Problem Doesn't Go Away: It Gets Smaller
Outcome-aware routing doesn't eliminate the need to test. You still need to validate your success function. You still need to make sure your paths actually work.
What it eliminates is the need for your test suite to predict every possible production failure mode. You can't anticipate model drift, input distribution shifts, or external API degradation in advance. What you can do is write a clear success function and let the production system adapt.
This shifts your testing focus:
- Test your success function: make sure `is_valid_order` correctly classifies good and bad outputs (see the sketch after this list)
- Test each path in isolation: make sure the individual extraction functions work correctly
- Don't try to test every production edge case: let production discover those and route around them
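A success function this simple is also trivial to unit-test. A minimal pytest-style sketch (the sample dicts are made up):

```python
def test_complete_order_passes():
    assert is_valid_order({"item": "desk", "quantity": 2, "address": "14 Elm St"})

def test_missing_address_fails():
    assert not is_valid_order({"item": "desk", "quantity": 2})

def test_empty_item_fails():
    assert not is_valid_order({"item": "", "quantity": 1, "address": "14 Elm St"})

def test_non_dict_fails():
    assert not is_valid_order(["desk", 2, "14 Elm St"])
```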
When to Use Kalibr
- You have an agent in production (or near production) that calls LLMs on real user data
- You can define what a successful output looks like, even if it's just a simple boolean check
- You want the system to adapt to model changes without manual intervention
- You have (or can write) at least two viable execution paths
When Not to Use Kalibr
- You're still in early prototyping: figure out your paths first
- You can't define success programmatically: if "good output" requires human judgment every time, routing can't help you
- You have a single LLM call with no alternatives: routing needs alternatives to route between
- Your failure mode is catastrophic or irreversible: outcome-aware routing reduces the failure rate; it doesn't eliminate it. If a single wrong answer has severe consequences, human review is still the right answer.
The Real Fix
The dev/prod gap exists because production has information your dev environment doesn't: real inputs, real model behavior right now, real failure patterns. The only way to close that gap is to bring production feedback into the routing decision.
Stop hardcoding paths and hoping they hold. Define what success looks like, give the system options, and let production tell you what's working.
That's outcome-aware routing. And it's the thing that turns "my agent silently fails sometimes" into "my agent routes around failures automatically."
*Want to dig deeper? See the production checklist post for the full set of reliability requirements, or Kalibr's docs for the SDK reference.*
Kalibr keeps complex AI agents running without human intervention.
Get started free