April 2026

The Production Agent Checklist: What Every AI Agent Needs Before It Touches Real Users

Most AI agents that reach production aren't ready for it. They work in demos. They pass the tests the developer wrote. Then they hit real users and start failing in ways that are hard to detect and harder to debug.

This is a practical checklist. Not "10 tips to improve your AI" and not a sales pitch: a real pre-flight list for teams shipping Python agents to production. Work through it before you flip the traffic switch.


1. Error Handling That Actually Handles Errors

The wrong version:

def call_llm(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

This crashes on rate limits, network errors, and API outages. It also returns empty strings or None if the model returns an unexpected response format, which happens more than you'd think.

The right version:

import time
import logging
from typing import Optional

import openai

logger = logging.getLogger(__name__)

def call_llm(prompt: str, max_retries: int = 3) -> Optional[str]:
    last_error = None

    for attempt in range(max_retries):
        try:
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                timeout=30
            )
            content = response.choices[0].message.content
            if not content or not content.strip():
                logger.warning(f"Empty response on attempt {attempt + 1}")
                continue
            return content

        except openai.RateLimitError as e:
            wait = 2 ** attempt  # exponential backoff
            logger.warning(f"Rate limited, waiting {wait}s (attempt {attempt + 1})")
            time.sleep(wait)
            last_error = e

        except openai.APITimeoutError as e:
            logger.warning(f"Timeout on attempt {attempt + 1}")
            last_error = e

        except openai.APIError as e:
            logger.error(f"API error: {e}")
            last_error = e
            break  # Don't retry on other API errors (likely a bad request)

    logger.error(f"All attempts failed. Last error: {last_error}")
    return None

Checklist items here:

- External calls are wrapped in try/except with specific exception types, not a bare except.
- Transient errors (rate limits, timeouts) are retried; permanent errors are not.
- Every failure path is logged with enough context to debug later.
- Functions return a defined value (here None) instead of crashing the caller.


2. Retry Logic With Jitter

Exponential backoff without jitter causes a thundering herd: all your retrying clients hit the API at the same time, get rate limited again, back off by the same amount, and pile up again.

import random
import time

import openai

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: random value between 0 and min(cap, base * 2^attempt)"""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Usage
max_retries = 5
for attempt in range(max_retries):
    try:
        result = call_llm(prompt)
        break
    except openai.RateLimitError:
        if attempt < max_retries - 1:
            sleep_time = backoff_with_jitter(attempt)
            time.sleep(sleep_time)

The tenacity library handles this well if you don't want to roll it yourself:

import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(5)
)
def call_llm_with_retry(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Checklist:

- Backoff is exponential and capped.
- Jitter is applied so clients don't retry in lockstep.
- Retries are bounded, and the final failure surfaces instead of looping forever.


3. Fallback Paths

A fallback is a different execution path you switch to when the primary path fails. This is distinct from retrying: a retry hits the same path again; a fallback tries something different.

Common fallback patterns:

from typing import Optional

def extract_with_gpt4o(text: str) -> Optional[dict]:
    # Primary path
    ...

def extract_with_claude(text: str) -> Optional[dict]:
    # Fallback path
    ...

def extract_with_regex(text: str) -> Optional[dict]:
    # Last-resort deterministic fallback
    import re
    # Simple pattern matching: less capable, but it doesn't depend on an external API
    ...

def extract_order(text: str) -> dict:
    result = extract_with_gpt4o(text)
    if result:
        return result

    logger.warning("GPT-4o extraction failed, trying Claude")
    result = extract_with_claude(text)
    if result:
        return result

    logger.warning("Claude extraction failed, trying regex fallback")
    result = extract_with_regex(text)
    if result:
        return result

    raise ValueError("All extraction paths failed")

This is better than no fallback. It has a serious problem though: the fallback selection is static. You wrote the order once, and it stays that way forever. If Claude starts outperforming GPT-4o in production, your code still tries GPT-4o first every time.

We'll address this in Post 3 on dynamic routing, but the checklist item here is simply: do you have a fallback at all?

Checklist:

- Every critical task has at least one fallback path.
- The last-resort fallback doesn't share a failure mode with the primary path.
- Fallback usage is logged so you can see how often the primary path fails.


4. Outcome Tracking

This is the one most teams skip, and it's the one that matters most for long-term reliability.

Logging requests and responses is not outcome tracking. Outcome tracking is recording whether the agent achieved its goal for each request.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentOutcome:
    request_id: str
    task: str
    success: bool
    path_used: str  # which model/tool combination
    latency_ms: float
    input_tokens: Optional[int]
    output_tokens: Optional[int]
    error: Optional[str]
    metadata: dict

def track_outcome(outcome: AgentOutcome):
    # Send to your metrics system
    # Could be Datadog, Prometheus, a database, whatever
    metrics.increment(
        "agent.outcome",
        tags=[
            f"task:{outcome.task}",
            f"success:{outcome.success}",
            f"path:{outcome.path_used}"
        ]
    )
    if outcome.latency_ms > 5000:
        metrics.increment("agent.slow_request", tags=[f"task:{outcome.task}"])

The key is defining "success" programmatically. For every agent task, you need to be able to answer: did this work?

def is_extraction_successful(result: Optional[dict]) -> bool:
    if not result:
        return False
    required_fields = {"item", "quantity", "address"}
    return required_fields.issubset(result.keys()) and all(result[f] for f in required_fields)

# After every extraction:
success = is_extraction_successful(result)
track_outcome(AgentOutcome(
    request_id=request_id,
    task="order-extraction",
    success=success,
    path_used="gpt-4o",
    latency_ms=elapsed_ms,
    ...
))

Checklist:

- Every task has a programmatic definition of success.
- Each request records success, path used, latency, and token counts.
- Success rates are visible per task and per path, not just aggregate error counts.


5. Cost Monitoring

LLM costs are variable and can spike unexpectedly. An agent bug that causes excessive retrying or unusually long prompts can cost you serious money before you notice.

# Rough cost per 1K tokens (check current pricing)
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015},
    "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    if model not in COST_PER_1K_TOKENS:
        return 0.0
    rates = COST_PER_1K_TOKENS[model]
    return (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])

def call_with_cost_tracking(prompt: str, model: str = "gpt-4o") -> tuple[str, float]:
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    usage = response.usage
    cost = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)

    # Alert if single call is unusually expensive
    if cost > 0.10:  # $0.10 threshold, tune for your use case
        logger.warning(f"Expensive LLM call: ${cost:.4f} ({usage.prompt_tokens} input, {usage.completion_tokens} output)")

    return response.choices[0].message.content, cost
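
Per-call alerts catch a single expensive request, but runaway retries show up as aggregate spend. Here's a minimal sketch of a rolling budget guard on top of call_with_cost_tracking above; the DailyBudget helper and the $50/day limit are illustrative, not from any library:

import time
from threading import Lock

class DailyBudget:
    """Tracks cumulative LLM spend and flags when a daily limit is crossed."""
    def __init__(self, limit_usd: float = 50.0):  # illustrative limit, tune for your traffic
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.window_start = time.time()
        self._lock = Lock()

    def record(self, cost: float) -> bool:
        """Add a call's cost; returns True while under budget."""
        with self._lock:
            # Reset the window every 24 hours
            if time.time() - self.window_start > 86400:
                self.window_start = time.time()
                self.spent_usd = 0.0
            self.spent_usd += cost
            if self.spent_usd > self.limit_usd:
                logger.error(f"Daily LLM budget exceeded: ${self.spent_usd:.2f}")
                return False
            return True

budget = DailyBudget(limit_usd=50.0)

def call_with_budget(prompt: str, model: str = "gpt-4o") -> str:
    content, cost = call_with_cost_tracking(prompt, model)
    if not budget.record(cost):
        # Decide your policy: alert only, switch to a cheaper model, or stop
        logger.warning("Budget exceeded, consider routing to a cheaper model")
    return content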

Checklist:

- Cost per call is estimated from actual token usage.
- Unusually expensive single calls trigger an alert.
- Cumulative spend is tracked against a budget, not just per-call cost.


6. Observability vs. Reliability: Don't Confuse Them

This is where teams make a category error.

Observability tools (LangSmith, Langfuse, Helicone, Weights & Biases) give you visibility into what's happening. Traces, spans, prompt logs, output comparison. They're genuinely useful for debugging and evaluation. Use them.

Reliability tools ensure the agent keeps working when things go wrong. Retries, fallbacks, circuit breakers, outcome-based routing. These operate at request time, not review time.

The difference: observability tells you your agent is failing. Reliability keeps it from failing, or recovers it automatically.
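
To make the distinction concrete, here's a minimal sketch building on the call_llm helper from section 1; call_llm_fallback is hypothetical. The first wrapper only records what happened for later review; the second changes what the caller gets at request time.

import time
from typing import Optional

# Observability: record what happened so a human can review it later.
def traced_call(prompt: str) -> Optional[str]:
    start = time.time()
    result = call_llm(prompt)
    logger.info(
        f"llm_call success={result is not None} "
        f"latency_ms={(time.time() - start) * 1000:.0f}"
    )
    return result  # a failure is logged, but the caller still gets None

# Reliability: change behavior at request time so the caller gets a result.
def reliable_call(prompt: str) -> Optional[str]:
    result = call_llm(prompt)  # primary path (already retries internally)
    if result is None:
        result = call_llm_fallback(prompt)  # hypothetical different path, not another retry
    return result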

Here's an honest comparison of tools that often get conflated:

| | Kalibr | LangSmith | OpenRouter |
|---|---|---|---|
| Primary purpose | Outcome-based path routing | Tracing, evaluation, debugging | Model gateway (cost/latency) |
| Adapts at runtime? | Yes, reroutes based on outcomes | No, dashboards for humans | Partial, routes by cost/latency, not outcomes |
| Success signal | Your programmatic success function | Human eval / labeled data | None (cost and latency only) |
| When it helps | Model degrades, tool fails, path breaks in production | Debugging why something failed, evaluating prompt quality | Reducing cost, hitting multiple providers |
| Requires human? | No, adapts automatically | Yes, someone looks at the dashboard | No |
| Learning mechanism | Thompson Sampling on outcome signals | N/A | Static rules or weighted routing |

These are not competing tools. A production agent might legitimately use all three:

- LangSmith (or similar) for tracing, evaluation, and debugging.
- OpenRouter (or similar) as a gateway for provider access and cost control.
- Kalibr for outcome-based routing and automatic recovery at request time.

See Kalibr's docs for how the SDK fits into an existing stack.


7. Output Validation

Never trust LLM output directly. Validate it before passing it to anything downstream.

import json
from pydantic import BaseModel, ValidationError
from typing import Optional

class OrderData(BaseModel):
    item: str
    quantity: int
    address: str
    notes: Optional[str] = None

def parse_and_validate_order(llm_output: str) -> Optional[OrderData]:
    # Clean up common formatting issues
    content = llm_output.strip()

    # Strip markdown code fences
    if content.startswith("```"):
        lines = content.split("\n")
        content = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])

    try:
        data = json.loads(content)
        return OrderData(**data)
    except json.JSONDecodeError as e:
        logger.warning(f"JSON parse failed: {e}. Raw: {content[:200]}")
        return None
    except ValidationError as e:
        logger.warning(f"Schema validation failed: {e}")
        return None

Checklist:

- LLM output is validated against a schema before anything downstream uses it.
- Validation failures are logged with enough of the raw output to debug.
- Common formatting noise (code fences, stray whitespace) is handled before parsing.


8. Rate Limiting and Circuit Breakers

Your agent should protect the APIs it calls, not just itself.

import time
from threading import Lock

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed = normal, open = blocking, half-open = testing
        self._lock = Lock()

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == "open":
                if time.time() - self.last_failure_time > self.timeout:
                    self.state = "half-open"
                else:
                    raise Exception("Circuit breaker open, service unavailable")

        try:
            result = func(*args, **kwargs)
            with self._lock:
                if self.state == "half-open":
                    self.state = "closed"
                self.failures = 0
            return result
        except Exception:
            with self._lock:
                self.failures += 1
                self.last_failure_time = time.time()
                if self.failures >= self.failure_threshold:
                    self.state = "open"
                    logger.error(f"Circuit breaker opened after {self.failures} failures")
            raise

# Usage
openai_breaker = CircuitBreaker(failure_threshold=5, timeout=30)

def call_openai_safe(prompt: str) -> str:
    return openai_breaker.call(call_llm, prompt)
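
The circuit breaker protects you from a provider that's already failing; a client-side rate limiter keeps you from causing that failure in the first place. A minimal sliding-window sketch, with illustrative limits you'd tune to your provider's quota:

from collections import deque
import time
from threading import Lock

class RateLimiter:
    """Blocks until a request slot is free within a sliding window."""
    def __init__(self, max_calls: int = 60, window_seconds: float = 60.0):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.calls = deque()
        self._lock = Lock()

    def acquire(self):
        while True:
            with self._lock:
                now = time.time()
                # Drop timestamps that have left the window
                while self.calls and now - self.calls[0] > self.window_seconds:
                    self.calls.popleft()
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                wait = self.window_seconds - (now - self.calls[0])
            time.sleep(max(wait, 0.05))

limiter = RateLimiter(max_calls=60, window_seconds=60)

def call_openai_limited(prompt: str) -> str:
    limiter.acquire()                              # wait for a free slot
    return openai_breaker.call(call_llm, prompt)   # then go through the breaker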

Checklist:

- A circuit breaker stops hammering a provider that's already failing.
- Client-side rate limiting keeps you inside provider quotas.
- State transitions (open, half-open, closed) are logged so outages are visible.


9. Timeouts Everywhere

This is short because it's simple: set explicit timeouts on everything.

import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

async def call_with_timeout(prompt: str, timeout_seconds: float = 30) -> Optional[str]:
    loop = asyncio.get_running_loop()

    with ThreadPoolExecutor() as executor:
        try:
            result = await asyncio.wait_for(
                loop.run_in_executor(executor, call_llm, prompt),
                timeout=timeout_seconds
            )
            return result
        except asyncio.TimeoutError:
            logger.warning(f"LLM call timed out after {timeout_seconds}s")
            return None

Checklist:

- Every LLM call, tool call, and HTTP request has an explicit timeout.
- Nothing relies on library defaults.
- Timeouts are logged and counted as failures in outcome tracking.


Putting It Together: The Minimum Viable Production Agent

Here's what a minimal production-ready agent looks like, integrating the items above:

import kalibr  # First, before any model SDK imports
import openai
import time
import logging
import json
from typing import Optional
from pydantic import BaseModel

logger = logging.getLogger(__name__)

class ExtractionResult(BaseModel):
    item: str
    quantity: int
    address: str

def success_fn(result: Optional[ExtractionResult]) -> bool:
    return result is not None

def extract_gpt4o(text: str) -> Optional[ExtractionResult]:
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract order fields as JSON: item, quantity, address"},
                {"role": "user", "content": text}
            ],
            timeout=20
        )
        content = response.choices[0].message.content.strip()
        # Strip markdown code fences before parsing (removeprefix/removesuffix, Python 3.9+)
        content = content.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        return ExtractionResult(**json.loads(content))
    except Exception as e:
        logger.warning(f"GPT-4o extraction error: {e}")
        return None

def extract_claude(text: str) -> Optional[ExtractionResult]:
    try:
        import anthropic
        ac = anthropic.Anthropic()
        response = ac.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            messages=[{"role": "user", "content": f"Extract as JSON (item, quantity, address): {text}"}]
        )
        content = response.content[0].text.strip()
        content = content.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        return ExtractionResult(**json.loads(content))
    except Exception as e:
        logger.warning(f"Claude extraction error: {e}")
        return None

# Kalibr router: outcome-based routing between paths
router = kalibr.Router(
    paths=[extract_gpt4o, extract_claude],
    success_fn=success_fn,
    task="order-extraction"
)

def process_order(text: str) -> Optional[ExtractionResult]:
    start = time.time()
    result = router.run(text)
    elapsed_ms = (time.time() - start) * 1000

    logger.info(f"Extraction {'succeeded' if result else 'failed'} in {elapsed_ms:.0f}ms")
    return result

This isn't complete production code; you'd add cost tracking, circuit breakers, and proper metrics. But it covers the core: validated output, multiple paths, and outcome-aware routing that adapts automatically.


The Checklist, Condensed

Error handling:

- Specific exceptions, logged failures, safe return values instead of crashes.

Retries:

- Exponential backoff with jitter, capped and bounded.

Fallbacks:

- At least one alternative path per critical task, ideally not sharing the primary path's failure mode.

Outcome tracking:

- A programmatic success definition per task; success, path, latency, and tokens recorded per request.

Cost monitoring:

- Cost estimated from real token usage; alerts on expensive calls and on cumulative spend.

Validation:

- Schema validation on every LLM output before downstream use.

Infrastructure:

- Timeouts everywhere, circuit breakers on external services, client-side rate limiting.

If you can check every box, your agent is ready for production. Most teams can't check them all on day one, and that's fine. Work through it in priority order.


*Related: Why Your AI Agent Works in Dev and Silently Fails in Production covers the detection problem in more depth. Stop Hardcoding Model Fallbacks covers outcome-based routing in detail.*

Kalibr keeps complex AI agents running without human intervention.

Get started free