Production Guide

Running Kalibr in production.

When Kalibr helps most

Multiple models: You have 2+ models that could work, and you don't know which is best
Flaky tools: Model performance varies over time (degradations, updates)
Long-running agents: Enough volume to generate outcome data

Kalibr needs outcomes to learn. If you have <10 calls/day for a goal, learning will be slow.

Failure modes

No outcomes reported

Symptom: Routing stays random
Cause: report() never called
Fix: Add report(success=...) after every completion

Low traffic

Symptom: Routing changes slowly or not at all
Cause: Need ~20-50 outcomes before Kalibr routes confidently
Fix: Wait for more outcomes to accumulate, or add test traffic

Before enough run data is collected

First calls explore randomly
No preference until outcomes are reported
First path in list is fallback if intelligence service is unavailable

Cost & latency

The trust invariant

Success rate ALWAYS dominates. Cost/latency only break ties among paths within 5% of best success rate.

iIf GPT-4o has 95% success and GPT-4o-mini has 85% success, Kalibr routes to GPT-4o regardless of cost. Cost only matters when success rates are within 5% of each other.

Turning Kalibr off

What happens if you remove it

If you remove Kalibr imports, you must replace Router calls with direct SDK calls.

How to fall back safely

Use force_model/forceModel to bypass routing:

python

response = router.completion(
    messages=[...],
    force_model="gpt-4o"  # Always use gpt-4o, ignore routing
)

typescript

const response = await router.completion(messages, {
  forceModel: 'gpt-4o',  // Always use gpt-4o, ignore routing
});

Or replace Router with direct SDK calls:

python

# From this:
response = router.completion(messages=[...])

# To this:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(model="gpt-4o", messages=[...])

# For HuggingFace tasks, from this:
result = router.execute(task="automatic_speech_recognition", input_data=audio)

# To this:
from huggingface_hub import InferenceClient
client = InferenceClient()
result = client.automatic_speech_recognition(audio)

typescript

// From this:
const response = await router.completion(messages);

// To this:
import OpenAI from 'openai';
const client = new OpenAI();
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages,
});

Error Handling Patterns

Provider errors vs Intelligence service errors

Kalibr handles two types of errors differently:

Provider errors (OpenAI, Anthropic, Google, HuggingFace)

Router re-raises the exception to your code
You must handle these with try/except or try/catch
Kalibr auto-reports as failure before raising

python

try:
    response = router.completion(messages=[...])
    # Outcome auto-reported by completion()
except Exception as e:
    # Already auto-reported as failure
    if "RateLimitError" in str(type(e)):
        time.sleep(60)
    else:
        log_error(e)

typescript

try {
  const response = await router.completion(messages);
  // Outcome auto-reported by completion()
} catch (error) {
  // Already auto-reported as failure
  if (error.message.includes('RateLimitError')) {
    await new Promise(r => setTimeout(r, 60000));
  } else {
    console.error(error);
  }
}

Intelligence service errors

Router falls back to first path automatically
Your code keeps running
Logged as warning

Implication: If the intelligence service is down, your agent uses the first path until it recovers.

Multi-turn Conversations

For chat agents with multiple turns:

python

router = Router(
    goal="customer_support",
    paths=["gpt-4o", "claude-sonnet-4-20250514"]
)

conversation = [{"role": "user", "content": "I need help"}]

# Turn 1 - router decides model
response1 = router.completion(messages=conversation)
selected_model = response1.model

conversation.append({
    "role": "assistant",
    "content": response1.choices[0].message.content
})

# Turn 2 - force same model
conversation.append({"role": "user", "content": "That didn't work"})
response2 = router.completion(
    messages=conversation,
    force_model=selected_model
)

# Report once at end
router.report(success=issue_resolved)

typescript

const router = new Router({
  goal: 'customer_support',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
});

const conversation: Message[] = [{ role: 'user', content: 'I need help' }];

// Turn 1 - router decides model
const response1 = await router.completion(conversation);
const selectedModel = response1.model;

conversation.push({
  role: 'assistant',
  content: response1.choices[0].message.content,
});

// Turn 2 - force same model
conversation.push({ role: 'user', content: "That didn't work" });
const response2 = await router.completion(conversation, {
  forceModel: selectedModel,
});

// Report once at end
await router.report(issueResolved);

Key principles:

Use force_model/forceModel to keep the same model across turns
Report once at the end, not after each turn
Build conversation history correctly

Thread Safety

Router is not thread-safe. Create one Router instance per thread or async context.

Wrong (race condition):

python

router = Router(goal="extract", paths=[...])

# Two threads using same router
thread1: router.completion(...)  # Sets trace_id=ABC
thread2: router.completion(...)  # Overwrites trace_id=XYZ
thread1: router.report(success=True)  # Reports for XYZ (WRONG!)

Right:

python

# Thread 1
router1 = Router(goal="extract", paths=[...])
router1.completion(...)
router1.report(success=True)

# Thread 2
router2 = Router(goal="extract", paths=[...])
router2.completion(...)
router2.report(success=True)

Wrong (race condition):

typescript

const router = new Router({ goal: 'extract', paths: [...] });

// Two async contexts using same router
context1: await router.completion(...)  // Sets trace_id=ABC
context2: await router.completion(...)  // Overwrites trace_id=XYZ
context1: await router.report(true)     // Reports for XYZ (WRONG!)

Right:

typescript

// Request handler 1
const router1 = new Router({ goal: 'extract', paths: [...] });
await router1.completion(...);
await router1.report(true);

// Request handler 2
const router2 = new Router({ goal: 'extract', paths: [...] });
await router2.completion(...);
await router2.report(true);

iTypeScript note: In serverless/edge functions, create a new Router per request. In long-running Node.js apps, create separate instances per async context.

Troubleshooting Routing

If routing isn't improving, check these common issues:

1. No outcomes being reported

Check: Go to your dashboard. Are outcomes appearing for your goal?

Fix: Make sure you're calling router.report(success=...) after every completion.

2. Not enough data yet

Check: Do you have >20 outcomes per path for this goal?

Why: Kalibr needs ~20-50 outcomes per path per goal before routing becomes stable. Before that, expect more exploration.

3. Success criteria too noisy

Check: Are both models showing similar success rates (e.g., both at 60%)?

Why: If all paths perform similarly, routing will stay exploratory. This might mean your task is too hard for current models, or your success criteria needs refinement.

4. Low traffic

Check: Are you making at least 10-20 calls per day per goal?

Why: With low traffic, it takes longer to gather enough outcomes for Kalibr to learn which model works best.

Path ordering and fallback

If the Kalibr intelligence service is unreachable, routing falls back silently to the first path in your paths list. Order your paths accordingly: put your most reliable, always-available path first.

python

router = Router(
    goal="classify_icp",
    paths=[
        "gpt-4o-mini",          # first = emergency fallback, always available
        "deepseek-chat",         # lower cost, routes here when working well
        "claude-haiku",          # alternative
    ]
)

This is not a problem if the intelligence service is healthy, which it is in normal operation. Kalibr fails open: your model calls always succeed, routing just becomes static temporarily.

Graceful Degradation

If Kalibr is unavailable (network error, service down), the SDK falls back to the first path in your list. Your application never crashes due to Kalibr being unreachable.

python

paths = ["gpt-4o", "claude-sonnet-4-20250514"]
# If Kalibr is down, gpt-4o is used automatically

typescript

const paths = ['gpt-4o', 'claude-sonnet-4-20250514'];
// If Kalibr is down, gpt-4o is used automatically

Best practice: Put your most reliable path first. This becomes your fallback.

Trend Monitoring

Check the dashboard for paths marked as "degrading". These are paths where recent performance is significantly worse than historical baseline.

Common causes:

Provider model updates (silent changes to model behavior)
Changes in your input distribution
Upstream API issues or rate limiting
Prompt changes that affect certain models differently

When you see a degrading path:

Check provider status pages
Review recent changes to your prompts or inputs
Consider temporarily disabling the path if degradation is severe

When to Use force_model

The force_model parameter bypasses routing:

python

response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)

typescript

const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});

Use it for:

Debugging specific model behavior
Reproducing customer-reported issues
Load testing a specific provider
Temporary workarounds during incidents

Don't use it as your default: you lose the learning benefits and won't detect regressions.

Latency Overhead

Kalibr adds a routing decision before each completion. Typical overhead:

Cold (first request): 50-100ms
Warm (cached routing state): 10-30ms

For latency-critical paths, you can:

Use get_policy() to cache recommendations
Lower exploration rate to reduce variability
Use force_model for paths where latency is critical

Monitoring with Insights API

get_insights() returns actionable signals that tell you (or your coding agent) what to investigate:

path_underperforming, a path is significantly worse than alternatives
failure_mode_dominant, one failure type dominates
drift_detected, performance is degrading over time
cost_inefficiency, overpaying for quality you don't need
param_sensitivity_detected, a parameter value significantly affects outcomes
low_confidence, not enough data to be sure
goal_healthy, everything is working

python

from kalibr import get_insights

insights = get_insights()
for goal in insights["goals"]:
    if goal["status"] in ("failing", "degrading"):
        print(f"{goal['goal']}: {goal['status']}")
        for signal in goal["actionable_signals"]:
            print(f"  {signal['type']}: {signal['data']}")

Structured Failure Categories for Debugging

Adding failure_category to report() calls enables the insights engine to cluster failures by type. Instead of parsing free-text error strings, you get clean aggregation: "60% of failures are timeouts."

python

router.report(success=False, failure_category="timeout",
              reason="Provider timed out after 30s")

# Valid categories: timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown

Auto Path Generation

When enabled via feature flag (auto_path_generation), Kalibr automatically explores parameter space by generating new path configurations. A background job runs hourly and:

Identifies numeric parameters where the best value is at the boundary of explored space
Generates new paths with interpolated values
Lets exploration traffic evaluate them automatically
Cleans up paths that underperform (>20pp below best) after 30+ samples
Caps at 5 auto-generated paths per goal, max 3 new per run

Feature-flagged per tenant. Contact us to enable.

FAQ. Common questions

Production Guide

When Kalibr helps most

Failure modes

No outcomes reported

Low traffic

Before enough run data is collected

Cost & latency

The trust invariant

Turning Kalibr off

What happens if you remove it

How to fall back safely

Error Handling Patterns

Provider errors vs Intelligence service errors

Provider errors (OpenAI, Anthropic, Google, HuggingFace)

Intelligence service errors

Multi-turn Conversations

Thread Safety

Wrong (race condition):

Right:

Wrong (race condition):

Right:

Troubleshooting Routing

1. No outcomes being reported

2. Not enough data yet

3. Success criteria too noisy

4. Low traffic

Path ordering and fallback

Graceful Degradation

Trend Monitoring

When to Use force_model

Latency Overhead

Monitoring with Insights API

Structured Failure Categories for Debugging

Auto Path Generation

Next