Running Kalibr in production
Kalibr needs outcomes to learn. If a goal receives fewer than 10 calls per day, learning will be slow.
By default, Kalibr explores 10% of the time. This means 1 in 10 calls goes to a non-optimal path to gather data.
```python
router = Router(
    goal="extract_company",
    paths=["gpt-4o", "claude-sonnet-4-20250514"],
    exploration_rate=0.05,  # 5% exploration
)
```

```typescript
const router = new Router({
  goal: 'extract_company',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
  explorationRate: 0.05, // 5% exploration
});
```

Success rate ALWAYS dominates. Cost and latency only break ties among paths within 5% of the best success rate.
If you remove Kalibr imports, you must replace Router calls with direct SDK calls.
Use force_model/forceModel to bypass routing:
```python
response = router.completion(
    messages=[...],
    force_model="gpt-4o",  # Always use gpt-4o, ignore routing
)
```

```typescript
const response = await router.completion(messages, {
  forceModel: 'gpt-4o', // Always use gpt-4o, ignore routing
});
```

Or replace Router with direct SDK calls:
```python
# From this:
response = router.completion(messages=[...])

# To this:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(model="gpt-4o", messages=[...])

# For HuggingFace tasks, from this:
result = router.execute(task="automatic_speech_recognition", input_data=audio)

# To this:
from huggingface_hub import InferenceClient
client = InferenceClient()
result = client.automatic_speech_recognition(audio)
```
```typescript
// From this:
const response = await router.completion(messages);

// To this:
import OpenAI from 'openai';
const client = new OpenAI();
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages,
});
```

Kalibr handles two types of errors differently:
```python
import time

try:
    response = router.completion(messages=[...])
    # Outcome auto-reported by completion()
except Exception as e:
    # Already auto-reported as failure
    if "RateLimitError" in str(type(e)):
        time.sleep(60)
    else:
        log_error(e)
```

```typescript
try {
  const response = await router.completion(messages);
  // Outcome auto-reported by completion()
} catch (error) {
  // Already auto-reported as failure
  if (error.message.includes('RateLimitError')) {
    await new Promise(r => setTimeout(r, 60000));
  } else {
    console.error(error);
  }
}
```

Implication: If the intelligence service is down, your agent uses the first path until it recovers.
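The fixed 60-second sleep above can be generalized to exponential backoff. A minimal sketch: the wrapper function is ours, not part of the SDK, and it assumes completion() auto-reports outcomes as described above.

```python
import time

def completion_with_retry(router, messages, max_attempts=3):
    """Retry rate-limited calls with exponential backoff.

    Illustrative wrapper (not a Kalibr API). Any non-rate-limit
    exception is re-raised immediately, since the failure has already
    been auto-reported.
    """
    for attempt in range(max_attempts):
        try:
            return router.completion(messages=messages)
        except Exception as e:
            if "RateLimitError" in str(type(e)) and attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
                continue
            raise
```

The router still makes a fresh routing decision on each retry, so a rate-limited path can lose the next selection.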
For chat agents with multiple turns:
```python
router = Router(
    goal="customer_support",
    paths=["gpt-4o", "claude-sonnet-4-20250514"]
)

conversation = [{"role": "user", "content": "I need help"}]

# Turn 1 - router decides model
response1 = router.completion(messages=conversation)
selected_model = response1.model
conversation.append({
    "role": "assistant",
    "content": response1.choices[0].message.content
})

# Turn 2 - force same model
conversation.append({"role": "user", "content": "That didn't work"})
response2 = router.completion(
    messages=conversation,
    force_model=selected_model
)

# Report once at end
router.report(success=issue_resolved)
```

```typescript
const router = new Router({
  goal: 'customer_support',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
});

const conversation: Message[] = [{ role: 'user', content: 'I need help' }];

// Turn 1 - router decides model
const response1 = await router.completion(conversation);
const selectedModel = response1.model;
conversation.push({
  role: 'assistant',
  content: response1.choices[0].message.content,
});

// Turn 2 - force same model
conversation.push({ role: 'user', content: "That didn't work" });
const response2 = await router.completion(conversation, {
  forceModel: selectedModel,
});

// Report once at end
await router.report(issueResolved);
```

Key principles:
Router is not thread-safe. Create one Router instance per thread or async context.
```python
# Two threads using the same router
router = Router(goal="extract", paths=[...])

thread1: router.completion(...)       # Sets trace_id=ABC
thread2: router.completion(...)       # Overwrites trace_id=XYZ
thread1: router.report(success=True)  # Reports for XYZ (WRONG!)
```

```python
# Thread 1
router1 = Router(goal="extract", paths=[...])
router1.completion(...)
router1.report(success=True)

# Thread 2
router2 = Router(goal="extract", paths=[...])
router2.completion(...)
router2.report(success=True)
```
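The one-router-per-thread rule can be packaged with thread-local storage so each thread lazily creates its own instance. A sketch: the helper is ours, not part of the SDK, and the zero-argument factory stands in for Router construction (e.g. `lambda: Router(goal="extract", paths=[...])`).

```python
import threading

_local = threading.local()

def get_router(make_router):
    """Return this thread's router, creating it on first use.

    `make_router` is any zero-arg factory; each thread gets its own
    instance, so trace IDs are never shared across threads.
    """
    if not hasattr(_local, "router"):
        _local.router = make_router()
    return _local.router
```

Within one thread, repeated calls return the same cached instance, so completion() and report() pair up correctly.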
```typescript
const router = new Router({ goal: 'extract', paths: [...] });

// Two async contexts using the same router
context1: await router.completion(...) // Sets trace_id=ABC
context2: await router.completion(...) // Overwrites trace_id=XYZ
context1: await router.report(true)    // Reports for XYZ (WRONG!)
```

```typescript
// Request handler 1
const router1 = new Router({ goal: 'extract', paths: [...] });
await router1.completion(...);
await router1.report(true);

// Request handler 2
const router2 = new Router({ goal: 'extract', paths: [...] });
await router2.completion(...);
await router2.report(true);
```

If routing isn't improving, check these common issues:
Check: Go to your dashboard. Are outcomes appearing for your goal?
Fix: Make sure you're calling router.report(success=...) after every completion.
Check: Do you have >20 outcomes per path for this goal?
Why: Kalibr needs ~20-50 outcomes per path per goal before routing becomes stable. Before that, expect more exploration.
Check: Are both models showing similar success rates (e.g., both at 60%)?
Why: If all paths perform similarly, routing will stay exploratory. This might mean your task is too hard for current models, or your success criteria need refinement.
Check: Are you making at least 10-20 calls per day per goal?
Why: With low traffic, it takes longer to gather enough outcomes. Consider lowering exploration_rate if you need faster convergence.
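As a back-of-envelope check, you can estimate time-to-stable-routing from the numbers above. Illustrative arithmetic only, not an SDK feature; it assumes outcomes spread roughly evenly across paths during early exploration.

```python
# Rough time until routing stabilizes, using the lower end of the
# 20-50 outcomes-per-path guideline. (Hypothetical traffic numbers.)
calls_per_day = 15             # your traffic for this goal
num_paths = 2
outcomes_needed_per_path = 20  # lower end of the guideline

days = (outcomes_needed_per_path * num_paths) / calls_per_day
print(f"~{days:.1f} days until routing stabilizes")  # ~2.7 days
```

At 15 calls/day with two paths, expect roughly three days before routing settles; halving traffic roughly doubles that.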
If the Kalibr intelligence service is unreachable, routing falls back silently to the first path in your paths list. Order your paths accordingly: put your most reliable, always-available path first.
```python
router = Router(
    goal="classify_icp",
    paths=[
        "gpt-4o-mini",    # first = emergency fallback, always available
        "deepseek-chat",  # lower cost, routes here when working well
        "claude-haiku",   # alternative
    ]
)
```

This is not a problem when the intelligence service is healthy, which it is in normal operation. Kalibr fails open: your model calls always succeed; routing just becomes static temporarily.
If Kalibr is unavailable (network error, service down), the SDK falls back to the first path in your list. Your application never crashes due to Kalibr being unreachable.
```python
paths = ["gpt-4o", "claude-sonnet-4-20250514"]  # If Kalibr is down, gpt-4o is used automatically
```

```typescript
const paths = ['gpt-4o', 'claude-sonnet-4-20250514']; // If Kalibr is down, gpt-4o is used automatically
```
Best practice: Put your most reliable path first. This becomes your fallback.
Check the dashboard for paths marked as "degrading". These are paths where recent performance is significantly worse than historical baseline.
Common causes:
When you see a degrading path:
The force_model parameter bypasses routing:
```python
response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)
```

```typescript
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});
```

Use it for:
Don't use it as your default: you lose the learning benefits and won't detect regressions.
Kalibr adds a routing decision before each completion. Typical overhead:
For latency-critical paths, you can:
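One option, built on the force_model parameter shown above: make the routing decision on the first call, then pin the chosen model for the rest of a latency-critical batch. A sketch only: the helper is ours, and it assumes completion() returns an object with a .model attribute, as in the multi-turn example earlier.

```python
def pinned_batch(router, batches):
    """Route once, then pin the selected model for the rest of the batch.

    Illustrative helper (not a Kalibr API). Only the first call pays
    the routing-decision overhead; later calls bypass routing via
    force_model.
    """
    first = router.completion(messages=batches[0])
    results = [first]
    for messages in batches[1:]:
        results.append(router.completion(messages=messages, force_model=first.model))
    return results
```

The trade-off mirrors the warning above: pinned calls gather no comparative data, so use this only where latency genuinely matters.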
get_insights() returns actionable signals that tell you (or your coding agent) what to investigate:
```python
from kalibr import get_insights

insights = get_insights()
for goal in insights["goals"]:
    if goal["status"] in ("failing", "degrading"):
        print(f"{goal['goal']}: {goal['status']}")
        for signal in goal["actionable_signals"]:
            print(f"  {signal['type']}: {signal['data']}")
```

Adding failure_category to report() calls enables the insights engine to cluster failures by type. Instead of parsing free-text error strings, you get clean aggregation: "60% of failures are timeouts."
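In application code, the caught exception can be mapped onto one of the valid categories before reporting. A sketch: the helper and the exception-name matching below are illustrative, not part of the SDK.

```python
# Map exception class names to Kalibr failure categories (category
# names from the docs; the matching rules are our own convention).
CATEGORY_BY_ERROR = {
    "RateLimit": "rate_limited",
    "Timeout": "timeout",
    "Authentication": "auth_error",
    "ContextWindow": "context_exceeded",
}

def categorize(exc: Exception) -> str:
    """Return a failure_category for an exception, or "unknown"."""
    name = type(exc).__name__
    for needle, category in CATEGORY_BY_ERROR.items():
        if needle in name:
            return category
    return "unknown"
```

You would then pass the result as `failure_category=categorize(e)` in the report() call.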
```python
router.report(
    success=False,
    failure_category="timeout",
    reason="Provider timed out after 30s"
)

# Valid categories: timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown
```

When enabled via the auto_path_generation feature flag, Kalibr automatically explores the parameter space by generating new path configurations. A background job runs hourly and:
Feature-flagged per tenant. Contact us to enable.