Kalibr routes each agent request to the lowest-cost model still succeeding for that task. These are the building blocks it uses to make that decision, and to keep improving it over time.
Kalibr works the same way regardless of who is doing the wiring. What changes is who defines the goals, writes the success criteria, and iterates on routing configuration.
In manual setup, you define goals, write success_when callbacks, and call router.report() yourself. You get full control, and it works well when success criteria are clear up front.
You own goal definition, success criteria, path selection, and outcome interpretation.
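A minimal sketch of this manual wiring, assuming the get_policy / success_when / router.report shapes used elsewhere in this doc. The StubRouter below is a hypothetical stand-in so the example runs without Kalibr installed; it is not the real SDK:

```python
# Illustrative only: StubRouter stands in for Kalibr's client so this
# sketch runs standalone. Real call signatures are shown later in this doc.

class StubRouter:
    def get_policy(self, goal, constraints=None):
        # A real router would return the cheapest path still succeeding.
        return {"model": "gpt-4o", "tools": ["calendar_api"]}

    def report(self, success, reason=None, score=None):
        print(f"reported success={success} reason={reason} score={score}")

def success_when(result):
    # Success criterion defined up front: a booking ID must be present.
    return bool(result.get("booking_id"))

router = StubRouter()
policy = router.get_policy(goal="book_meeting")

result = {"booking_id": "abc123"}  # stand-in for executing the chosen path
if success_when(result):
    router.report(success=True, score=0.9)
else:
    router.report(success=False, reason="no_booking_id")
```

The point is the division of labor: you pick the policy, run it, interpret the result, and close the loop with a report.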
In agent-orchestrated setup, a coding agent (Claude Code, OpenClaw, Cursor) acts as the engineer. It defines goals, sets success criteria, instruments code, reads insights, adds paths, and iterates. The outcome-reporting burden disappears because the agent handles it.
The agent owns all of the above; Kalibr owns routing decisions, statistical learning, and degradation detection.
Either way, Kalibr does the same thing: route to the cheapest path still succeeding, detect when paths degrade, and improve over time.
When an agent is the orchestrator, it runs a continuous improvement loop:
It calls get_insights() to read failure modes, path performance, and trends.

This also resolves the cold start: the agent monitors sample_count and confidence per path, and proactively adds exploration traffic or new paths when learning is slow.
A goal is a task with a consistent success criterion.
Good goals:
- book_meeting
- extract_company
- classify_ticket
- generate_sql

Bad goals:

- handle_request (too vague)
- llm_call (no success criterion)

Each goal gets its own routing state. Kalibr learns independently for each.
Split into separate goals when the success criterion differs:

- extract_company vs extract_company_with_domain
- summarize_email vs summarize_transcript

A path is a complete execution configuration. Paths work across any modality: text LLMs, voice models, image generators, embedding models, and any model on HuggingFace.
Just models (Python):
```python
paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"]
```
Just models (TypeScript):
```typescript
const paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"];
```
Model + tool combinations (Python):
```python
paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"]},
    {"model": "gpt-4o", "tools": ["google_calendar"]},
    {"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]
```

Model + tool combinations (TypeScript):
```typescript
const paths = [
  { model: "gpt-4o", tools: ["calendar_api"] },
  { model: "gpt-4o", tools: ["google_calendar"] },
  { model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];
```

Model + tool + parameter combinations (Python):
```python
paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}},
]
```

Model + tool + parameter combinations (TypeScript):
```typescript
const paths = [
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } },
];
```

Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
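That traffic shift can be pictured as Thompson Sampling over per-path outcome counts (this doc names Thompson Sampling as the learning algorithm in the feedback-loop section). The sketch below is a generic illustration of the idea, not Kalibr's internal code; the path names and counts are made up:

```python
import random

# Per-path outcome counts feeding Beta posteriors: Beta(successes+1, failures+1).
stats = {
    "gpt-4o + calendar_api":    {"success": 80, "failure": 20},
    "gpt-4o + google_calendar": {"success": 55, "failure": 45},
}

def choose_path(stats):
    # Sample a plausible success rate for each path, then pick the highest.
    draws = {
        path: random.betavariate(s["success"] + 1, s["failure"] + 1)
        for path, s in stats.items()
    }
    return max(draws, key=draws.get)

random.seed(0)
picks = [choose_path(stats) for _ in range(1000)]
# The stronger path wins most draws, but the weaker one still receives
# occasional exploration traffic, which is how later degradation or
# recovery gets noticed.
```

Because the choice is a sample rather than a fixed argmax, exploration never fully stops, and a path whose outcomes improve can win traffic back.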
An outcome is what you report after execution: success or failure, optionally with a continuous quality score.
Python:
```python
# Binary outcome
router.report(success=True)
router.report(success=False, reason="invalid_time")

# Continuous quality score, feeds directly into routing
router.report(success=True, score=0.85)

# Score provides finer signal than binary alone.
# A path scoring 0.85 consistently will be preferred
# over one scoring 0.6, even if both technically "succeed."
```
TypeScript:
```typescript
// Binary outcome
await router.report(true);
await router.report(false, "invalid_time");

// Continuous quality score (success, reason, score)
await router.report(true, undefined, 0.85);
```
Without outcomes, Kalibr can't learn. This is the feedback loop.
What Kalibr tracks per path: success and failure outcomes, continuous quality scores, and trends over time. What Kalibr ignores: everything not needed for that routing decision.
Kalibr needs outcomes to learn, so there is an initial cold-start phase before routing stabilizes. For low-traffic goals this phase can take days. If you are using agent-orchestrated setup, your agent can monitor sample_count and confidence via get_insights() and respond by adding more test traffic or new paths when confidence is low.
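A sketch of that monitoring step. The sample_count and confidence field names follow this doc's terminology, but the payload shape below is an assumption, not Kalibr's documented get_insights() schema:

```python
# Hypothetical insight payload; the real get_insights() schema may differ.
insights = {
    "goal": "book_meeting",
    "paths": [
        {"path": "gpt-4o + calendar_api",
         "sample_count": 12, "confidence": 0.41},
        {"path": "claude-sonnet-4-20250514 + calendar_api",
         "sample_count": 3, "confidence": 0.18},
    ],
}

# Thresholds are illustrative; tune them per goal and traffic level.
MIN_SAMPLES, MIN_CONFIDENCE = 30, 0.6

def needs_exploration(insights):
    # Flag paths whose statistics are still too thin to trust.
    return [
        p["path"] for p in insights["paths"]
        if p["sample_count"] < MIN_SAMPLES or p["confidence"] < MIN_CONFIDENCE
    ]

for path in needs_exploration(insights):
    print(f"low confidence on {path}: schedule extra test traffic")
```

An agent running this check on a schedule can shorten the cold start by routing synthetic or low-stakes traffic to under-sampled paths.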
Kalibr captures execution telemetry and serves it back as structured intelligence. The full loop:
1. Report outcomes: your agent reports success or failure after each task, optionally with a continuous quality score (0-1) and a structured failure_category (timeout, tool_error, hallucination_detected, etc.). The continuous score feeds directly into Thompson Sampling for finer-grained routing.
2. Kalibr learns: Thompson Sampling updates beliefs about which paths work best, using both binary outcomes and continuous quality scores. Trend detection identifies degradation. Rollback monitoring disables failing paths automatically.
3. Query insights: a coding agent calls get_insights() and receives structured diagnostics on which goals are healthy, which are failing, which failure modes dominate, which paths underperform, and which parameters matter.
4. Update outcomes: when real-world signals arrive later (a customer reopens a ticket 48 hours after "resolution"), update_outcome() corrects the record. Every downstream component learns from the correction.
5. Auto-explore: when enabled, Kalibr automatically generates new path configurations by interpolating parameter values (e.g., trying temperature 0.2 if 0.3 was the best value tested). New paths are evaluated through existing exploration traffic.
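Step 4 in sketch form. The in-memory store, the execution-ID argument, and the function bodies below are hypothetical stand-ins for Kalibr's real report()/update_outcome() calls; they only show how a late signal overwrites the earlier record:

```python
# Hypothetical record store; real Kalibr persists outcomes server-side.
outcomes = {}

def report(execution_id, success, **extra):
    # Initial outcome recorded right after execution.
    outcomes[execution_id] = {"success": success, **extra}

def update_outcome(execution_id, success, failure_category=None):
    # A late real-world signal corrects the earlier record, so routing,
    # trend detection, and insights all re-learn from the truth.
    outcomes[execution_id].update(
        {"success": success, "failure_category": failure_category}
    )

report("exec-42", success=True, score=0.9)           # looked resolved
update_outcome("exec-42", success=False,
               failure_category="user_unsatisfied")  # ticket reopened later
```

The key property is that the correction replaces, rather than appends to, the original outcome, so a path that only looks good on immediate signals stops winning traffic once delayed signals arrive.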
The human's role: set goals, define success criteria, own billing, check in occasionally. Everything else is agent-to-agent.
Instead of free-text failure reasons that can't be aggregated, Kalibr supports structured failure categories. These enable clean clustering: "60% of failures for this goal are timeouts" rather than parsing thousands of unique error strings.
```python
from kalibr import FAILURE_CATEGORIES

# timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown

router.report(success=False, failure_category="timeout",
              reason="Provider timed out after 30s")
```

You can add constraints to routing decisions:
Python:
```python
policy = get_policy(
    goal="book_meeting",
    constraints={
        "max_cost_usd": 0.05,
        "max_latency_ms": 2000,
        "min_quality": 0.8
    }
)
```

TypeScript:
```typescript
const policy = await getPolicy("book_meeting", {
  constraints: {
    maxCostUsd: 0.05,
    maxLatencyMs: 2000,
    minQuality: 0.8
  }
});
```

Kalibr will only recommend paths that meet all constraints.