How Routing Works
How Kalibr selects models, evaluates outputs, heals failures, and routes traffic based on real outcomes.
Most agent frameworks pick one model and stick with it. When that model silently degrades — or when the output is structurally wrong but the HTTP response was 200 — nothing catches it. Kalibr does.
Kalibr sits between your agent and the model. It selects which model to call, evaluates whether the output actually succeeded, and reroutes automatically when it doesn't. No alerting. No manual rollback. No human required.
The three things Kalibr does
1. Initial model selection from global priors
Before your tenant has any run history for a goal, Kalibr selects a starting model from a global pool of outcome data — aggregated across all tenants, all task types, weighted by task similarity. This warm-start means your first run routes to a model with a known track record for that goal type, not a coin flip.
As your agent accumulates outcomes, tenant-specific data takes over. The global prior becomes a progressively smaller influence. Your routing reflects your actual workload.
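To make the handoff concrete, here is a minimal sketch of one way a global prior can fade as tenant data accumulates: a count-weighted blend. The function name, the prior_weight pseudo-count, and the numbers are illustrative assumptions, not Kalibr's actual weighting scheme.

```python
# Hypothetical count-weighted blend of a global prior with tenant outcomes.
# With zero tenant runs the estimate equals the global prior; as runs
# accumulate, the prior's influence shrinks toward zero.

def blended_success_rate(
    global_prior: float,     # success rate from the cross-tenant pool
    prior_weight: float,     # pseudo-count strength given to the prior
    tenant_successes: int,   # this tenant's successes for the goal
    tenant_runs: int,        # this tenant's total runs for the goal
) -> float:
    return (global_prior * prior_weight + tenant_successes) / (
        prior_weight + tenant_runs
    )

print(blended_success_rate(0.82, 20, 0, 0))      # day one: 0.82, pure prior
print(blended_success_rate(0.82, 20, 120, 200))  # ~0.62, tenant data dominates
```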
2. Two-gated eval on every output
After each model call, Kalibr runs two evaluation gates before the result is considered complete:
Gate 1 — Structural eval (synchronous, every call). A fast, deterministic check that runs inline with no LLM calls. What it checks depends on goal type:
- code_generation — Python AST parse passes, or TypeScript has function/class structure
- web_scraping — field completeness ≥ 0.8, at least 1 row returned
- classification — returned label is in the allowed set
- summarization — output is not a near-verbatim copy, not empty, not a refusal
- lead_scoring — score is numeric and in [0, 100]
- outreach_generation — subject line and body both present, 50–2,000 chars
- research — at least 200 characters, no error markers in output
- All other goal types — output is non-empty and non-trivial
Gate 1 result feeds directly into report(success=bool). No configuration needed — Kalibr knows the success contract for each goal type.
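To make the contract concrete, here is a rough sketch of what a Gate 1 dispatcher could look like for a few of the goal types above. The function name, signature, and the fallback length threshold are illustrative assumptions, not Kalibr's implementation.

```python
# Illustrative Gate 1 dispatcher: deterministic, synchronous, no LLM calls.
# Per-goal contracts mirror the list above; names are hypothetical.
import ast

def gate1_passes(goal_type: str, output: str, allowed_labels: set | None = None) -> bool:
    if goal_type == "code_generation":
        try:
            ast.parse(output)            # Python AST parse must succeed
            return True
        except SyntaxError:
            return False
    if goal_type == "classification":
        return output.strip() in (allowed_labels or set())
    if goal_type == "lead_scoring":
        try:
            return 0 <= float(output.strip()) <= 100
        except ValueError:
            return False
    # Fallback contract for other goal types: non-empty and non-trivial
    # ("non-trivial" here is an assumed minimum length).
    return len(output.strip()) > 10
```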
Gate 2 — LLM quality judge (async, ~10% sample rate, research and outreach only). For goals where structural correctness isn't enough to measure quality — specifically research and outreach_generation — Kalibr runs a background quality judge on approximately 10% of outputs that passed Gate 1. The judge uses a cheap model (DeepSeek or Llama 3.3 70B, never a premium model) and returns a float score from 0.0 to 1.0. Scores below 0.6 are treated as low quality. This score feeds into report(success=bool, score=float) and gives the router finer discrimination between models that both pass Gate 1 but produce different quality output.
Gate 2 is fire-and-forget. It never blocks the main execution path. There is no LLM call in the routing hot path — ever.
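For shape only, a minimal asyncio sketch of that sampling flow is below. judge_quality is a hypothetical stub standing in for the cheap-model call; the constants come from the description above.

```python
# Sketch of Gate 2's fire-and-forget sampling; not Kalibr internals.
import asyncio
import random

SAMPLE_RATE = 0.10   # ~10% of outputs that passed Gate 1
LOW_QUALITY = 0.60   # scores below this are treated as low quality

async def judge_quality(output: str) -> float:
    """Hypothetical stub for the cheap-model judge (DeepSeek / Llama 3.3 70B)."""
    return 0.85  # placeholder score

async def maybe_judge(goal_type: str, output: str, report) -> None:
    if goal_type not in {"research", "outreach_generation"}:
        return
    if random.random() > SAMPLE_RATE:
        return
    # Scheduled off the hot path and never awaited inline, so the
    # main execution path is never blocked.
    asyncio.create_task(_run_judge(output, report))

async def _run_judge(output: str, report) -> None:
    score = await judge_quality(output)
    report(success=score >= LOW_QUALITY, score=score)
```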
3. Reactive healing on failure
When Gate 1 fails — structurally bad output, wrong format, empty response, provider error — Kalibr records the failure against that model for this goal. On the next call for the same goal, Kalibr routes to the next-best model based on current success rates. No configuration. No threshold to set. It just switches.
This reroute is what the dashboard calls a heal. Every heal is an execution that would have reached your users as a failure, intercepted and redirected automatically. The heal count on your Agents page is the count of those interventions.
Healing catches failures that HTTP status codes miss: a model that returns 200 with malformed JSON, a summarization model that returns a verbatim copy of the input, a code model that returns syntactically invalid Python. Gate 1 catches all of these. The provider never flagged them as errors.
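As a toy illustration, the reroute decision could be expressed like this, assuming a simple per-goal success-rate table (the data structure and names are illustrative):

```python
# Hypothetical reroute after a Gate 1 failure: record the failure,
# then route the next call to the current best model by success rate.

def record_failure_and_reroute(stats: dict, failed_model: str) -> str:
    """stats maps model -> {"successes": int, "runs": int} for one goal."""
    stats[failed_model]["runs"] += 1     # failure: runs up, successes flat

    def rate(model: str) -> float:
        s = stats[model]
        return s["successes"] / s["runs"] if s["runs"] else 0.0

    return max(stats, key=rate)          # this switch is what shows as a heal

stats = {
    "model-a": {"successes": 88, "runs": 100},   # currently routed
    "model-b": {"successes": 88, "runs": 100},
}
print(record_failure_and_reroute(stats, "model-a"))  # -> model-b
```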
Scoring signals
Kalibr accepts two types of outcome signals:
- Binary — report(success=True/False). Updates the model's success rate directly. Every structural eval produces this.
- Continuous — report(success=True, score=0.85). The float score gives finer discrimination. A score of 0.85 counts as 0.85 successes and 0.15 failures in the routing model (sketched below). Two models that both pass Gate 1 at 90% will look identical on binary scoring — but if one consistently scores 0.92 and the other 0.61, Kalibr routes to the better one. The LLM quality judge (Gate 2) produces this signal automatically for eligible goal types.
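A minimal sketch of that fractional-success bookkeeping, assuming a simple accumulator (the state shape is illustrative, not Kalibr's internal representation):

```python
# Score 0.85 counts as 0.85 successes and 0.15 failures, per the text.

def update(state: dict, score: float) -> None:
    state["successes"] += score        # e.g. +0.85
    state["failures"] += 1.0 - score   # e.g. +0.15

state = {"successes": 0.0, "failures": 0.0}
for s in (0.92, 0.92, 0.92):
    update(state, s)
rate = state["successes"] / (state["successes"] + state["failures"])
print(round(rate, 2))  # 0.92, clearly separable from a model averaging 0.61
```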
Trend detection and drift
Kalibr compares recent performance against historical baseline to detect drift. A model that was working last week may not be working this week — silent provider regressions happen constantly.
A model's trend can be:
- Improving — Recent success rate significantly above baseline
- Stable — Consistent with baseline
- Degrading — Recent success rate significantly below baseline
When a model is degrading, it loses routing priority. When it recovers, routing gradually returns to it. This works across all modalities — a degrading transcription model gets the same treatment as a degrading text LLM.
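For intuition, a toy rolling-window version is below; the window size and margin are assumptions, and Kalibr's actual statistical test may differ.

```python
# Toy trend detector: compare recent success rate against baseline.

def trend(recent_outcomes: list[bool], baseline_rate: float,
          window: int = 50, margin: float = 0.10) -> str:
    recent = recent_outcomes[-window:]
    if len(recent) < window:
        return "stable"                    # not enough data to call a trend
    recent_rate = sum(recent) / len(recent)
    if recent_rate < baseline_rate - margin:
        return "degrading"                 # loses routing priority
    if recent_rate > baseline_rate + margin:
        return "improving"
    return "stable"

print(trend([True] * 30 + [False] * 20, baseline_rate=0.90))  # degrading
```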
The Trust Invariant
Kalibr optimizes for success first, cost second. Always.
A path with a higher success rate will never lose to one with a lower success rate, even if the lower-performing path is significantly cheaper.
Cost and latency only matter when comparing paths with similar success rates. This ensures you never sacrifice quality for cost savings.
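Expressed as a selection rule, the invariant might look like the sketch below, where the width of the "similar" band is an illustrative assumption:

```python
# Success rate dominates; cost breaks ties only within a similarity band.

SIMILAR = 0.02  # assumed band: success rates within 2 points are "similar"

def pick_path(paths: list[dict]) -> dict:
    """Each path: {"name": str, "success_rate": float, "cost": float}."""
    best = max(p["success_rate"] for p in paths)
    # Only paths similar to the best are eligible, so a clearly more
    # reliable path can never lose to a cheaper one.
    eligible = [p for p in paths if best - p["success_rate"] <= SIMILAR]
    return min(eligible, key=lambda p: p["cost"])

paths = [
    {"name": "premium", "success_rate": 0.95, "cost": 0.018},
    {"name": "budget",  "success_rate": 0.94, "cost": 0.004},  # similar, cheaper
    {"name": "flaky",   "success_rate": 0.80, "cost": 0.001},  # never wins
]
print(pick_path(paths)["name"])  # -> budget
```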
Bypass When Needed
Sometimes you need to override routing:
```python
# Force a specific model
response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)
```

```typescript
// Force a specific model
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});
```

The call is still traced, but routing is bypassed. Use this for:
- Debugging specific model behavior
- Reproducing customer issues
- Load testing a specific provider
Don't use it as your default; you lose the learning benefits.
Cost savings
The trust invariant (see above) guarantees Kalibr never sacrifices reliability for cost. But when two models have similar success rates, Kalibr routes to the cheaper one. Over time this compounds: a model that costs $0.004/call replacing one that costs $0.018/call across thousands of runs is real money. The Cost Saved by Kalibr KPI on your dashboard measures exactly this — the delta between what you spent and what you would have spent routing everything through the most expensive model in your path list.
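A quick worked example of that delta, using illustrative per-call prices:

```python
# Worked example of the Cost Saved KPI as described above: the delta
# between actual spend and routing everything through the priciest path.
runs = 10_000
most_expensive = 0.018   # $/call for the most expensive model in the path list
actual_per_call = 0.004  # what Kalibr actually routed to
saved = runs * (most_expensive - actual_per_call)
print(f"${saved:,.2f}")  # $140.00 per 10k runs
```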
Auto Path Generation [FEATURE FLAG]
Not enabled by default. When the auto_path_generation flag is on, a background job runs hourly and extends the path registry automatically:
- Identifies numeric parameters (temperature, top_p) where the best-performing value is at the boundary of explored space
- Generates new paths with interpolated values, as sketched after this list (e.g., if temperature 0.3 is the best and it's the lowest tested, tries 0.15)
- Paths that underperform (>20 percentage points below best) are automatically disabled after 30+ samples
- Maximum 5 auto-generated paths per goal, maximum 3 new paths per goal per run
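For illustration, the boundary-interpolation step might look like this; the function name and the upper-edge mirror rule are assumptions:

```python
# If the best-performing value sits at the edge of the explored range,
# propose a new value beyond it; interior bests need no extension.

def propose_value(tested: list[float], best: float) -> float | None:
    lo, hi = min(tested), max(tested)
    if best == lo:
        return lo / 2                     # e.g. best 0.3 at the low edge -> 0.15
    if best == hi:
        next_down = max(v for v in tested if v < hi)
        return hi + (hi - next_down) / 2  # step half the top gap upward
    return None                           # best is interior: nothing to extend

print(propose_value([0.3, 0.5, 0.7], best=0.3))  # -> 0.15
```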
Contact us to enable this on your account.
Next
- API Reference - Full Router API including get_policy()
- Production Guide - Graceful degradation, monitoring