How Routing Works

How Kalibr selects models, evaluates outputs, heals failures, and routes traffic based on real outcomes.

Most agent frameworks pick one model and stick with it. When that model silently degrades — or when the output is structurally wrong but the HTTP response was 200 — nothing catches it. Kalibr does.

Kalibr sits between your agent and the model. It selects which model to call, evaluates whether the output actually succeeded, and reroutes automatically when it doesn't. No alerting. No manual rollback. No human required.

The three things Kalibr does

1. Initial model selection from global priors

Before your tenant has any run history for a goal, Kalibr selects a starting model from a global pool of outcome data — aggregated across all tenants, all task types, weighted by task similarity. This warm-start means your first run routes to a model with a known track record for that goal type, not a coin flip.

As your agent accumulates outcomes, tenant-specific data takes over. The global prior becomes a progressively smaller influence. Your routing reflects your actual workload.
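The fade from global prior to tenant data can be sketched as a pseudo-run blend. This is an illustrative model only — the actual weighting scheme Kalibr uses, and the `prior_weight` value, are assumptions for the sketch:

```python
def blended_success_rate(global_rate: float, tenant_rate: float,
                         tenant_runs: int, prior_weight: int = 20) -> float:
    """Treat the global prior as `prior_weight` pseudo-runs.

    With zero tenant history the blend equals the global prior;
    as tenant_runs grows, tenant data dominates.
    """
    total = tenant_runs + prior_weight
    return (global_rate * prior_weight + tenant_rate * tenant_runs) / total
```

With no tenant history the function returns the global rate unchanged; after a thousand tenant runs the global prior contributes under 2% of the blend.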

2. Two-gated eval on every output

After each model call, Kalibr runs two evaluation gates before the result is considered complete:

Gate 1 — Structural eval (synchronous, every call). A fast, deterministic check that runs inline with no LLM calls. What it checks depends on goal type:

Gate 1 result feeds directly into report(success=bool). No configuration needed — Kalibr knows the success contract for each goal type.
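A minimal sketch of what a deterministic, inline Gate 1 check could look like. The goal-type names and specific checks here are illustrative assumptions, not Kalibr's actual success contract:

```python
import ast
import json

def structural_eval(goal_type: str, output: str) -> bool:
    """Illustrative Gate 1: fast, deterministic, no LLM calls."""
    if not output or not output.strip():
        return False                       # empty response always fails
    if goal_type == "extraction":
        try:
            json.loads(output)             # must be parseable JSON
            return True
        except json.JSONDecodeError:
            return False
    if goal_type == "code_generation":
        try:
            ast.parse(output)              # must be syntactically valid Python
            return True
        except SyntaxError:
            return False
    return True                            # default contract: non-empty output
```

The boolean result is what would feed `report(success=bool)`.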

Gate 2 — LLM quality judge (async, ~10% sample rate, research and outreach only). For goals where structural correctness isn't enough to measure quality — specifically research and outreach_generation — Kalibr runs a background quality judge on approximately 10% of outputs that passed Gate 1. The judge uses a cheap model (DeepSeek or Llama 3.3 70B, never a premium model) and returns a float score from 0.0 to 1.0. Scores below 0.6 are treated as low quality. This score feeds into report(success=bool, score=float) and gives the router finer discrimination between models that both pass Gate 1 but produce different quality output.

Gate 2 is fire-and-forget. It never blocks the main execution path. There is no LLM call in the routing hot path — ever.
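The sampling and threshold logic described above can be sketched as follows. The function names are hypothetical; the 10% sample rate, the 0.6 quality floor, and the judged goal types come from the text:

```python
import random

JUDGED_GOALS = {"research", "outreach_generation"}
SAMPLE_RATE = 0.10
QUALITY_FLOOR = 0.6

def should_judge(goal_type: str, passed_gate1: bool,
                 rng: random.Random) -> bool:
    """Sample ~10% of Gate-1-passing outputs for the async quality judge.
    Only research and outreach_generation goals are ever judged."""
    return (passed_gate1
            and goal_type in JUDGED_GOALS
            and rng.random() < SAMPLE_RATE)

def is_low_quality(score: float) -> bool:
    """Judge scores below 0.6 are treated as low quality."""
    return score < QUALITY_FLOOR
```

Outputs that fail Gate 1 or belong to other goal types never reach the judge, which is why the sampling check runs after the structural result is known.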

3. Reactive healing on failure

When Gate 1 fails — structurally bad output, wrong format, empty response, provider error — Kalibr records the failure against that model for this goal. On the next call for the same goal, Kalibr routes to the next-best model based on current success rates. No configuration. No threshold to set. It just switches.

This reroute is what the dashboard calls a heal. Every heal is an execution that would have reached your users as a failure, intercepted and redirected automatically. The heal count on your Agents page is the count of those interventions.

Healing catches failures that HTTP status codes miss: a model that returns 200 with malformed JSON, a summarization model that returns a verbatim copy of the input, a code model that returns syntactically invalid Python. Gate 1 catches all of these. The provider never flagged them as errors.
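The reroute step above reduces to "record the failure, then pick the model with the best current success rate." A minimal sketch, with an assumed in-memory stats shape rather than Kalibr's internal scoring:

```python
def pick_model(stats: dict[str, dict[str, int]]) -> str:
    """Route to the model with the highest observed success rate for a goal."""
    def rate(s: dict[str, int]) -> float:
        total = s["success"] + s["failure"]
        return s["success"] / total if total else 0.0
    return max(stats, key=lambda m: rate(stats[m]))

def record_failure(stats: dict[str, dict[str, int]], model: str) -> None:
    """A Gate 1 failure counts against this model for this goal."""
    stats[model]["failure"] += 1
```

Once enough failures accumulate against the current best model, `pick_model` returns the next-best one on the following call — that switch is the heal.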

Scoring signals

Kalibr accepts two types of outcome signals:

Trend detection and drift

Kalibr compares recent performance against historical baseline to detect drift. A model that was working last week may not be working this week — silent provider regressions happen constantly.

A model's trend can be:

When a model is degrading, it loses routing priority. When it recovers, routing gradually returns to it. This works across all modalities — a degrading transcription model gets the same treatment as a degrading text LLM.
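One way to express "recent performance vs. historical baseline" as code. The window size and the 10-point threshold here are illustrative assumptions, not Kalibr's actual parameters:

```python
def trend(outcomes: list[bool], window: int = 20) -> str:
    """Compare the recent window against the full-history baseline.

    Returns "degrading", "recovering", or "stable". Requires at least
    two windows of history before judging.
    """
    if len(outcomes) < window * 2:
        return "stable"                    # not enough history to judge
    baseline = sum(outcomes) / len(outcomes)
    recent = sum(outcomes[-window:]) / window
    if recent < baseline - 0.10:
        return "degrading"
    if recent > baseline + 0.10:
        return "recovering"
    return "stable"
```

Because the input is just a list of pass/fail outcomes, the same check applies to any modality — transcription, text, or code.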

The Trust Invariant

Kalibr optimizes for success first, cost second. Always.

A path with higher success rate will never lose to a path with lower success rate, even if the lower-performing path is significantly cheaper.

Cost and latency only matter when comparing paths with similar success rates. This ensures you never sacrifice quality for cost savings.
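The invariant can be stated as a two-step comparator: success rate decides outright unless the paths are within a similarity band, and only then does cost break the tie. The 2-point band below is an assumption for illustration; the document does not define "similar":

```python
SIMILARITY_BAND = 0.02   # assumed: "similar" means within 2 points

def better_path(a: dict, b: dict) -> dict:
    """Success first, always; cost only breaks near-ties."""
    if abs(a["success_rate"] - b["success_rate"]) > SIMILARITY_BAND:
        return a if a["success_rate"] > b["success_rate"] else b
    return a if a["cost"] <= b["cost"] else b
```

A path at 95% success and $0.018/call beats one at 80% and $0.004/call every time; cost only matters once the success rates are effectively indistinguishable.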

Bypass When Needed

Sometimes you need to override routing:

```python
# Force a specific model
response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)
```

```typescript
// Force a specific model
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});
```

The call is still traced, but routing is bypassed. Use this for:

Don't use it as your default; you lose the learning benefits.

Cost savings

The trust invariant (see above) guarantees Kalibr never sacrifices reliability for cost. But when two models have similar success rates, Kalibr routes to the cheaper one. Over time this compounds: a model that costs $0.004/call replacing one that costs $0.018/call across thousands of runs is real money. The Cost Saved by Kalibr KPI on your dashboard measures exactly this — the delta between what you spent and what you would have spent routing everything through the most expensive model in your path list.
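The KPI's definition above reduces to a simple delta. The field names in this sketch are illustrative, not the dashboard's schema:

```python
def cost_saved(calls: list[dict], path_prices: dict[str, float]) -> float:
    """Delta between actual spend and the counterfactual of routing every
    call through the most expensive model in the path list."""
    worst = max(path_prices.values())
    actual = sum(path_prices[c["model"]] for c in calls)
    return worst * len(calls) - actual
```

For 1,000 calls routed to a $0.004 model instead of a $0.018 one, the delta is $14 — small per call, material at volume.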

Auto Path Generation [FEATURE FLAG]

Not enabled by default. When the auto_path_generation flag is on, a background job runs hourly and extends the path registry automatically:

Contact us to enable this on your account.
