How Routing Works
How Kalibr selects models, evaluates outputs, heals failures, and routes traffic based on real outcomes.
Most agent frameworks pick one model and stick with it. When that model silently degrades — or when the output is structurally wrong but the HTTP response was 200 — nothing catches it. Kalibr does.
Kalibr sits between your agent and the model. It selects which model to call, evaluates whether the output actually succeeded, and reroutes automatically when it doesn't. No alerting. No manual rollback. No human required.
The three things Kalibr does
1. Initial model selection from global priors
Before your tenant has any run history for a goal, Kalibr selects a starting model from a global pool of outcome data — aggregated across all tenants, all task types, weighted by task similarity. This warm-start means your first run routes to a model with a known track record for that goal type, not a coin flip.
As your agent accumulates outcomes, tenant-specific data takes over. The global prior becomes a progressively smaller influence. Your routing reflects your actual workload.
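To make the handoff concrete, here is a minimal sketch of one way a global prior can fade as tenant data accumulates: a count-weighted blend. The function name, the prior_weight pseudo-count, and the numbers are illustrative assumptions, not Kalibr's actual weighting scheme.

```python
# Hypothetical count-weighted blend of a global prior with tenant outcomes.
# With zero tenant runs the estimate equals the global prior; as runs
# accumulate, the prior's influence shrinks toward zero.

def blended_success_rate(
    global_prior: float,     # success rate from the cross-tenant pool
    prior_weight: float,     # pseudo-count strength given to the prior
    tenant_successes: int,   # this tenant's successes for the goal
    tenant_runs: int,        # this tenant's total runs for the goal
) -> float:
    return (global_prior * prior_weight + tenant_successes) / (
        prior_weight + tenant_runs
    )

print(blended_success_rate(0.82, 20, 0, 0))      # day one: 0.82, pure prior
print(blended_success_rate(0.82, 20, 120, 200))  # ~0.62, tenant data dominates
```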
2. Two-gated eval on every output
After each model call, Kalibr runs two evaluation gates before the result is considered complete:
Gate 1 — Structural eval (synchronous, every call). A fast, deterministic check that runs inline with no LLM calls. What it checks depends on goal type:
- code_generation — Python AST parse passes, or TypeScript has function/class structure
- web_scraping — field completeness ≥ 0.8, at least 1 row returned
- classification — returned label is in the allowed set
- summarization — output is not a near-verbatim copy, not empty, not a refusal
- lead_scoring — score is numeric and in [0, 100]
- outreach_generation — subject line and body both present, 50–2,000 chars
- research — at least 200 characters, no error markers in output
- All other goal types — output is non-empty and non-trivial
Gate 1 result feeds directly into report(success=bool). No configuration needed — Kalibr knows the success contract for each goal type.
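To make the contract concrete, here is a rough sketch of what a Gate 1 dispatcher could look like for a few of the goal types above. The function name, signature, and the fallback length threshold are illustrative assumptions, not Kalibr's implementation.

```python
# Illustrative Gate 1 dispatcher: deterministic, synchronous, no LLM calls.
# Per-goal contracts mirror the list above; names are hypothetical.
import ast

def gate1_passes(goal_type: str, output: str, allowed_labels: set | None = None) -> bool:
    if goal_type == "code_generation":
        try:
            ast.parse(output)            # Python AST parse must succeed
            return True
        except SyntaxError:
            return False
    if goal_type == "classification":
        return output.strip() in (allowed_labels or set())
    if goal_type == "lead_scoring":
        try:
            return 0 <= float(output.strip()) <= 100
        except ValueError:
            return False
    # Fallback contract for other goal types: non-empty and non-trivial
    # ("non-trivial" here is an assumed minimum length).
    return len(output.strip()) > 10
```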
Gate 2 — LLM quality judge (async, ~10% sample rate, research and outreach only). For goals where structural correctness isn't enough to measure quality — specifically research and outreach_generation — Kalibr runs a background quality judge on approximately 10% of outputs that passed Gate 1. The judge uses a cheap model (DeepSeek or Llama 3.3 70B, never a premium model) and returns a float score from 0.0 to 1.0. Scores below 0.6 are treated as low quality. This score feeds into report(success=bool, score=float) and gives the router finer discrimination between models that both pass Gate 1 but produce different quality output.
Gate 2 is fire-and-forget. It never blocks the main execution path. There is no LLM call in the routing hot path — ever.
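For shape only, a minimal asyncio sketch of that sampling flow is below. judge_quality is a hypothetical stub standing in for the cheap-model call; the constants come from the description above.

```python
# Sketch of Gate 2's fire-and-forget sampling; not Kalibr internals.
import asyncio
import random

SAMPLE_RATE = 0.10   # ~10% of outputs that passed Gate 1
LOW_QUALITY = 0.60   # scores below this are treated as low quality

async def judge_quality(output: str) -> float:
    """Hypothetical stub for the cheap-model judge (DeepSeek / Llama 3.3 70B)."""
    return 0.85  # placeholder score

async def maybe_judge(goal_type: str, output: str, report) -> None:
    if goal_type not in {"research", "outreach_generation"}:
        return
    if random.random() > SAMPLE_RATE:
        return
    # Scheduled off the hot path and never awaited inline, so the
    # main execution path is never blocked.
    asyncio.create_task(_run_judge(output, report))

async def _run_judge(output: str, report) -> None:
    score = await judge_quality(output)
    report(success=score >= LOW_QUALITY, score=score)
```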
3. Reactive healing on failure
When Gate 1 fails — structurally bad output, wrong format, empty response, provider error — Kalibr records the failure against that model for this goal. On the next call for the same goal, Kalibr routes to the next-best model based on current success rates. No configuration. No threshold to set. It just switches.
This reroute is what the dashboard calls a heal. Every heal is an execution that would have reached your users as a failure, intercepted and redirected automatically. The heal count on your Agents page is the count of those interventions.
Healing catches failures that HTTP status codes miss: a model that returns 200 with malformed JSON, a summarization model that returns a verbatim copy of the input, a code model that returns syntactically invalid Python. Gate 1 catches all of these. The provider never flagged them as errors.
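As a toy illustration, the reroute decision could be expressed like this, assuming a simple per-goal success-rate table (the data structure and names are illustrative):

```python
# Hypothetical reroute after a Gate 1 failure: record the failure,
# then route the next call to the current best model by success rate.

def record_failure_and_reroute(stats: dict, failed_model: str) -> str:
    """stats maps model -> {"successes": int, "runs": int} for one goal."""
    stats[failed_model]["runs"] += 1     # failure: runs up, successes flat

    def rate(model: str) -> float:
        s = stats[model]
        return s["successes"] / s["runs"] if s["runs"] else 0.0

    return max(stats, key=rate)          # this switch is what shows as a heal

stats = {
    "model-a": {"successes": 88, "runs": 100},   # currently routed
    "model-b": {"successes": 88, "runs": 100},
}
print(record_failure_and_reroute(stats, "model-a"))  # -> model-b
```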
Scoring signals
Kalibr accepts two types of outcome signals:
- Binary — report(success=True/False). Updates the model's success rate directly. Every structural eval produces this.
- Continuous — report(success=True, score=0.85). The float score gives finer discrimination. A score of 0.85 counts as 0.85 successes and 0.15 failures in the routing model (sketched below). Two models that both pass Gate 1 at 90% will look identical on binary scoring — but if one consistently scores 0.92 and the other 0.61, Kalibr routes to the better one. The LLM quality judge (Gate 2) produces this signal automatically for eligible goal types.
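A minimal sketch of that fractional-success bookkeeping, assuming a simple accumulator (the state shape is illustrative, not Kalibr's internal representation):

```python
# Score 0.85 counts as 0.85 successes and 0.15 failures, per the text.

def update(state: dict, score: float) -> None:
    state["successes"] += score        # e.g. +0.85
    state["failures"] += 1.0 - score   # e.g. +0.15

state = {"successes": 0.0, "failures": 0.0}
for s in (0.92, 0.92, 0.92):
    update(state, s)
rate = state["successes"] / (state["successes"] + state["failures"])
print(round(rate, 2))  # 0.92, clearly separable from a model averaging 0.61
```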
Trend detection and drift
Kalibr compares recent performance against historical baseline to detect drift. A model that was working last week may not be working this week — silent provider regressions happen constantly.
A model's trend can be:
- Improving — Recent success rate significantly above baseline
- Stable — Consistent with baseline
- Degrading — Recent success rate significantly below baseline
When a model is degrading, it loses routing priority. When it recovers, routing gradually returns to it. This works across all modalities — a degrading transcription model gets the same treatment as a degrading text LLM.
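For intuition, a toy rolling-window version is below; the window size and margin are assumptions, and Kalibr's actual statistical test may differ.

```python
# Toy trend detector: compare recent success rate against baseline.

def trend(recent_outcomes: list[bool], baseline_rate: float,
          window: int = 50, margin: float = 0.10) -> str:
    recent = recent_outcomes[-window:]
    if len(recent) < window:
        return "stable"                    # not enough data to call a trend
    recent_rate = sum(recent) / len(recent)
    if recent_rate < baseline_rate - margin:
        return "degrading"                 # loses routing priority
    if recent_rate > baseline_rate + margin:
        return "improving"
    return "stable"

print(trend([True] * 30 + [False] * 20, baseline_rate=0.90))  # degrading
```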
The Trust Invariant
Kalibr optimizes for success first, cost second. Always.
A path with a higher success rate will never lose to one with a lower success rate, even if the lower-performing path is significantly cheaper.
Cost and latency only matter when comparing paths with similar success rates. This ensures you never sacrifice quality for cost savings.
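Expressed as a selection rule, the invariant might look like the sketch below, where the width of the "similar" band is an illustrative assumption:

```python
# Success rate dominates; cost breaks ties only within a similarity band.

SIMILAR = 0.02  # assumed band: success rates within 2 points are "similar"

def pick_path(paths: list[dict]) -> dict:
    """Each path: {"name": str, "success_rate": float, "cost": float}."""
    best = max(p["success_rate"] for p in paths)
    # Only paths similar to the best are eligible, so a clearly more
    # reliable path can never lose to a cheaper one.
    eligible = [p for p in paths if best - p["success_rate"] <= SIMILAR]
    return min(eligible, key=lambda p: p["cost"])

paths = [
    {"name": "premium", "success_rate": 0.95, "cost": 0.018},
    {"name": "budget",  "success_rate": 0.94, "cost": 0.004},  # similar, cheaper
    {"name": "flaky",   "success_rate": 0.80, "cost": 0.001},  # never wins
]
print(pick_path(paths)["name"])  # -> budget
```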
Bypass When Needed
Sometimes you need to override routing:
```python
# Force a specific model
response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)
```

```typescript
// Force a specific model
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});
```

The call is still traced, but routing is bypassed. Use this for:
- Debugging specific model behavior
- Reproducing customer issues
- Load testing a specific provider
Don't use it as your default; you lose the learning benefits.
Cost savings
The trust invariant (see above) guarantees Kalibr never sacrifices reliability for cost. But when two models have similar success rates, Kalibr routes to the cheaper one. Over time this compounds: a model that costs $0.004/call replacing one that costs $0.018/call across thousands of runs is real money. The Cost Saved by Kalibr KPI on your dashboard measures exactly this — the delta between what you spent and what you would have spent routing everything through the most expensive model in your path list.
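A quick worked example of that delta, using illustrative per-call prices:

```python
# Worked example of the Cost Saved KPI as described above: the delta
# between actual spend and routing everything through the priciest path.
runs = 10_000
most_expensive = 0.018   # $/call for the most expensive model in the path list
actual_per_call = 0.004  # what Kalibr actually routed to
saved = runs * (most_expensive - actual_per_call)
print(f"${saved:,.2f}")  # $140.00 per 10k runs
```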
Auto Path Generation [FEATURE FLAG]
Not enabled by default. When the auto_path_generation flag is on, a background job runs hourly and extends the path registry automatically:
- Identifies numeric parameters (temperature, top_p) where the best-performing value is at the boundary of explored space
- Generates new paths with interpolated values, as sketched after this list (e.g., if temperature 0.3 is the best and it's the lowest tested, tries 0.15)
- Paths that underperform (>20 percentage points below best) are automatically disabled after 30+ samples
- Maximum 5 auto-generated paths per goal, maximum 3 new paths per goal per run
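For illustration, the boundary-interpolation step might look like this; the function name and the upper-edge mirror rule are assumptions:

```python
# If the best-performing value sits at the edge of the explored range,
# propose a new value beyond it; interior bests need no extension.

def propose_value(tested: list[float], best: float) -> float | None:
    lo, hi = min(tested), max(tested)
    if best == lo:
        return lo / 2                     # e.g. best 0.3 at the low edge -> 0.15
    if best == hi:
        next_down = max(v for v in tested if v < hi)
        return hi + (hi - next_down) / 2  # step half the top gap upward
    return None                           # best is interior: nothing to extend

print(propose_value([0.3, 0.5, 0.7], best=0.3))  # -> 0.15
```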
Contact us to enable this on your account.
Next
- API Reference - Full Router API including get_policy()
- Production Guide - Graceful degradation, monitoring