How Routing Works

Statistical methods, exploration, and the trust invariant.

Kalibr is different. You report outcomes (success or failure) and Kalibr learns from them. This creates a feedback loop that other routers don't have.

This is why Kalibr can detect problems that other routers miss - semantic failures that still return HTTP 200.

Statistical Foundation

Kalibr uses Thompson Sampling for routing decisions and Wilson score intervals for confidence estimation. These are well-established algorithms for balancing exploration vs exploitation.

Binary and Continuous Signals

When you report a binary outcome (success=True/False), Kalibr updates its beliefs with +1 success or +1 failure. When you include a continuous score (score=0.85), Kalibr uses it as a fractional signal: a score of 0.85 counts as 0.85 successes and 0.15 failures. This gives Thompson Sampling much finer discrimination between paths.

For example, suppose two paths both "succeed" 90% of the time. With binary scoring, they look identical. But if Path A consistently scores 0.92 and Path B scores 0.61, Kalibr routes to Path A because it produces higher-quality output, even though both technically pass.
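The fractional update described above can be sketched with a Beta-style belief per path. This is a minimal illustration, not Kalibr's internals: the PathBelief class, its uniform Beta(1, 1) starting point, and the 50-call loop are all assumptions made for the example, and every call is treated as a success for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathBelief:
    """Hypothetical belief about one path, starting from a uniform Beta(1, 1) prior."""
    alpha: float = 1.0  # pseudo-count of successes
    beta: float = 1.0   # pseudo-count of failures

    def update(self, success: bool, score: Optional[float] = None) -> None:
        # With a score, credit it fractionally; otherwise fall back to a full +1.
        credit = score if score is not None else (1.0 if success else 0.0)
        self.alpha += credit
        self.beta += 1.0 - credit

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


# Scored reporting: both paths "pass", but with different quality.
path_a, path_b = PathBelief(), PathBelief()
# Binary reporting: the same calls, reported as success=True only.
binary_a, binary_b = PathBelief(), PathBelief()

for _ in range(50):
    path_a.update(success=True, score=0.92)
    path_b.update(success=True, score=0.61)
    binary_a.update(success=True)
    binary_b.update(success=True)

print(f"scored: A={path_a.mean:.3f}  B={path_b.mean:.3f}")
print(f"binary: A={binary_a.mean:.3f}  B={binary_b.mean:.3f}")
```

With scores, the two beliefs separate clearly; with binary outcomes alone, they stay identical.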

Why this approach?

You don't need to understand the math. The short version: Kalibr tries paths proportionally to how likely they are to be best, based on evidence so far.
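For readers who do want to see the idea, here is a rough sketch of Thompson Sampling over Beta beliefs. The thompson_pick function, the path names, and the pseudo-counts are hypothetical; the sketch only illustrates "try paths in proportion to how likely they are to be best" and is not taken from Kalibr's code.

```python
import random
from collections import Counter


def thompson_pick(beliefs: dict[str, tuple[float, float]], rng: random.Random) -> str:
    """Draw one plausible success rate per path and pick the highest draw."""
    draws = {name: rng.betavariate(a, b) for name, (a, b) in beliefs.items()}
    return max(draws, key=draws.get)


# Hypothetical (alpha, beta) pseudo-counts for three paths.
beliefs = {
    "path_a": (47.0, 5.0),    # lots of evidence, high success
    "path_b": (31.5, 20.5),   # lots of evidence, mediocre success
    "path_c": (2.0, 1.0),     # almost no evidence yet
}

rng = random.Random(0)  # seeded so the run is repeatable
picks = Counter(thompson_pick(beliefs, rng) for _ in range(10_000))

for name in beliefs:
    print(f"{name}: chosen {picks[name] / 10_000:.1%} of the time")
```

The well-tested strong path wins most draws, the barely tested path still gets a meaningful share because its true rate is uncertain, and the well-tested weak path is almost never chosen.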

Confidence and Sample Size

Kalibr is conservative with small samples. A path with 5 successes out of 5 attempts isn't trusted more than a path with 80 out of 100.

This matters because:

As sample size grows, confidence grows - but Kalibr never stops exploring entirely.
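To make the small-sample caution concrete, the sketch below computes a standard Wilson score interval for the two cases mentioned above. The wilson_interval helper and the 95% level (z = 1.96) are assumptions for illustration; the source does not say which confidence level Kalibr uses.

```python
import math


def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if attempts == 0:
        return (0.0, 1.0)  # no evidence: anything is possible
    p = successes / attempts
    denom = 1 + z * z / attempts
    center = (p + z * z / (2 * attempts)) / denom
    margin = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts**2)) / denom
    return (center - margin, center + margin)


small = wilson_interval(5, 5)      # perfect record, tiny sample
large = wilson_interval(80, 100)   # lower raw rate, far more evidence

print(f"5/5    lower bound: {small[0]:.2f}")   # about 0.57
print(f"80/100 lower bound: {large[0]:.2f}")   # about 0.71
```

Ranked by lower bound, the 80-out-of-100 path comes out ahead of the 5-out-of-5 path, which matches the behavior described above.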

Exploration vs Exploitation

Cold start: When a goal is new, Kalibr explores randomly until it has enough data to make informed decisions.

Steady state: After sufficient data, Kalibr mostly exploits the best-performing path while continuing to test alternatives. This lets it detect when conditions change.

You can adjust the exploration rate:

python
router = Router(
    goal="extract_company",
    paths=["gpt-4o", "claude-sonnet-4-20250514"],
    exploration_rate=0.05  # Lower = more exploitation
)

typescript
const router = new Router({
  goal: 'extract_company',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
  explorationRate: 0.05,  // Lower = more exploitation
});

Lower exploration = more consistent, slower to adapt
Higher exploration = more variance, faster to detect changes

For high-stakes production tasks, use lower exploration. For experimental features, use higher.
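One way to picture the trade-off is a simple rule that explores with probability equal to the exploration rate and otherwise exploits the current best path. This is only an illustrative reading: the source does not say how Kalibr combines the exploration rate with Thompson Sampling, and the choose_path function and the success rates below are hypothetical.

```python
import random


def choose_path(success_rates: dict[str, float], exploration_rate: float,
                rng: random.Random) -> str:
    """Explore a random path with probability exploration_rate, else exploit the best."""
    if rng.random() < exploration_rate:
        return rng.choice(sorted(success_rates))      # explore: any registered path
    return max(success_rates, key=success_rates.get)  # exploit: current best estimate


# Hypothetical success-rate estimates for the two paths in the example above.
rates = {"gpt-4o": 0.93, "claude-sonnet-4-20250514": 0.88}

rng = random.Random(1)
for rate in (0.05, 0.30):
    picks = [choose_path(rates, rate, rng) for _ in range(10_000)]
    share = picks.count("gpt-4o") / len(picks)
    print(f"exploration_rate={rate}: best path used {share:.1%} of the time")
```

A lower rate keeps almost all traffic on the current best path; a higher rate sends more traffic to alternatives, so a change in their performance shows up sooner.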

Trend Detection

Kalibr compares a path's recent performance against its historical baseline to detect drift.

A path can be:

This catches silent model regressions. When a provider pushes a bad update, Kalibr notices and routes away - often before you'd notice manually. This works across any modality: a degrading transcription model gets the same treatment as a degrading text LLM.
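The comparison of recent performance against a baseline can be sketched with a sliding window. The DriftMonitor class, the 50-call window, and the 10-point drop threshold are assumptions chosen for the example; the source does not give Kalibr's actual window sizes or thresholds.

```python
from collections import deque
from typing import Optional


class DriftMonitor:
    """Hypothetical drift check: recent window vs. everything older than the window."""

    def __init__(self, window: int = 50, drop_threshold: float = 0.10):
        self.recent = deque(maxlen=window)
        self.baseline_successes = 0
        self.baseline_total = 0
        self.drop_threshold = drop_threshold

    def record(self, success: bool) -> None:
        # When the window is full, the oldest outcome graduates into the baseline.
        if len(self.recent) == self.recent.maxlen:
            self.baseline_successes += self.recent.popleft()
            self.baseline_total += 1
        self.recent.append(1 if success else 0)

    def delta(self) -> Optional[float]:
        """Recent success rate minus baseline rate, or None without a baseline."""
        if self.baseline_total == 0:
            return None
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate - self.baseline_successes / self.baseline_total

    def is_degrading(self) -> bool:
        d = self.delta()
        return d is not None and d <= -self.drop_threshold


monitor = DriftMonitor()
for i in range(200):               # healthy period: one failure in every 20 calls
    monitor.record(i % 20 != 19)
flag_before = monitor.is_degrading()

for i in range(50):                # simulated bad provider update: half the calls fail
    monitor.record(i % 2 == 0)
flag_after = monitor.is_degrading()

print(flag_before, flag_after)
```

The same check applies to any modality, since it only looks at reported outcomes.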

The Trust Invariant

Kalibr optimizes for success first, cost second. Always.

A path with a higher success rate will never lose to a path with a lower success rate, even if the lower-performing path is significantly cheaper.

Cost and latency only matter when comparing paths with similar success rates. This ensures you never sacrifice quality for cost savings.
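The ordering can be sketched as a two-stage comparison: filter by success rate first, then break ties on cost and latency. The PathStats type, the pick_path function, the sample numbers, and especially the tolerance used to decide what counts as "similar" are assumptions; set the tolerance to 0 for a strict success-rate ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    name: str
    success_rate: float
    cost_per_call: float  # hypothetical unit: USD
    latency_ms: float


def pick_path(paths: list[PathStats], tolerance: float = 0.01) -> PathStats:
    """Success first; cost and latency only break ties among similar success rates."""
    best_rate = max(p.success_rate for p in paths)
    contenders = [p for p in paths if best_rate - p.success_rate <= tolerance]
    return min(contenders, key=lambda p: (p.cost_per_call, p.latency_ms))


premium = PathStats("premium", success_rate=0.950, cost_per_call=0.010, latency_ms=900)
budget = PathStats("budget", success_rate=0.800, cost_per_call=0.001, latency_ms=300)
rival = PathStats("rival", success_rate=0.945, cost_per_call=0.004, latency_ms=700)

print(pick_path([premium, budget]).name)         # premium: budget is cheaper but worse
print(pick_path([premium, budget, rival]).name)  # rival: similar success, lower cost
```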

Bypass When Needed

Sometimes you need to override routing:

python
# Force a specific model
response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)

typescript
// Force a specific model
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});

The call is still traced, but routing is bypassed. Use this for:

Don't use it as your default - you lose the learning benefits.

Auto Path Generation

The path registry is normally static: paths must be explicitly registered by your code. When the auto_path_generation feature flag is enabled, a background job runs hourly and extends the exploration space automatically.

This means the system can discover that temperature 0.2 outperforms 0.3 without anyone explicitly registering that path.
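As a rough picture of how an exploration space might be extended, the sketch below proposes neighboring temperature values for paths that are already registered. Everything here is hypothetical: the function names, the (model, temperature) representation, the 0.1 step, and the 0.0 to 2.0 range are assumptions, and the source does not describe what the background job actually generates.

```python
def neighbor_temperatures(temperature: float, step: float = 0.1,
                          low: float = 0.0, high: float = 2.0) -> list[float]:
    """Temperatures one step below and above, rounded and kept within [low, high]."""
    candidates = {round(temperature - step, 2), round(temperature + step, 2)}
    return sorted(t for t in candidates if low <= t <= high)


def propose_paths(registered: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Suggest (model, temperature) variants that are not registered yet."""
    known = set(registered)
    proposals = []
    for model, temperature in registered:
        for t in neighbor_temperatures(temperature):
            candidate = (model, t)
            if candidate not in known:
                known.add(candidate)  # avoid proposing the same variant twice
                proposals.append(candidate)
    return proposals


print(propose_paths([("gpt-4o", 0.3)]))
```

Once such variants exist as paths, they compete for traffic like any other path, and the better-performing setting wins on reported outcomes.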
