Documentation

Kalibr

Kalibr sits between your code and the LLM providers. Every call goes through a Router that picks the best model, checks the output, retries on failure, and learns from outcomes over time.

What Kalibr does

Kalibr does three things:

1. Routes each call to the model most likely to succeed. The Router uses Thompson Sampling over tracked outcomes to pick which model handles each request. Early on it explores all paths. After 20 to 50 outcomes per path, it converges on what actually works.

2. Catches bad outputs and retries automatically. You define what "success" looks like with a success_when lambda, a score_when lambda, or a manual router.report() call. When an output fails, Kalibr retries with a different model before returning to your code.

3. Learns continuously. Every outcome feeds back into the routing engine. Models that start degrading lose traffic. Models that work well gain it. No configuration changes needed.
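The three behaviours above can be sketched in a few lines of Python. This is a toy illustration, not Kalibr's implementation or API. The class `ThompsonRouter`, its methods, the stub models `model-a` and `model-b`, and their success rates are all hypothetical. The names `success_when` and `report` are borrowed from the text, but the signatures used here are assumptions. The sketch keeps a Beta posterior per model, samples from each posterior to choose a model, checks the output, retries on a different model if the check fails, and records every outcome.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple


class ThompsonRouter:
    """Toy router (hypothetical): one Beta posterior per model."""

    def __init__(self, models: List[str], seed: Optional[int] = None) -> None:
        self.rng = random.Random(seed)
        # [successes, failures] per model. Adding 1 to each gives a
        # uniform Beta(1, 1) prior, so every model is explored early on.
        self.stats: Dict[str, List[int]] = {m: [0, 0] for m in models}

    def pick(self, exclude: Tuple[str, ...] = ()) -> str:
        """Draw a plausible success rate per model; return the highest draw."""
        candidates = [m for m in self.stats if m not in exclude]
        return max(
            candidates,
            key=lambda m: self.rng.betavariate(
                self.stats[m][0] + 1, self.stats[m][1] + 1
            ),
        )

    def report(self, model: str, success: bool) -> None:
        """Feed one outcome back into that model's posterior."""
        self.stats[model][0 if success else 1] += 1

    def complete(
        self,
        call: Callable[[str, str], str],
        prompt: str,
        success_when: Callable[[str], bool],
        max_attempts: int = 2,
    ) -> Tuple[Optional[str], Optional[str]]:
        """Call a model, check the output, retry on a different model on failure."""
        tried: List[str] = []
        output: Optional[str] = None
        for _ in range(min(max_attempts, len(self.stats))):
            model = self.pick(exclude=tuple(tried))
            output = call(model, prompt)
            ok = success_when(output)
            self.report(model, ok)
            if ok:
                return model, output
            tried.append(model)
        # Every attempt failed: return the last output with no winning model.
        return None, output


def simulate(n_requests: int, seed: int = 0) -> Dict[str, List[int]]:
    """Route n_requests between two stub models with made-up success rates."""
    true_rates = {"model-a": 0.9, "model-b": 0.4}  # assumed, for illustration
    world = random.Random(seed + 1)

    def fake_call(model: str, prompt: str) -> str:
        return "ok" if world.random() < true_rates[model] else "bad"

    router = ThompsonRouter(list(true_rates), seed=seed)
    for _ in range(n_requests):
        router.complete(fake_call, "hello", success_when=lambda out: out == "ok")
    return router.stats


if __name__ == "__main__":
    for model, (wins, losses) in simulate(500).items():
        print(model, "calls:", wins + losses, "successes:", wins)
```

In this sketch the weaker stub model still receives some traffic, both from early exploration and from retries after the stronger model fails. If its success rate later improved, its posterior would shift and it would win more draws without any configuration change.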

What Kalibr is not: a logging platform (Langfuse, Arize), a model gateway (LiteLLM, OpenRouter), or a prompt optimizer. By default it does not read or modify prompt content: prompts and model calls go directly to the provider. The one exception is repair_prompt=True: when it is set on the Router, Kalibr rewrites the user prompt on failure as part of the self-healing loop.

Benchmarks

Tested across 360 real LLM calls spanning 6 task types and 3 difficulty tiers:

Overall pass rate: 73% without Kalibr, 87% with Kalibr (+14pp).

Hard/production tasks: 63% without, 87% with (+24pp).

Auto-recovery: 12 failures recovered automatically per benchmark run.

Quality: Kalibr routing to DeepSeek matches GPT-4o quality within 2pp.

Cost: 94% cheaper than GPT-4o, and cheaper than GPT-4o-mini at scale.

Get started

Quickstart (5 minutes): Install the SDK, replace one LLM call, and see Kalibr route your first request. Four steps, no architecture changes.

Developer integration (deep dive): Full setup with framework integrations (LangChain, CrewAI, OpenAI Agents SDK), manual reporting, scoring, and multi-provider configuration.

All documentation

Getting started

Quickstart: Install, replace one call, see routing work

Developer integration: Full setup, frameworks, credentials, verify

Reference

How Kalibr works: Thompson Sampling, failure detection, auto-retry

API reference: Router, completion(), report(), score_when

Framework integrations: LangChain, CrewAI, OpenAI Agents SDK, HuggingFace

Goal taxonomy: 12 goal types, routing table, eval contracts

Production guide: Graceful degradation, monitoring, debugging

Help

FAQ: Common setup issues and questions

Troubleshooting: Common errors and how to fix them