Kalibr sits between your code and the LLM providers. Every call goes through a Router that picks the best model, checks the output, retries on failure, and learns from outcomes over time.
Kalibr does three things:
1. Routes each call to the model most likely to succeed. The Router applies Thompson Sampling to tracked outcomes, so each path's record of successes and failures determines how often it is chosen. Early on it explores all paths. After 20 to 50 outcomes per path, it converges on the paths that actually work.
2. Catches bad outputs and retries automatically. You define what "success" looks like with a success_when lambda, a score_when lambda, or a manual router.report() call. When an output fails, Kalibr retries with a different model before returning to your code.
3. Learns continuously. Every outcome feeds back into the routing engine. Models that start degrading lose traffic. Models that work well gain it. No configuration changes needed.
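To make the three steps concrete, here is a minimal, self-contained sketch of Thompson Sampling routing with validate-and-retry. It is an illustration of the technique, not Kalibr's implementation or API: the SketchRouter class, its pick() and completion() methods, and the two fake models are all hypothetical. Only the names success_when and report are borrowed from the description above, and their signatures here are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class PathStats:
    """Outcome counts for one path (here, one model)."""
    successes: int = 0
    failures: int = 0


class SketchRouter:
    """Hypothetical router for illustration; not Kalibr's actual class."""

    def __init__(self, paths: Dict[str, Callable[[str], str]],
                 seed: Optional[int] = None) -> None:
        self.paths = paths
        self.stats = {name: PathStats() for name in paths}
        self.rng = random.Random(seed)

    def pick(self, exclude: List[str]) -> str:
        """Thompson Sampling: draw a success rate from each path's
        Beta(successes + 1, failures + 1) posterior, take the highest draw."""
        candidates = [name for name in self.paths if name not in exclude]

        def draw(name: str) -> float:
            s = self.stats[name]
            return self.rng.betavariate(s.successes + 1, s.failures + 1)

        return max(candidates, key=draw)

    def report(self, name: str, success: bool) -> None:
        """Feed one outcome back into the path's statistics."""
        if success:
            self.stats[name].successes += 1
        else:
            self.stats[name].failures += 1

    def completion(self, prompt: str, success_when: Callable[[str], bool],
                   max_attempts: int = 3) -> Tuple[Optional[str], List[str]]:
        """Call a sampled path, check the output, and retry on a different
        path if the check fails. Returns (output or None, paths tried)."""
        tried: List[str] = []
        for _ in range(min(max_attempts, len(self.paths))):
            name = self.pick(exclude=tried)
            tried.append(name)
            output = self.paths[name](prompt)
            ok = success_when(output)
            self.report(name, ok)
            if ok:
                return output, tried
        return None, tried


# Two fake "models" standing in for real provider calls.
def flaky_model(prompt: str) -> str:
    return ""  # always produces an empty (bad) output


def solid_model(prompt: str) -> str:
    return f"answer to: {prompt}"


router = SketchRouter({"flaky": flaky_model, "solid": solid_model}, seed=0)
for i in range(200):
    router.completion(f"question {i}", success_when=lambda out: len(out) > 0)
print(router.stats)  # the flaky path should have received far fewer calls
```

Because each failure lowers a path's sampled success rate, the flaky path is picked less and less often without any configuration change, while the retry loop keeps bad outputs from reaching the caller in the meantime.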
Benchmark results, from 360 real LLM calls spanning 6 task types and 3 difficulty tiers:
Overall pass rate: 73% without Kalibr, 87% with Kalibr (+14pp).
Hard/production tasks: 63% without, 87% with (+24pp).
Auto-recovery: 12 failures recovered per benchmark run.
Quality: Kalibr routing to DeepSeek matches GPT-4o quality within 2pp.
Cost: 94% cheaper than GPT-4o, and cheaper than GPT-4o-mini at scale.