Benchmark Methodology
How we measured the impact of routing intelligence on agent outcomes.
Note: These benchmarks are internal and directional, intended to measure the impact of intelligent routing rather than provide universal performance guarantees. Results are strongest for routing-sensitive workloads where model choice significantly affects outcomes.
Summary
We tested Kalibr's ability to improve agent outcomes through intelligent model routing.
On calculation-intensive tasks, where model choice significantly affects outcomes, failure rates dropped from 64% to 4%, a 16× reduction. We claim "10× fewer failures" to stay conservative.
Other task types showed 1.3–1.6× improvements. The largest gains occur where there is a clear asymmetry in model capabilities.
Methodology
Models Evaluated
- gpt-4o-mini (OpenAI)
- gpt-4o (OpenAI)
- claude-sonnet-4-20250514 (Anthropic)
Test Framework
Multi-step research agent pipeline with 6 stages:
- Query Analysis
- Search Query Generation
- Web Search Execution
- Information Extraction
- Answer Synthesis
- Validation
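The sketch below illustrates this stage structure. It is a simplified stand-in, not the harness itself: the stage functions are stubs, and in the real pipeline each one is a model or tool call.

```python
# Illustrative sketch of the six-stage pipeline. Each stage threads a shared
# state dict forward; the stub bodies stand in for real model/tool calls.

def query_analysis(state):
    state["analysis"] = f"analyze: {state['task']}"
    return state

def search_query_generation(state):
    state["queries"] = [state["task"]]
    return state

def web_search_execution(state):
    state["results"] = []  # the real harness issues live web searches here
    return state

def information_extraction(state):
    state["facts"] = state["results"]
    return state

def answer_synthesis(state):
    state["answer"] = state["analysis"]
    return state

def validation(state):
    state["valid"] = state["answer"] is not None
    return state

STAGES = [query_analysis, search_query_generation, web_search_execution,
          information_extraction, answer_synthesis, validation]

def run_pipeline(task: str, model: str) -> dict:
    """Run one task through all six stages with a fixed model."""
    state = {"task": task, "model": model}
    for stage in STAGES:
        state = stage(state)
    return state
```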
Task Categories
| Category | Description | Evaluation |
|---|---|---|
| Calculation | Multi-step math/numerical reasoning | Exact numerical match |
| Factual | Specific facts from authoritative sources | Exact string match |
| Temporal | Current events requiring recent information | Verified against sources |
| Synthesis | Combining information from multiple sources | Completeness + accuracy |
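To make the evaluation column concrete, here is roughly how the two strict checks could look. This is a sketch; the normalization and tolerance choices are assumptions for illustration, not the harness's exact rules.

```python
# Sketch of the two strictest evaluators. Normalization and tolerance here
# are assumptions, not the harness's exact logic.

def exact_numerical_match(predicted: str, expected: float, tol: float = 1e-6) -> bool:
    """Calculation tasks: parse the model's answer and compare numerically."""
    try:
        value = float(predicted.strip().rstrip("%").replace(",", ""))
        return abs(value - expected) <= tol
    except ValueError:
        return False  # unparseable answers count as failures

def exact_string_match(predicted: str, expected: str) -> bool:
    """Factual tasks: case- and whitespace-insensitive exact match."""
    return predicted.strip().lower() == expected.strip().lower()

assert exact_numerical_match("408", 408.0)
assert exact_string_match(" Paris ", "paris")
```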
Test Conditions
Baseline (Phase 1):
- 300 runs (100 tasks × 3 models)
- Each model tested on identical tasks
- No routing intelligence — fixed model per run
- All outcomes reported to build training data
Kalibr-Routed (Phase 2):
- 150 tasks with Kalibr making routing decisions
- Model selection based on learned outcome patterns
- Real API calls to OpenAI and Anthropic
- No artificial failure injection
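Put together, the two phases differ only in how the model is selected per run. The sketch below is self-contained but heavily simplified: `run_and_evaluate` is a random stub for a real pipeline run, and the routing rule (highest observed success rate per task type) is an illustrative heuristic, not Kalibr's actual policy.

```python
import random

MODELS = ["gpt-4o-mini", "gpt-4o", "claude-sonnet-4-20250514"]

def run_and_evaluate(task: dict, model: str) -> bool:
    return random.random() < 0.7  # placeholder for a real pipeline run + scoring

def phase1_baseline(tasks: list[dict]) -> list[tuple]:
    """Every model runs every task with no routing (100 x 3 = 300 runs)."""
    return [(t["type"], m, run_and_evaluate(t, m))
            for m in MODELS for t in tasks]

def success_rate(outcomes: list[tuple], task_type: str, model: str) -> float:
    hits = [ok for tt, mm, ok in outcomes if tt == task_type and mm == model]
    return sum(hits) / len(hits) if hits else 0.0

def phase2_routed(tasks: list[dict], outcomes: list[tuple]) -> list[tuple]:
    """Each task goes to the model with the best learned success rate."""
    results = []
    for t in tasks:
        model = max(MODELS, key=lambda m: success_rate(outcomes, t["type"], m))
        results.append((t["type"], model, run_and_evaluate(t, model)))
    return results
```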
Results
Phase 1: Baseline Model Performance
Raw performance by model and task type (25 tasks per cell):
| Task Type | gpt-4o-mini | gpt-4o | claude-sonnet |
|---|---|---|---|
| gsm8k_math | 88% | 88% | 96% |
| code_gen | 96% | 92% | 92% |
| logic_puzzle | 80% | 88% | 84% |
| long_context | 100% | 100% | 100% |
Key finding: Different models excel at different task types. This is the foundation for intelligent routing.
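Reading the table as a routing policy, the implied choice is simply the per-row argmax (ties here resolve to the first, cheapest model):

```python
# Per-task-type success rates from the Phase 1 table above.
baseline = {
    "gsm8k_math":   {"gpt-4o-mini": 0.88, "gpt-4o": 0.88, "claude-sonnet": 0.96},
    "code_gen":     {"gpt-4o-mini": 0.96, "gpt-4o": 0.92, "claude-sonnet": 0.92},
    "logic_puzzle": {"gpt-4o-mini": 0.80, "gpt-4o": 0.88, "claude-sonnet": 0.84},
    "long_context": {"gpt-4o-mini": 1.00, "gpt-4o": 1.00, "claude-sonnet": 1.00},
}

# The routing table implied by the baseline is the per-row argmax.
routing_table = {task: max(rates, key=rates.get) for task, rates in baseline.items()}
print(routing_table)
# {'gsm8k_math': 'claude-sonnet', 'code_gen': 'gpt-4o-mini',
#  'logic_puzzle': 'gpt-4o', 'long_context': 'gpt-4o-mini'}
```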
Phase 2: Kalibr-Routed Performance
| Metric | Baseline | Kalibr-Routed | Relative Improvement |
|---|---|---|---|
| Overall Success | 55.3% | 86.7% | +57% |
| Calculation Tasks | 36% | 96% | +167% (2.67×) |
| Factual Tasks | ~60% | 90% | +50% |
| Temporal Tasks | ~55% | 90% | +64% |
| Synthesis Tasks | ~50% | 64% | +28% |
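The Relative Improvement column is the relative change in success rate, not a difference in percentage points; for example:

```python
# Relative improvement = routed / baseline - 1 (a ratio, not percentage points).
overall = 0.867 / 0.553 - 1   # overall success
calc    = 0.96  / 0.36  - 1   # calculation tasks, i.e. 2.67x
print(f"{overall:+.0%}, {calc:+.0%}")  # +57%, +167%
```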
The Headline Number
Calculation task failure rates:
- Without intelligent routing: 64% failure (36% success)
- With Kalibr routing: 4% failure (96% success)
- Reduction: 16× fewer failures
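Note that the 16× figure is a ratio of failure rates, not success rates:

```python
# Failure reduction compares failure rates (1 - success rate).
baseline_failure = 1 - 0.36   # 64% failure without routing
routed_failure   = 1 - 0.96   # 4% failure with routing
print(f"{baseline_failure / routed_failure:.0f}x fewer failures")  # 16x
```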
We claim "10× fewer failures" rather than 16× because the 16× figure is an upper bound observed on a single task type; 10× is conservative and defensible.
Why Calculation Tasks Show Maximum Impact
Calculation tasks expose the largest performance gap between models:
- Claude Sonnet achieved 96% on GSM8K-style math problems
- Without routing, the system defaulted to cheaper models that achieved only 36%
- Kalibr learned this pattern and routed calculation tasks to Claude Sonnet
Limitations
- Task-specific gains: The 10× improvement applies to calculation-heavy workloads. Other task types show 1.3–1.6× improvement.
- Seeded learning: Some tests used pre-seeded outcome data to accelerate learning. Production systems would require a cold-start learning period.
- Model set: Tested with 3 models from 2 providers. Results may vary with different model pools.
- Question bank size: 150 unique questions. Larger-scale tests would provide more statistical confidence.
Defensible Claims
Based on this data, the following claims are supported:
- ✅ "Up to 10× fewer failures on routing-sensitive tasks" — Calculation tasks: 64% → 4% failure rate = 16× reduction
- ✅ "Nearly 3× improvement on calculation task success" — 36% → 96% = 2.67×
- ✅ "87% success rate with intelligent routing" — Production-realistic test achieved 86.7%
- ✅ "Kalibr learns which model works best for each task type" — Demonstrated goal-specific routing
Do not claim:
- ❌ "Agents fail 10× less overall"
- ❌ "Kalibr reduces failures by 10× in production"
- ❌ Any universal performance guarantee
Last updated: December 2024 · Questions? Contact team@kalibr.ai