Benchmark Methodology
How we measured the impact of routing intelligence on agent outcomes.
Note: These benchmarks are internal and directional, intended to measure the impact of intelligent routing rather than provide universal performance guarantees. Results are strongest for routing-sensitive workloads where model choice significantly affects outcomes.
Summary
We tested Kalibr's ability to improve agent outcomes through intelligent model routing.
On calculation-intensive tasks, where model choice significantly affects outcomes, failure rates dropped from 64% to 4%, a 16× reduction. We claim "10× fewer failures" to stay conservative.
Other task types showed 1.3–1.6× improvements. The largest gains occur where there is a clear asymmetry in model capabilities.
Methodology
Models Evaluated
- gpt-4o-mini (OpenAI)
- gpt-4o (OpenAI)
- claude-sonnet-4-20250514 (Anthropic)
Test Framework
Multi-step research agent pipeline with 6 stages:
- Query Analysis
- Search Query Generation
- Web Search Execution
- Information Extraction
- Answer Synthesis
- Validation
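The sketch below illustrates this stage structure. It is a simplified stand-in, not the harness itself: the stage functions are stubs, and in the real pipeline each one is a model or tool call.

```python
# Illustrative sketch of the six-stage pipeline. Each stage threads a shared
# state dict forward; the stub bodies stand in for real model/tool calls.

def query_analysis(state):
    state["analysis"] = f"analyze: {state['task']}"
    return state

def search_query_generation(state):
    state["queries"] = [state["task"]]
    return state

def web_search_execution(state):
    state["results"] = []  # the real harness issues live web searches here
    return state

def information_extraction(state):
    state["facts"] = state["results"]
    return state

def answer_synthesis(state):
    state["answer"] = state["analysis"]
    return state

def validation(state):
    state["valid"] = state["answer"] is not None
    return state

STAGES = [query_analysis, search_query_generation, web_search_execution,
          information_extraction, answer_synthesis, validation]

def run_pipeline(task: str, model: str) -> dict:
    """Run one task through all six stages with a fixed model."""
    state = {"task": task, "model": model}
    for stage in STAGES:
        state = stage(state)
    return state
```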
Task Categories
| Category | Description | Evaluation |
|---|---|---|
| Calculation | Multi-step math/numerical reasoning | Exact numerical match |
| Factual | Specific facts from authoritative sources | Exact string match |
| Temporal | Current events requiring recent information | Verified against sources |
| Synthesis | Combining information from multiple sources | Completeness + accuracy |
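To make the evaluation column concrete, here is roughly how the two strict checks could look. This is a sketch; the normalization and tolerance choices are assumptions for illustration, not the harness's exact rules.

```python
# Sketch of the two strictest evaluators. Normalization and tolerance here
# are assumptions, not the harness's exact logic.

def exact_numerical_match(predicted: str, expected: float, tol: float = 1e-6) -> bool:
    """Calculation tasks: parse the model's answer and compare numerically."""
    try:
        value = float(predicted.strip().rstrip("%").replace(",", ""))
        return abs(value - expected) <= tol
    except ValueError:
        return False  # unparseable answers count as failures

def exact_string_match(predicted: str, expected: str) -> bool:
    """Factual tasks: case- and whitespace-insensitive exact match."""
    return predicted.strip().lower() == expected.strip().lower()

assert exact_numerical_match("408", 408.0)
assert exact_string_match(" Paris ", "paris")
```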
Test Conditions
Baseline (Phase 1):
- 300 runs (100 tasks × 3 models)
- Each model tested on identical tasks
- No routing intelligence — fixed model per run
- All outcomes reported to build training data
Kalibr-Routed (Phase 2):
- 150 tasks with Kalibr making routing decisions
- Model selection based on learned outcome patterns
- Real API calls to OpenAI and Anthropic
- No artificial failure injection
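Put together, the two phases differ only in how the model is selected per run. The sketch below is self-contained but heavily simplified: `run_and_evaluate` is a random stub for a real pipeline run, and the routing rule (highest observed success rate per task type) is an illustrative heuristic, not Kalibr's actual policy.

```python
import random

MODELS = ["gpt-4o-mini", "gpt-4o", "claude-sonnet-4-20250514"]

def run_and_evaluate(task: dict, model: str) -> bool:
    return random.random() < 0.7  # placeholder for a real pipeline run + scoring

def phase1_baseline(tasks: list[dict]) -> list[tuple]:
    """Every model runs every task with no routing (100 x 3 = 300 runs)."""
    return [(t["type"], m, run_and_evaluate(t, m))
            for m in MODELS for t in tasks]

def success_rate(outcomes: list[tuple], task_type: str, model: str) -> float:
    hits = [ok for tt, mm, ok in outcomes if tt == task_type and mm == model]
    return sum(hits) / len(hits) if hits else 0.0

def phase2_routed(tasks: list[dict], outcomes: list[tuple]) -> list[tuple]:
    """Each task goes to the model with the best learned success rate."""
    results = []
    for t in tasks:
        model = max(MODELS, key=lambda m: success_rate(outcomes, t["type"], m))
        results.append((t["type"], model, run_and_evaluate(t, model)))
    return results
```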
Results
Phase 1: Baseline Model Performance
Raw performance by model and task type (25 tasks per cell):
| Task Type | gpt-4o-mini | gpt-4o | claude-sonnet |
|---|---|---|---|
| gsm8k_math | 88% | 88% | 96% |
| code_gen | 96% | 92% | 92% |
| logic_puzzle | 80% | 88% | 84% |
| long_context | 100% | 100% | 100% |
Key finding: Different models excel at different task types. This is the foundation for intelligent routing.
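Reading the table as a routing policy, the implied choice is simply the per-row argmax (ties here resolve to the first, cheapest model):

```python
# Per-task-type success rates from the Phase 1 table above.
baseline = {
    "gsm8k_math":   {"gpt-4o-mini": 0.88, "gpt-4o": 0.88, "claude-sonnet": 0.96},
    "code_gen":     {"gpt-4o-mini": 0.96, "gpt-4o": 0.92, "claude-sonnet": 0.92},
    "logic_puzzle": {"gpt-4o-mini": 0.80, "gpt-4o": 0.88, "claude-sonnet": 0.84},
    "long_context": {"gpt-4o-mini": 1.00, "gpt-4o": 1.00, "claude-sonnet": 1.00},
}

# The routing table implied by the baseline is the per-row argmax.
routing_table = {task: max(rates, key=rates.get) for task, rates in baseline.items()}
print(routing_table)
# {'gsm8k_math': 'claude-sonnet', 'code_gen': 'gpt-4o-mini',
#  'logic_puzzle': 'gpt-4o', 'long_context': 'gpt-4o-mini'}
```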
Phase 2: Kalibr-Routed Performance
| Metric | Baseline | Kalibr-Routed | Relative Improvement |
|---|---|---|---|
| Overall Success | 55.3% | 86.7% | +57% |
| Calculation Tasks | 36% | 96% | +167% (2.67×) |
| Factual Tasks | ~60% | 90% | +50% |
| Temporal Tasks | ~55% | 90% | +64% |
| Synthesis Tasks | ~50% | 64% | +28% |
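The Relative Improvement column is the relative change in success rate, not a difference in percentage points; for example:

```python
# Relative improvement = routed / baseline - 1 (a ratio, not percentage points).
overall = 0.867 / 0.553 - 1   # overall success
calc    = 0.96  / 0.36  - 1   # calculation tasks, i.e. 2.67x
print(f"{overall:+.0%}, {calc:+.0%}")  # +57%, +167%
```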
The Headline Number
Calculation task failure rates:
- Without intelligent routing: 64% failure (36% success)
- With Kalibr routing: 4% failure (96% success)
- Reduction: 16× fewer failures
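Note that the 16× figure is a ratio of failure rates, not success rates:

```python
# Failure reduction compares failure rates (1 - success rate).
baseline_failure = 1 - 0.36   # 64% failure without routing
routed_failure   = 1 - 0.96   # 4% failure with routing
print(f"{baseline_failure / routed_failure:.0f}x fewer failures")  # 16x
```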
We claim "10× fewer failures" rather than 16× because the 16× figure is an upper bound observed on a single task type; 10× is conservative and defensible.
Why Calculation Tasks Show Maximum Impact
Calculation tasks expose the largest performance gap between models:
- Claude Sonnet achieved 96% on GSM8K-style math problems
- Without routing, the system defaulted to cheaper models that achieved only 36%
- Kalibr learned this pattern and routed calculation tasks to Claude Sonnet
Limitations
- Task-specific gains: The 10× improvement applies to calculation-heavy workloads. Other task types show 1.3–1.6× improvement.
- Seeded learning: Some tests used pre-seeded outcome data to accelerate learning. Production systems would require a cold-start learning period.
- Model set: Tested with 3 models from 2 providers. Results may vary with different model pools.
- Question bank size: 150 unique questions. Larger-scale tests would provide more statistical confidence.
Defensible Claims
Based on this data, the following claims are supported:
- ✅ "Up to 10× fewer failures on routing-sensitive tasks" — Calculation tasks: 64% → 4% failure rate = 16× reduction
- ✅ "Nearly 3× improvement on calculation task success" — 36% → 96% = 2.67×
- ✅ "87% success rate with intelligent routing" — Production-realistic test achieved 86.7%
- ✅ "Kalibr learns which model works best for each task type" — Demonstrated goal-specific routing
Do not claim:
- ❌ "Agents fail 10× less overall"
- ❌ "Kalibr reduces failures by 10× in production"
- ❌ Any universal performance guarantee
Last updated: December 2024 · Questions? Contact team@kalibr.ai