Benchmark Methodology

How we measured the impact of routing intelligence on agent outcomes.

Note: These benchmarks are internal and directional, intended to measure the impact of intelligent routing rather than provide universal performance guarantees. Results are strongest for routing-sensitive workloads where model choice significantly affects outcomes.

Summary

We tested Kalibr's ability to improve agent outcomes through intelligent model routing.

10× fewer failures on routing-sensitive tasks

On calculation-intensive tasks where model choice significantly affects outcomes, failure rates dropped from 64% to 4% — a 16× reduction. We claim "10× fewer failures" to be conservative.

Other task types showed 1.3-1.6× improvement. The largest gains occur where there's clear asymmetry in model capabilities.


Methodology

Models Evaluated

  • gpt-4o-mini (OpenAI)
  • gpt-4o (OpenAI)
  • claude-sonnet-4-20250514 (Anthropic)

Test Framework

Multi-step research agent pipeline with 6 stages (a minimal sketch follows the list):

  1. Query Analysis
  2. Search Query Generation
  3. Web Search Execution
  4. Information Extraction
  5. Answer Synthesis
  6. Validation
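
For concreteness, here is a minimal sketch of how such a pipeline can be wired together. The stage functions and run_pipeline helper are hypothetical illustrations, not Kalibr's implementation; stage bodies are stubs.

```python
# Hypothetical sketch of the 6-stage pipeline; stage bodies are stubs.

def analyze_query(ctx): ...            # 1. Query Analysis
def generate_search_queries(ctx): ...  # 2. Search Query Generation
def execute_web_search(ctx): ...       # 3. Web Search Execution
def extract_information(ctx): ...      # 4. Information Extraction
def synthesize_answer(ctx): ...        # 5. Answer Synthesis
def validate(ctx): ...                 # 6. Validation

STAGES = [analyze_query, generate_search_queries, execute_web_search,
          extract_information, synthesize_answer, validate]

def run_pipeline(task):
    """Thread a shared context dict through each stage in order."""
    ctx = {"task": task}
    for stage in STAGES:
        ctx = stage(ctx) or ctx  # a stage may return an updated context
    return ctx
```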

Task Categories

Category     Description                                   Evaluation
Calculation  Multi-step math/numerical reasoning           Exact numerical match
Factual      Specific facts from authoritative sources     Exact string match
Temporal     Current events requiring recent information   Verified against sources
Synthesis    Combining information from multiple sources   Completeness + accuracy
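
As an illustration, the exact-match criteria for the first two categories could be implemented as below; the normalization and tolerance are assumptions, not taken from the benchmark harness.

```python
# Illustrative evaluators for the two exact-match categories above.

def exact_numerical_match(predicted: str, expected: float, tol: float = 1e-6) -> bool:
    """Calculation tasks: parse the answer and compare within a small tolerance."""
    try:
        return abs(float(predicted.strip()) - expected) <= tol
    except ValueError:
        return False  # unparseable output counts as a failure

def exact_string_match(predicted: str, expected: str) -> bool:
    """Factual tasks: case- and whitespace-insensitive exact comparison."""
    return predicted.strip().lower() == expected.strip().lower()
```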

Test Conditions

Baseline (Phase 1):

  • 300 runs (100 tasks × 3 models)
  • Each model tested on identical tasks
  • No routing intelligence — fixed model per run
  • All outcomes reported to build training data

Kalibr-Routed (Phase 2):

  • 150 tasks with Kalibr making routing decisions
  • Model selection based on learned outcome patterns (see the sketch after this list)
  • Real API calls to OpenAI and Anthropic
  • No artificial failure injection
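
A sketch of the two-phase protocol described above. run_task, report, and route are illustrative stand-ins for the real harness and Kalibr API, which this page does not document.

```python
MODELS = ["gpt-4o-mini", "gpt-4o", "claude-sonnet-4-20250514"]

def run_task(task, model): ...         # stub: run the 6-stage pipeline, return success bool
def report(kind, model, success): ...  # stub: record an outcome (one definition is sketched later)
def route(kind): ...                   # stub: pick a model from learned outcomes

def phase1_baseline(tasks):
    """Phase 1: 100 tasks × 3 fixed models = 300 runs; every outcome reported."""
    for task in tasks:
        for model in MODELS:                   # identical tasks, fixed model per run
            success = run_task(task, model)    # real API call, no failure injection
            report(task.kind, model, success)  # builds the outcome training data

def phase2_routed(tasks):
    """Phase 2: 150 tasks; model selection from learned outcome patterns."""
    for task in tasks:
        model = route(task.kind)               # routing decision per task
        success = run_task(task, model)
        report(task.kind, model, success)      # routing keeps learning online
```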

Results

Phase 1: Baseline Model Performance

Raw performance by model and task type (25 tasks per cell):

Task Type     gpt-4o-mini  gpt-4o  claude-sonnet
gsm8k_math    88%          88%     96%
code_gen      96%          92%     92%
logic_puzzle  80%          88%     84%
long_context  100%         100%    100%

Key finding: Different models excel at different task types. This is the foundation for intelligent routing.
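
Concretely, the routing pattern implied by this table is a per-task-type argmax over observed success rates (rates copied from the Phase 1 table):

```python
# Phase 1 success rates, copied from the table above.
baseline = {
    "gsm8k_math":   {"gpt-4o-mini": 0.88, "gpt-4o": 0.88, "claude-sonnet": 0.96},
    "code_gen":     {"gpt-4o-mini": 0.96, "gpt-4o": 0.92, "claude-sonnet": 0.92},
    "logic_puzzle": {"gpt-4o-mini": 0.80, "gpt-4o": 0.88, "claude-sonnet": 0.84},
    "long_context": {"gpt-4o-mini": 1.00, "gpt-4o": 1.00, "claude-sonnet": 1.00},
}

# Best model per task type; ties keep the first entry, so long_context
# falls back to the cheapest listed model.
best = {task: max(scores, key=scores.get) for task, scores in baseline.items()}
# {'gsm8k_math': 'claude-sonnet', 'code_gen': 'gpt-4o-mini',
#  'logic_puzzle': 'gpt-4o', 'long_context': 'gpt-4o-mini'}
```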

Phase 2: Kalibr-Routed Performance

Metric             Baseline  Kalibr-Routed  Improvement
Overall Success    55.3%     86.7%          +57%
Calculation Tasks  36%       96%            +167% (2.67×)
Factual Tasks      ~60%      90%            +50%
Temporal Tasks     ~55%      90%            +64%
Synthesis Tasks    ~50%      64%            +28%

The Headline Number

Calculation task failure rates:

  • Without intelligent routing: 64% failure (36% success)
  • With Kalibr routing: 4% failure (96% success)
  • Reduction: 16× fewer failures

We claim "10× fewer failures" because 16× is the upper bound on a specific task type. 10× is conservative and defensible.
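
Both headline figures come from the same two success rates, measured on different axes; success-side ratios saturate near 100% while failure-side ratios keep growing, which is why both are reported:

```python
baseline_success, routed_success = 0.36, 0.96  # calculation tasks, from the table above

success_improvement = routed_success / baseline_success            # 2.67× more successes
failure_reduction = (1 - baseline_success) / (1 - routed_success)  # 0.64 / 0.04 ≈ 16× fewer failures
```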


Why Calculation Tasks Show Maximum Impact

Calculation tasks expose the largest performance gap between models:

  • Claude Sonnet achieved 96% on GSM8K-style math problems
  • Without routing, the system defaulted to cheaper models that achieved only 36%
  • Kalibr learned this pattern and routed calculation tasks to Claude Sonnet (one way to learn such a pattern is sketched below)
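
One simple way such a pattern can be learned from reported outcomes is a running success rate per (task type, model) pair. This is an illustrative greedy policy, not a description of Kalibr's internals; the neutral prior for unseen pairs is where the cold-start and seeded-learning caveats below come in.

```python
from collections import defaultdict

# Running outcome tallies, keyed by (task_type, model).
wins = defaultdict(int)
runs = defaultdict(int)

def report(task_type: str, model: str, success: bool) -> None:
    """Record one reported outcome."""
    runs[(task_type, model)] += 1
    wins[(task_type, model)] += int(success)

def route(task_type: str, models: list[str]) -> str:
    """Greedy routing: pick the model with the best observed success rate."""
    def score(model: str) -> float:
        n = runs[(task_type, model)]
        return wins[(task_type, model)] / n if n else 0.5  # neutral prior before any data
    return max(models, key=score)
```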

Limitations

  1. Task-specific gains: The 10× improvement applies to calculation-heavy workloads. Other task types show 1.3-1.6× improvement.
  2. Seeded learning: Some tests used pre-seeded outcome data to accelerate learning. Production systems would require a cold-start learning period.
  3. Model set: Tested with 3 models from 2 providers. Results may vary with different model pools.
  4. Question bank size: 150 unique questions. Larger-scale tests would provide more statistical confidence.

Defensible Claims

Based on this data, the following claims are supported:

  • "Up to 10× fewer failures on routing-sensitive tasks" — Calculation tasks: 64% → 4% failure rate = 16× reduction
  • "Nearly 3× improvement on calculation task success" — 36% → 96% = 2.67×
  • "87% success rate with intelligent routing" — Production-realistic test achieved 86.7%
  • "Kalibr learns which model works best for each task type" — Demonstrated goal-specific routing

Do not claim:

  • ❌ "Agents fail 10× less overall"
  • ❌ "Kalibr reduces failures by 10× in production"
  • ❌ Any universal performance guarantee

Last updated: December 2024 · Questions? Contact team@kalibr.ai