Resilience Benchmark

When your best execution path degrades, Kalibr routes around it automatically. Hardcoded systems keep failing until a human intervenes.

The Question

What happens when an agent's execution path starts failing in production?

This benchmark compares those two behaviors, adaptive routing versus a fixed execution path, under identical conditions.

What Makes This Different

This is execution path routing, not model routing.

Each path is a complete execution strategy combining model and tool:

| Path ID | Model | Tool | Description |
| --- | --- | --- | --- |
| gpt4o-serper | gpt-4o | Serper | Primary path (hardcoded baseline) |
| gpt4o-tavily | gpt-4o | Tavily | Backup tool |
| gpt4o-mini-tavily | gpt-4o-mini | Tavily | Cost-optimized backup |

The paths differ by tool, not just model. This reflects how real agents are built.
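The three paths can be represented as plain records. A minimal sketch in Python (the class and constant names here are illustrative, not Kalibr's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPath:
    """A complete execution strategy: a model paired with a search tool."""
    path_id: str
    model: str
    tool: str

# The three paths from the table above.
PATHS = [
    ExecutionPath("gpt4o-serper", "gpt-4o", "serper"),
    ExecutionPath("gpt4o-tavily", "gpt-4o", "tavily"),
    ExecutionPath("gpt4o-mini-tavily", "gpt-4o-mini", "tavily"),
]
```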

The Agent

A realistic multi-step research agent:

  1. Plan: Generate search queries (LLM call)
  2. Search: Call external API (Serper or Tavily)
  3. Extract: Pull facts with source references (LLM call)
  4. Synthesize: Write answer with citations (LLM call)
  5. Validate: Verify citations reference valid sources

A task succeeds only if all steps complete and validation passes.
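The five steps above can be sketched as a single function. The `llm` and `search` callables stand in for real OpenAI and search-API clients; the stubs exist only so the sketch runs end to end without any API keys:

```python
def research_task(question, llm, search):
    """Run one task through the five-step pipeline. The task succeeds
    only if every step completes and validation passes."""
    # 1. Plan: generate search queries (LLM call)
    queries = llm("plan", question)
    # 2. Search: call the external API (Serper or Tavily); may raise
    sources = [hit for q in queries for hit in search(q)]
    # 3. Extract: pull facts with source references (LLM call)
    facts = llm("extract", sources)
    # 4. Synthesize: write an answer with citations (LLM call)
    answer, citations = llm("synthesize", facts)
    # 5. Validate: every citation must reference a returned source
    if not set(citations) <= {s["url"] for s in sources}:
        return None  # validation failure counts as task failure
    return answer

# Toy stubs so the sketch is runnable; real calls go to the LLM and
# search providers.
def stub_llm(step, payload):
    if step == "plan":
        return ["query-1", "query-2"]
    if step == "extract":
        return [{"fact": "...", "url": s["url"]} for s in payload]
    return "an answer", [f["url"] for f in payload]  # synthesize

def stub_search(query):
    return [{"url": f"https://example.com/{query}"}]
```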

Experimental Conditions

Hardcoded Baseline

Every task executes the primary path, gpt4o-serper, regardless of outcomes. When that path degrades, tasks keep failing until a human changes the code.

Kalibr

Every task is routed across all three paths based on observed outcomes. No path is pinned; traffic shifts toward whatever is actually working.

Phases

| Phase | Tasks | Description |
| --- | --- | --- |
| Learning | 15 | Normal operation. No failures injected. |
| Degraded | 25 | Serper fails 70% of requests. Tavily unaffected. |
| Recovery | 10 | Degradation continues. Measure steady state. |

Failure Injection

At task 16, a 70% failure rate is injected only on Serper. Tavily remains healthy.

This simulates real-world degradation, such as a search provider outage or aggressive rate limiting. Crucially, the failure is scoped to one path: Kalibr can route around it; the hardcoded system cannot.
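A wrapper like the one below captures the injection setup (a sketch; the function and constant names are ours, not the benchmark's):

```python
import random

FAIL_RATE = 0.70      # injected failure rate on the Serper path
FAIL_FROM_TASK = 16   # degradation begins at task 16 (tasks are 1-based)

def injected_search(tool, task_num, real_search, rng=random):
    """Serper fails ~70% of requests from task 16 onward; Tavily is
    never touched, so the failure stays scoped to one path."""
    if tool == "serper" and task_num >= FAIL_FROM_TASK and rng.random() < FAIL_RATE:
        raise RuntimeError("injected Serper failure")
    return real_search(tool)
```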

Results

Run 1

| Phase | Hardcoded | Kalibr | Delta |
| --- | --- | --- | --- |
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 36.0% | 92.0% | +56.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 54.0% | 96.0% | +42.0% |

Run 2

| Phase | Hardcoded | Kalibr | Delta |
| --- | --- | --- | --- |
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 16.0% | 88.0% | +72.0% |
| Recovery | 20.0% | 100.0% | +80.0% |
| Overall | 42.0% | 94.0% | +52.0% |

Run 3

| Phase | Hardcoded | Kalibr | Delta |
| --- | --- | --- | --- |
| Learning | 93.3% | 100.0% | +6.7% |
| Degraded | 24.0% | 88.0% | +64.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 46.0% | 94.0% | +48.0% |

Results are consistent across runs.

Path Distribution (Run 3)

| Path | Tasks | Success Rate |
| --- | --- | --- |
| gpt4o-serper | 5 | 40.0% |
| gpt4o-tavily | 28 | 100.0% |
| gpt4o-mini-tavily | 17 | 100.0% |

Kalibr learned that Serper was failing and shifted traffic to Tavily paths.
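One way to reproduce this behavior is per-path success-rate tracking with a small amount of exploration. The epsilon-greedy sketch below illustrates the idea; it is not Kalibr's actual algorithm:

```python
import random

class SuccessRateRouter:
    """Toy router: track per-path success rates, usually pick the
    best-scoring path, and explore occasionally so recoveries are noticed."""

    def __init__(self, path_ids, epsilon=0.1, prior=1.0):
        self.epsilon = epsilon
        # [successes, attempts], seeded with a weak prior of 0.5
        self.stats = {p: [prior, 2 * prior] for p in path_ids}

    def choose(self, rng=random):
        if rng.random() < self.epsilon:
            return rng.choice(list(self.stats))  # explore
        # exploit: highest observed success rate
        return max(self.stats, key=lambda p: self.stats[p][0] / self.stats[p][1])

    def record(self, path_id, success):
        self.stats[path_id][0] += 1 if success else 0
        self.stats[path_id][1] += 1
```

In the benchmark loop, each task would call `choose()`, execute the selected path, and `record()` the validation outcome; once injected Serper failures accumulate, the serper path's score collapses and traffic shifts to the Tavily paths, matching the Run 3 distribution above.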

What This Demonstrates

During normal operation: Both systems perform at or near 100%. Kalibr adds no overhead when nothing is wrong.

During degradation: The hardcoded system keeps sending every task down the failing Serper path, completing only ~16-36% of tasks and staying broken until a human intervenes. Kalibr observes the failures and shifts traffic to the healthy Tavily paths, sustaining ~88-92% success with no code changes and no human in the loop.

This is not an optimization. It is a behavioral difference that hardcoded systems cannot exhibit.

What This Does Not Claim

This benchmark does not claim that any model or search tool is better than another, or that Kalibr improves results when nothing is failing. Kalibr is a control system: it routes execution based on what is actually working.

Limitations

Results should not be extrapolated to all workloads. The purpose is to validate adaptive execution path routing under degradation.

Run It Yourself

```shell
pip install kalibr openai httpx

export KALIBR_API_KEY=your-key
export KALIBR_TENANT_ID=your-tenant
export OPENAI_API_KEY=your-key
export SERPER_API_KEY=your-key
export TAVILY_API_KEY=your-key

python resilience_benchmark.py
```

Options:

```shell
python resilience_benchmark.py --quick  # ~25 tasks, ~3 min
python resilience_benchmark.py          # ~50 tasks, ~5 min
python resilience_benchmark.py --full   # ~100 tasks, ~10 min
```

Requirements: ~$0.30 in API usage (standard run), Python 3.10+

Summary

| Metric | Hardcoded | Kalibr |
| --- | --- | --- |
| Success during degradation | ~20-36% | ~88-92% |
| Human intervention required | Yes | No |
| Code changes required | Yes | No |

When execution paths degrade, hardcoded systems fail until humans intervene. Kalibr adapts automatically.

Source Code

The complete benchmark is open source: github.com/kalibr-ai/kalibr-benchmark