What happens when an agent's execution path starts failing in production? A hardcoded system keeps failing until a human intervenes. Kalibr routes around the degraded path automatically. This benchmark compares those two behaviors under identical conditions.
This is execution path routing, not model routing.
Each path is a complete execution strategy combining model and tool:
| Path ID | Model | Tool | Description |
|---|---|---|---|
| gpt4o-serper | gpt-4o | Serper | Primary path (hardcoded baseline) |
| gpt4o-tavily | gpt-4o | Tavily | Backup path (same model, different tool) |
| gpt4o-mini-tavily | gpt-4o-mini | Tavily | Cost-optimized backup |
The paths differ by tool, not just model. This reflects how real agents are built.
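The table above can be modeled as plain (model, tool) pairs. A minimal sketch in Python; the class and field names are illustrative, not Kalibr's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPath:
    """A complete execution strategy: which model to call and which search tool to use."""
    path_id: str
    model: str
    tool: str

PATHS = [
    ExecutionPath("gpt4o-serper", "gpt-4o", "serper"),            # primary (hardcoded baseline)
    ExecutionPath("gpt4o-tavily", "gpt-4o", "tavily"),            # backup: same model, different tool
    ExecutionPath("gpt4o-mini-tavily", "gpt-4o-mini", "tavily"),  # cost-optimized backup
]
```

Because two of the three paths share a model but differ by tool, routing between them changes which external dependency is exercised, not just which LLM answers.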
Each task runs a realistic multi-step research agent. A task succeeds only if all steps complete and validation passes.
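That success criterion can be sketched directly; `steps` and `validate` are hypothetical stand-ins for the agent's real pipeline:

```python
def run_task(steps, validate):
    """A task succeeds only if every step completes AND final validation passes."""
    outputs = []
    for step in steps:
        ok, output = step()
        if not ok:
            return False  # any failed step fails the whole task
        outputs.append(output)
    return validate(outputs)

# Example: two passing steps and a validator that checks both produced output.
result = run_task(
    [lambda: (True, "search results"), lambda: (True, "summary")],
    validate=lambda outs: all(outs),
)
```

The all-or-nothing criterion is why a 70% tool failure rate translates into an even lower task success rate: one failed search step sinks the entire multi-step task.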
The benchmark runs in three phases:

| Phase | Tasks | Description |
|---|---|---|
| Learning | 15 | Normal operation. No failures injected. |
| Degraded | 25 | Serper fails 70% of requests. Tavily unaffected. |
| Recovery | 10 | Degradation continues. Measure steady state. |
At task 16, a 70% failure rate is injected only on Serper. Tavily remains healthy.
This simulates real-world degradation: a single upstream dependency failing while the others stay healthy.
The failure is scoped to one path. Kalibr can route around it. Hardcoded cannot.
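The scoped failure injection can be sketched as a wrapper around one tool call; the wrapper name and error message are illustrative:

```python
import random

def with_injected_failures(call, failure_rate=0.70, rng=random.Random(42)):
    """Wrap a tool call so a fixed fraction of requests raise,
    simulating a degraded provider. Only the wrapped path is affected."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("injected failure (simulated Serper outage)")
        return call(*args, **kwargs)
    return wrapper

healthy_search = lambda q: f"results for {q!r}"
degraded_search = with_injected_failures(healthy_search)  # Serper-shaped path
# healthy_search itself stays untouched, like the Tavily paths in the benchmark
```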
Run 1:

| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 36.0% | 92.0% | +56.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 54.0% | 96.0% | +42.0% |
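The Overall row is the task-weighted average of the three phases (15 + 25 + 10 = 50 tasks), which can be checked directly:

```python
# Per-phase task counts and success rates from the Run 1 table.
tasks  = {"learning": 15, "degraded": 25, "recovery": 10}
hard   = {"learning": 1.00, "degraded": 0.36, "recovery": 0.30}
kalibr = {"learning": 1.00, "degraded": 0.92, "recovery": 1.00}

def overall(rates):
    """Task-weighted average success rate across phases."""
    total = sum(tasks.values())  # 50 tasks
    return sum(tasks[p] * rates[p] for p in tasks) / total

print(overall(hard))    # 0.54 -> the 54.0% Overall row
print(overall(kalibr))  # 0.96 -> the 96.0% Overall row
```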
Run 2:

| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 16.0% | 88.0% | +72.0% |
| Recovery | 20.0% | 100.0% | +80.0% |
| Overall | 42.0% | 94.0% | +52.0% |
Run 3:

| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 93.3% | 100.0% | +6.7% |
| Degraded | 24.0% | 88.0% | +64.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 46.0% | 94.0% | +48.0% |
Results are consistent across runs.
Kalibr's path distribution over the full run:

| Path | Tasks | Success Rate |
|---|---|---|
| gpt4o-serper | 5 | 40.0% |
| gpt4o-tavily | 28 | 100.0% |
| gpt4o-mini-tavily | 17 | 100.0% |
Kalibr learned that Serper was failing and shifted traffic to Tavily paths.
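One way to get this behavior is to score each path by its recent success rate and route to the best one. A minimal sketch of that idea, not Kalibr's actual algorithm:

```python
from collections import deque

class SuccessRateRouter:
    """Route to the path with the best recent success rate (sliding window)."""

    def __init__(self, path_ids, window=20):
        # One bounded history of 1/0 outcomes per path.
        self.history = {p: deque(maxlen=window) for p in path_ids}

    def choose(self):
        # Untried paths default to 1.0 so they still get explored.
        def score(p):
            h = self.history[p]
            return sum(h) / len(h) if h else 1.0
        return max(self.history, key=score)

    def record(self, path_id, success):
        self.history[path_id].append(1 if success else 0)

router = SuccessRateRouter(["gpt4o-serper", "gpt4o-tavily", "gpt4o-mini-tavily"])
```

With a sliding window, the router also recovers on its own: if the primary path becomes healthy again, fresh successes push its score back up and traffic can return.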
During normal operation, both systems perform identically; Kalibr adds no overhead when nothing is wrong.

During degradation:
- Hardcoded system: keeps sending every request down the failing Serper path until a human intervenes.
- Kalibr: detects the failures and shifts traffic to the healthy Tavily paths.
This is not an optimization. It is a behavioral difference that hardcoded systems cannot exhibit.
This benchmark does not demonstrate model routing or general workload performance. Kalibr is a control system: it routes execution based on what is actually working.
Results should not be extrapolated to all workloads. The purpose is to validate adaptive execution path routing under degradation.
```
pip install kalibr openai httpx

export KALIBR_API_KEY=your-key
export KALIBR_TENANT_ID=your-tenant
export OPENAI_API_KEY=your-key
export SERPER_API_KEY=your-key
export TAVILY_API_KEY=your-key

python resilience_benchmark.py
```
Options:
```
python resilience_benchmark.py --quick   # ~25 tasks, ~3 min
python resilience_benchmark.py           # ~50 tasks, ~5 min
python resilience_benchmark.py --full    # ~100 tasks, ~10 min
```
Requirements: Python 3.10+, ~$0.30 in API usage (standard run)
| Metric | Hardcoded | Kalibr |
|---|---|---|
| Success during degradation | ~20-36% | ~88-92% |
| Human intervention required | Yes | No |
| Code changes required | Yes | No |
When execution paths degrade, hardcoded systems fail until humans intervene. Kalibr adapts automatically.
The complete benchmark is open source: github.com/kalibr-ai/kalibr-benchmark