Core Concepts

Kalibr routes each agent request to the lowest-cost model still succeeding for that task. These are the building blocks it uses to make that decision, and to keep improving it over time.

Two ways to use Kalibr

Kalibr works the same way regardless of who is doing the wiring. What changes is who defines the goals, writes the success criteria, and iterates on routing configuration.

Developer-led

You define goals, write success_when callbacks, and call router.report(). Full control. Works well when success criteria are clear up front.

You own: goal definition, success criteria, path selection, outcome interpretation

Agent-orchestrated

A coding agent (Claude Code, OpenClaw, Cursor) acts as the engineer. It defines goals, sets success criteria, instruments code, reads insights, adds paths, and iterates. The outcome-reporting burden disappears because the agent handles it.

Agent owns: all of the above. Kalibr owns: routing decisions, statistical learning, degradation detection.

Either way, Kalibr does the same thing: route to the cheapest path still succeeding, detect when paths degrade, and improve over time.

The agent orchestration loop

When an agent is the orchestrator, it runs a continuous improvement loop:

  1. Agent defines goal + paths + success criteria, instruments code with Router
  2. System runs, Kalibr routes each request using Thompson Sampling
  3. Agent calls get_insights() to read failure modes, path performance, and trends
  4. Agent adds new paths, adjusts success criteria, or removes degrading paths based on insights
  5. Loop back to step 2

This also resolves the cold start: the agent monitors sample_count and confidence per path, and proactively adds exploration traffic or new paths when learning is slow.
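Step 2's routing decision can be sketched with a self-contained Thompson Sampling loop. This is a minimal illustration of the technique, not Kalibr's actual implementation; the path names and success rates are made up:

```python
import random

random.seed(0)

def thompson_pick(stats):
    """Pick the path whose sampled success rate is highest.

    stats maps path -> [successes, failures]. Each path's unknown
    success rate is modeled as a Beta(successes + 1, failures + 1)
    posterior; we draw one sample per path and route to the max.
    """
    samples = {p: random.betavariate(s + 1, f + 1) for p, (s, f) in stats.items()}
    return max(samples, key=samples.get)

# Simulate two paths: "cheap" succeeds 90% of the time, "flaky" 50%.
stats = {"cheap": [0, 0], "flaky": [0, 0]}
true_rate = {"cheap": 0.9, "flaky": 0.5}
for _ in range(2000):
    path = thompson_pick(stats)
    if random.random() < true_rate[path]:
        stats[path][0] += 1
    else:
        stats[path][1] += 1

cheap_traffic = sum(stats["cheap"])
flaky_traffic = sum(stats["flaky"])
```

Because sampling from the posterior naturally balances exploration and exploitation, traffic concentrates on the stronger path as evidence accumulates, without a separate explore/exploit switch.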

Goals

A goal is a task with a consistent success criterion.

Good goals:

Bad goals:

Each goal gets its own routing state. Kalibr learns independently for each.

When to create a new goal

When to keep the same goal

Paths

A path is a complete execution configuration. Paths work across any modality: text LLMs, voice models, image generators, embedding models, and any model on HuggingFace.

Just models (Python):

```python
paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"]
```

Just models (TypeScript):

```typescript
const paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"];
```

Model + tool combinations (Python):

```python
paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"]},
    {"model": "gpt-4o", "tools": ["google_calendar"]},
    {"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]
```

Model + tool combinations (TypeScript):

```typescript
const paths = [
  { model: "gpt-4o", tools: ["calendar_api"] },
  { model: "gpt-4o", tools: ["google_calendar"] },
  { model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];
```

Model + tool + parameter combinations (Python):

```python
paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}},
]
```

Model + tool + parameter combinations (TypeScript):

```typescript
const paths = [
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } },
];
```

Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
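One way to picture this per-path bookkeeping: statistics are keyed on a normalized form of each configuration, so a bare model string and a model+tool dict each get their own independent record. This is a sketch of the idea, not Kalibr's internal representation:

```python
import json

def path_key(path):
    """Normalize a path spec (bare model string or config dict) into a
    stable key so statistics can be tracked per unique configuration."""
    if isinstance(path, str):
        path = {"model": path}
    return json.dumps(path, sort_keys=True)

# The two gpt-4o configurations below are distinct paths, each with
# its own success-rate statistics.
k1 = path_key({"model": "gpt-4o", "tools": ["calendar_api"]})
k2 = path_key({"model": "gpt-4o", "tools": ["google_calendar"]})
k3 = path_key("gpt-4o")
```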

Outcomes

An outcome is what you report after execution: success or failure, optionally with a continuous quality score.

Python:

```python
# Binary outcome
router.report(success=True)
router.report(success=False, reason="invalid_time")

# Continuous quality score, feeds directly into routing
router.report(success=True, score=0.85)

# Score provides finer signal than binary alone.
# A path scoring 0.85 consistently will be preferred
# over one scoring 0.6, even if both technically "succeed."
```

TypeScript:

```typescript
// Binary outcome
await router.report(true);
await router.report(false, "invalid_time");

// Continuous quality score (success, reason, score)
await router.report(true, undefined, 0.85);
```

Without outcomes, Kalibr can't learn. This is the feedback loop.
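Why a continuous score sharpens routing can be shown with a small deterministic sketch. The fractional-mass update below (a score of 0.85 contributes 0.85 "success mass" and 0.15 "failure mass" to a Beta posterior) is an assumption for illustration, not Kalibr's documented internals:

```python
def update(stats, path, score):
    """Fold a continuous quality score in [0, 1] into a Beta posterior."""
    s, f = stats[path]
    stats[path] = (s + score, f + (1 - score))

def posterior_mean(stats, path):
    """Expected success rate under a Beta(s + 1, f + 1) posterior."""
    s, f = stats[path]
    return (s + 1) / (s + f + 2)

stats = {"a": (0.0, 0.0), "b": (0.0, 0.0)}
for _ in range(200):
    update(stats, "a", 0.85)  # consistently high quality
    update(stats, "b", 0.60)  # always "succeeds", but lower quality

# Path "a" ends up preferred even though neither path ever fails outright;
# binary outcomes alone could not distinguish them.
```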

What Kalibr tracks per path:

What Kalibr ignores:

Cold start

Kalibr needs outcomes to learn, so routing starts in an exploration-heavy initial phase before clear preferences emerge.

For low-traffic goals this phase can take days. If you are using the agent-orchestrated setup, your agent can monitor sample_count and confidence via get_insights() and respond by adding more test traffic or new paths when confidence is low.
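That monitoring step might look like the sketch below. The per-path insights schema shown here (sample_count, confidence fields) is assumed for illustration; consult the actual get_insights() output for the real shape:

```python
def paths_needing_exploration(insights, min_samples=30, min_confidence=0.8):
    """Flag paths whose learning is still immature, so the agent can
    add exploration traffic or test cases for them."""
    return [
        p["path"]
        for p in insights["paths"]
        if p["sample_count"] < min_samples or p["confidence"] < min_confidence
    ]

# Hypothetical insights payload for one goal.
insights = {"paths": [
    {"path": "gpt-4o", "sample_count": 120, "confidence": 0.94},
    {"path": "claude-sonnet-4-20250514", "sample_count": 8, "confidence": 0.41},
]}
immature = paths_needing_exploration(insights)
```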

The Feedback Loop

Kalibr captures execution telemetry and serves it back as structured intelligence. The full loop:

1. Report outcomes: your agent reports success or failure after each task, optionally with a continuous quality score (0-1) and a structured failure_category (timeout, tool_error, hallucination_detected, etc.). The continuous score feeds directly into Thompson Sampling for finer-grained routing.

2. Kalibr learns: Thompson Sampling updates beliefs about which paths work best, using both binary outcomes and continuous quality scores. Trend detection identifies degradation. Rollback monitoring disables failing paths automatically.

3. Query insights: a coding agent calls get_insights() and receives structured diagnostics: which goals are healthy, which are failing, which failure modes dominate, which paths underperform, which parameters matter.

4. Update outcomes: when real-world signals arrive later (a customer reopens a ticket 48 hours after "resolution"), update_outcome() corrects the record. Every downstream component learns from the correction.

5. Auto-explore: when enabled, Kalibr automatically generates new path configurations by interpolating parameter values (e.g., trying temperature 0.2 if 0.3 was the best value tested). New paths are evaluated through existing exploration traffic.
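The parameter interpolation in step 5 can be sketched as proposing neighbors of the best value tested so far. The step size and rounding here are assumptions, not Kalibr's actual search strategy:

```python
def propose_neighbors(best, tested, step=0.1, lo=0.0, hi=1.0):
    """Propose parameter values adjacent to the best value tested,
    skipping values already tried and values outside the valid range."""
    candidates = [round(best - step, 2), round(best + step, 2)]
    return [t for t in candidates if lo <= t <= hi and t not in tested]

# Temperatures 0.3 and 0.7 were tested and 0.3 performed best,
# so the next candidates to explore are 0.2 and 0.4.
candidates = propose_neighbors(0.3, tested=[0.3, 0.7])
```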

The human's role: set goals, define success criteria, own billing, check in occasionally. Everything else is agent-to-agent.

Failure Categories

Instead of free-text failure reasons that can't be aggregated, Kalibr supports structured failure categories. These enable clean clustering: "60% of failures for this goal are timeouts" rather than parsing thousands of unique error strings.

```python
from kalibr import FAILURE_CATEGORIES

# timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown

router.report(success=False, failure_category="timeout",
              reason="Provider timed out after 30s")
```
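The clustering payoff is plain counting. A quick sketch of how structured categories aggregate where free-text reasons would not (the failure list is made up):

```python
from collections import Counter

# Structured categories aggregate cleanly; thousands of unique
# free-text error strings would not.
failures = ["timeout", "timeout", "tool_error", "timeout", "rate_limited"]
by_category = Counter(failures)

top_category, count = by_category.most_common(1)[0]
share = count / len(failures)
# share of 0.6 reads as "60% of failures for this goal are timeouts"
```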

Constraints

You can add constraints to routing decisions:

Python:

```python
policy = get_policy(
    goal="book_meeting",
    constraints={
        "max_cost_usd": 0.05,
        "max_latency_ms": 2000,
        "min_quality": 0.8
    }
)
```

TypeScript:

```typescript
const policy = await getPolicy("book_meeting", {
  constraints: {
    maxCostUsd: 0.05,
    maxLatencyMs: 2000,
    minQuality: 0.8
  }
});
```

Kalibr will only recommend paths that meet all constraints.
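Constraint filtering amounts to keeping only paths that satisfy every bound before routing chooses among them. A sketch under assumed per-path observed stats (the schema and numbers are illustrative):

```python
def meets_constraints(stats, constraints):
    """A path is eligible only if it satisfies every constraint."""
    return (
        stats["cost_usd"] <= constraints["max_cost_usd"]
        and stats["latency_ms"] <= constraints["max_latency_ms"]
        and stats["quality"] >= constraints["min_quality"]
    )

constraints = {"max_cost_usd": 0.05, "max_latency_ms": 2000, "min_quality": 0.8}
observed = {
    "gpt-4o-mini": {"cost_usd": 0.01, "latency_ms": 900, "quality": 0.75},
    "gpt-4o": {"cost_usd": 0.04, "latency_ms": 1500, "quality": 0.88},
    "big-model": {"cost_usd": 0.12, "latency_ms": 800, "quality": 0.95},
}
eligible = [p for p, s in observed.items() if meets_constraints(s, constraints)]
# gpt-4o-mini is excluded on quality, big-model on cost.
```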

What Kalibr Doesn't Do

Next