Core Concepts

Kalibr routes each agent request to the lowest-cost model still succeeding for that task. These are the building blocks it uses to make that decision, and to keep improving it over time.

Two ways to use Kalibr

Kalibr works the same way regardless of who is doing the wiring. What changes is who defines the goals, writes the success criteria, and iterates on routing configuration.

Developer-led

You define goals, write success_when callbacks, and call router.report(). Full control. Works well when success criteria are clear up front.

You own: goal definition, success criteria, path selection, outcome interpretation

Agent-orchestrated

A coding agent (Claude Code, OpenClaw, Cursor) acts as the engineer. It defines goals, sets success criteria, instruments code, reads insights, adds paths, and iterates. The outcome-reporting burden disappears because the agent handles it.

Agent owns: all of the above. Kalibr owns: routing decisions, statistical learning, degradation detection.

Either way, Kalibr does the same thing: route to the cheapest path still succeeding, detect when paths degrade, and improve over time.

The agent orchestration loop

When an agent is the orchestrator, it runs a continuous improvement loop:

  1. Agent defines goal + paths + success criteria, instruments code with Router
  2. System runs, Kalibr routes each request using Thompson Sampling
  3. Agent calls get_insights() to read failure modes, path performance, and trends
  4. Agent adds new paths, adjusts success criteria, or removes degrading paths based on insights
  5. Loop back to step 2

This also resolves the cold start: the agent monitors sample_count and confidence per path, and proactively adds exploration traffic or new paths when learning is slow.
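Step 2's routing decision can be sketched with a self-contained Thompson Sampling loop. This is a minimal illustration of the technique, not Kalibr's actual implementation; the path names and success rates are made up:

```python
import random

random.seed(0)

def thompson_pick(stats):
    """Pick the path whose sampled success rate is highest.

    stats maps path -> [successes, failures]. Each path's unknown
    success rate is modeled as a Beta(successes + 1, failures + 1)
    posterior; we draw one sample per path and route to the max.
    """
    samples = {p: random.betavariate(s + 1, f + 1) for p, (s, f) in stats.items()}
    return max(samples, key=samples.get)

# Simulate two paths: "cheap" succeeds 90% of the time, "flaky" 50%.
stats = {"cheap": [0, 0], "flaky": [0, 0]}
true_rate = {"cheap": 0.9, "flaky": 0.5}
for _ in range(2000):
    path = thompson_pick(stats)
    if random.random() < true_rate[path]:
        stats[path][0] += 1
    else:
        stats[path][1] += 1

cheap_traffic = sum(stats["cheap"])
flaky_traffic = sum(stats["flaky"])
```

Because sampling from the posterior naturally balances exploration and exploitation, traffic concentrates on the stronger path as evidence accumulates, without a separate explore/exploit switch.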

Goals

A goal is a task with a consistent success criterion.

Good goals:

Bad goals:

Each goal gets its own routing state. Kalibr learns independently for each.

When to create a new goal

When to keep the same goal

Paths

A path is a complete execution configuration. Paths work across any modality: text LLMs, voice models, image generators, embedding models, and any model on HuggingFace.

Just models (Python):

```python
paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"]
```

Just models (TypeScript):

```typescript
const paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"];
```

Model + tool combinations (Python):

```python
paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"]},
    {"model": "gpt-4o", "tools": ["google_calendar"]},
    {"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]
```

Model + tool combinations (TypeScript):

```typescript
const paths = [
  { model: "gpt-4o", tools: ["calendar_api"] },
  { model: "gpt-4o", tools: ["google_calendar"] },
  { model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];
```

Model + tool + parameter combinations (Python):

```python
paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}},
]
```

Model + tool + parameter combinations (TypeScript):

```typescript
const paths = [
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } },
];
```

Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
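One way to picture this per-path bookkeeping: statistics are keyed on a normalized form of each configuration, so a bare model string and a model+tool dict each get their own independent record. This is a sketch of the idea, not Kalibr's internal representation:

```python
import json

def path_key(path):
    """Normalize a path spec (bare model string or config dict) into a
    stable key so statistics can be tracked per unique configuration."""
    if isinstance(path, str):
        path = {"model": path}
    return json.dumps(path, sort_keys=True)

# The two gpt-4o configurations below are distinct paths, each with
# its own success-rate statistics.
k1 = path_key({"model": "gpt-4o", "tools": ["calendar_api"]})
k2 = path_key({"model": "gpt-4o", "tools": ["google_calendar"]})
k3 = path_key("gpt-4o")
```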

Outcomes

An outcome is what you report after execution: success or failure, optionally with a continuous quality score.

Python:

```python
# Binary outcome
router.report(success=True)
router.report(success=False, reason="invalid_time")

# Continuous quality score, feeds directly into routing
router.report(success=True, score=0.85)

# Score provides finer signal than binary alone.
# A path scoring 0.85 consistently will be preferred
# over one scoring 0.6, even if both technically "succeed."
```

TypeScript:

```typescript
// Binary outcome
await router.report(true);
await router.report(false, "invalid_time");

// Continuous quality score (success, reason, score)
await router.report(true, undefined, 0.85);
```

Without outcomes, Kalibr can't learn. This is the feedback loop.
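Why a continuous score sharpens routing can be shown with a small deterministic sketch. The fractional-mass update below (a score of 0.85 contributes 0.85 "success mass" and 0.15 "failure mass" to a Beta posterior) is an assumption for illustration, not Kalibr's documented internals:

```python
def update(stats, path, score):
    """Fold a continuous quality score in [0, 1] into a Beta posterior."""
    s, f = stats[path]
    stats[path] = (s + score, f + (1 - score))

def posterior_mean(stats, path):
    """Expected success rate under a Beta(s + 1, f + 1) posterior."""
    s, f = stats[path]
    return (s + 1) / (s + f + 2)

stats = {"a": (0.0, 0.0), "b": (0.0, 0.0)}
for _ in range(200):
    update(stats, "a", 0.85)  # consistently high quality
    update(stats, "b", 0.60)  # always "succeeds", but lower quality

# Path "a" ends up preferred even though neither path ever fails outright;
# binary outcomes alone could not distinguish them.
```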

What Kalibr tracks per path:

What Kalibr ignores:

Cold start

Kalibr needs outcomes to learn, so routing starts in an exploration-heavy initial phase before clear preferences emerge.

For low-traffic goals this phase can take days. If you are using the agent-orchestrated setup, your agent can monitor sample_count and confidence via get_insights() and respond by adding more test traffic or new paths when confidence is low.
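That monitoring step might look like the sketch below. The per-path insights schema shown here (sample_count, confidence fields) is assumed for illustration; consult the actual get_insights() output for the real shape:

```python
def paths_needing_exploration(insights, min_samples=30, min_confidence=0.8):
    """Flag paths whose learning is still immature, so the agent can
    add exploration traffic or test cases for them."""
    return [
        p["path"]
        for p in insights["paths"]
        if p["sample_count"] < min_samples or p["confidence"] < min_confidence
    ]

# Hypothetical insights payload for one goal.
insights = {"paths": [
    {"path": "gpt-4o", "sample_count": 120, "confidence": 0.94},
    {"path": "claude-sonnet-4-20250514", "sample_count": 8, "confidence": 0.41},
]}
immature = paths_needing_exploration(insights)
```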

The Feedback Loop

Kalibr captures execution telemetry and serves it back as structured intelligence. The full loop:

1. Report outcomes: your agent reports success or failure after each task, optionally with a continuous quality score (0-1) and a structured failure_category (timeout, tool_error, hallucination_detected, etc.). The continuous score feeds directly into Thompson Sampling for finer-grained routing.

2. Kalibr learns: Thompson Sampling updates beliefs about which paths work best, using both binary outcomes and continuous quality scores. Trend detection identifies degradation. Rollback monitoring disables failing paths automatically.

3. Query insights: a coding agent calls get_insights() and receives structured diagnostics: which goals are healthy, which are failing, which failure modes dominate, which paths underperform, which parameters matter.

4. Update outcomes: when real-world signals arrive later (a customer reopens a ticket 48 hours after "resolution"), update_outcome() corrects the record. Every downstream component learns from the correction.

5. Auto-explore: when enabled, Kalibr automatically generates new path configurations by interpolating parameter values (e.g., trying temperature 0.2 if 0.3 was the best value tested). New paths are evaluated through existing exploration traffic.
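The parameter interpolation in step 5 can be sketched as proposing neighbors of the best value tested so far. The step size and rounding here are assumptions, not Kalibr's actual search strategy:

```python
def propose_neighbors(best, tested, step=0.1, lo=0.0, hi=1.0):
    """Propose parameter values adjacent to the best value tested,
    skipping values already tried and values outside the valid range."""
    candidates = [round(best - step, 2), round(best + step, 2)]
    return [t for t in candidates if lo <= t <= hi and t not in tested]

# Temperatures 0.3 and 0.7 were tested and 0.3 performed best,
# so the next candidates to explore are 0.2 and 0.4.
candidates = propose_neighbors(0.3, tested=[0.3, 0.7])
```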

The human's role: set goals, define success criteria, own billing, check in occasionally. Everything else is agent-to-agent.

Failure Categories

Instead of free-text failure reasons that can't be aggregated, Kalibr supports structured failure categories. These enable clean clustering: "60% of failures for this goal are timeouts" rather than parsing thousands of unique error strings.

```python
from kalibr import FAILURE_CATEGORIES

# timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown

router.report(success=False, failure_category="timeout",
              reason="Provider timed out after 30s")
```
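The clustering payoff is plain counting. A quick sketch of how structured categories aggregate where free-text reasons would not (the failure list is made up):

```python
from collections import Counter

# Structured categories aggregate cleanly; thousands of unique
# free-text error strings would not.
failures = ["timeout", "timeout", "tool_error", "timeout", "rate_limited"]
by_category = Counter(failures)

top_category, count = by_category.most_common(1)[0]
share = count / len(failures)
# share of 0.6 reads as "60% of failures for this goal are timeouts"
```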

Constraints

You can add constraints to routing decisions:

Python:

```python
policy = get_policy(
    goal="book_meeting",
    constraints={
        "max_cost_usd": 0.05,
        "max_latency_ms": 2000,
        "min_quality": 0.8
    }
)
```

TypeScript:

```typescript
const policy = await getPolicy("book_meeting", {
  constraints: {
    maxCostUsd: 0.05,
    maxLatencyMs: 2000,
    minQuality: 0.8
  }
});
```

Kalibr will only recommend paths that meet all constraints.
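Constraint filtering amounts to keeping only paths that satisfy every bound before routing chooses among them. A sketch under assumed per-path observed stats (the schema and numbers are illustrative):

```python
def meets_constraints(stats, constraints):
    """A path is eligible only if it satisfies every constraint."""
    return (
        stats["cost_usd"] <= constraints["max_cost_usd"]
        and stats["latency_ms"] <= constraints["max_latency_ms"]
        and stats["quality"] >= constraints["min_quality"]
    )

constraints = {"max_cost_usd": 0.05, "max_latency_ms": 2000, "min_quality": 0.8}
observed = {
    "gpt-4o-mini": {"cost_usd": 0.01, "latency_ms": 900, "quality": 0.75},
    "gpt-4o": {"cost_usd": 0.04, "latency_ms": 1500, "quality": 0.88},
    "big-model": {"cost_usd": 0.12, "latency_ms": 800, "quality": 0.95},
}
eligible = [p for p, s in observed.items() if meets_constraints(s, constraints)]
# gpt-4o-mini is excluded on quality, big-model on cost.
```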

What Kalibr Doesn't Do

Next