Short answers to common questions.
Model performance varies by task. Kalibr learns what works for your specific goals, not average performance across other systems.
LangSmith and Langfuse are observability tools. They show you what happened. Kalibr acts on what happened: it changes which model handles the next request. Observability without action is just dashboards.
Create a new goal. Goals are namespaced. extract_company_v1 and extract_company_v2 have separate routing. Old outcomes don't contaminate new criteria.
One HTTP call to decide() before each completion. Typically 10-50ms. If the intelligence service is slow or down, Kalibr falls back to your first path immediately.
Your agent keeps running. Router falls back to the first path in your list. Outcomes aren't recorded until service recovers, but your users don't see errors.
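The fallback pattern can be sketched in plain Python. Note that decide_fn, its signature, and the timeout value here are illustrative stand-ins, not the SDK's real internals:

```python
def decide_with_fallback(decide_fn, paths, timeout_s=0.05):
    """Ask the intelligence service which path to use; on any error
    or timeout, fall back to the first path in the list."""
    try:
        return decide_fn(paths, timeout=timeout_s)
    except Exception:
        return paths[0]

# Simulated outage: decide() raises, so the first path is used.
def broken_decide(paths, timeout):
    raise TimeoutError("intelligence service unreachable")

print(decide_with_fallback(broken_decide, ["gpt-4o", "claude-sonnet-4-20250514"]))
# -> gpt-4o
```

Because the fallback is the first path you listed, put your safest default model first.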
After ~20-50 outcomes per path, Kalibr has enough data to exploit confidently. Before that, expect more exploration.
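Why roughly that many? A back-of-envelope calculation, not Kalibr's actual algorithm: the uncertainty on an observed success rate shrinks with the square root of the sample count.

```python
import math

def stderr(p: float, n: int) -> float:
    """Standard error of an observed success rate p after n outcomes."""
    return math.sqrt(p * (1 - p) / n)

# At 30 outcomes, a 70% success rate is known to within ~8 points --
# enough to separate paths whose true rates differ meaningfully.
print(round(stderr(0.7, 30), 3))  # -> 0.084
```

Below that sample size, the estimates for different paths overlap too much to exploit safely, which is why you see more exploration early on.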
Yes. Install the extra with pip install kalibr[langchain], then call router.as_langchain() to get a LangChain-compatible chat model.
from kalibr import Router

router = Router(goal="summarize", paths=["gpt-4o", "claude-sonnet-4-20250514"])
llm = router.as_langchain()
chain = prompt | llm | parser
LangChain integration is Python-only. For TypeScript, use the Router directly or auto-instrumentation:
import { Router } from '@kalibr/sdk';
const router = new Router({
goal: 'summarize',
paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
});
// Use router.completion() directly
const response = await router.completion(messages);
await router.report(true);
No. Kalibr sees: which model was called, usage metrics (tokens, audio duration, image count), cost, latency, and success/failure. Your actual prompts and responses go directly to the LLM provider.
Declare a success_when predicate when you create the router:

router = Router(
goal="extract",
paths=["gpt-4o"],
success_when=lambda output: len(output) > 0
)
Or validate the output yourself and report the result:

result = json.loads(response.choices[0].message.content)
is_valid = validate_schema(result)
router.report(success=is_valid, reason=None if is_valid else "invalid_schema")
const router = new Router({
goal: 'extract',
paths: ['gpt-4o'],
successWhen: (output) => output.length > 0,
});
const result = JSON.parse(response.choices[0].message.content);
const isValid = validateSchema(result);
await router.report(isValid, isValid ? undefined : 'invalid_schema');
Use force_model to keep the same model across turns:
response1 = router.completion(messages=[...])
model = response1.model
response2 = router.completion(messages=[...], force_model=model)
router.report(success=issue_resolved)
Use forceModel to keep the same model across turns:
const response1 = await router.completion(messages);
const model = response1.model;
const response2 = await router.completion(messages, { forceModel: model });
await router.report(issueResolved);
Router is not thread-safe. Create separate Router instances per thread/task.
# Each thread gets its own router
def worker():
router = Router(goal="extract", paths=[...])
router.completion(...)
router.report(success=True)
Router is not thread-safe. In serverless/edge functions, create a new Router per request. In long-running Node.js apps, create separate instances per async context.
// Each request handler gets its own router
app.post('/api/extract', async (req, res) => {
const router = new Router({ goal: 'extract', paths: [...] });
const response = await router.completion(req.body.messages);
await router.report(true);
res.json(response);
});
Create a new goal with version suffix:
router = Router(goal="extract_company_v2", paths=[...])
const router = new Router({ goal: 'extract_company_v2', paths: [...] });
Yes. Paths can include parameters, so the same model can compete against itself at different settings:

router = Router(
goal="creative_writing",
paths=[
{"model": "gpt-4o", "params": {"temperature": 0.3}},
{"model": "gpt-4o", "params": {"temperature": 0.9}}
]
)
const router = new Router({
goal: 'creative_writing',
paths: [
{ model: 'gpt-4o', params: { temperature: 0.3 } },
{ model: 'gpt-4o', params: { temperature: 0.9 } },
],
});
It's similar in spirit but different in execution. Traditional A/B testing:
- splits traffic on a fixed schedule
- waits for significance, then a human ships the winner
Kalibr:
- shifts traffic continuously as outcomes arrive
- promotes the winner automatically, and keeps re-checking it
Think of it as A/B testing that never ends and deploys itself.
They optimize for metrics they can measure: cost, latency, uptime. Kalibr optimizes for your definition of success.
The litmus test: If a model starts returning syntactically valid but semantically wrong answers for three days, will the router notice?
Other routers: "This model is cheapest and fastest"
Kalibr: "This model actually works for your task"
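The litmus test above hinges on a semantic check feeding the outcome signal. A standalone sketch of that kind of check, the sort of predicate you would wire into success_when or report(); the JSON shape and field name are illustrative:

```python
import json

def extraction_succeeded(raw: str) -> bool:
    """Syntactic validity alone is not success: the payload must also
    pass a semantic check (here, a non-empty company field)."""
    try:
        data = json.loads(raw)        # syntactic check
    except json.JSONDecodeError:
        return False
    return bool(data.get("company"))  # semantic check

print(extraction_succeeded('{"company": ""}'))  # valid JSON, wrong answer -> False
```

A router that only watches cost and latency never runs a check like this, so it cannot notice the three-day failure.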
LangSmith is observability: it shows you what happened. You look at dashboards, notice problems, then manually change your code.
Kalibr is autonomous optimization: it changes what happens next, without human intervention.
LangSmith: "Here's a dashboard showing gpt-4o failed 20% of the time yesterday. You should probably do something about that."
Kalibr: "gpt-4o started failing more. Traffic automatically shifted to Claude. You didn't have to do anything."
Observability tells you there's a problem. Kalibr fixes it.
The SDK falls back to the first path in your list. Your application keeps working; it just loses the optimization benefits until Kalibr recovers.
We designed for this explicitly. Kalibr should never be a single point of failure for your agent.
Kalibr needs enough outcomes to learn. Rough guidelines:
- ~20-50 outcomes per path before Kalibr exploits confidently
- below that, expect more exploration
If you have very low traffic for a goal, consider using get_policy() with a longer time window, or just hardcode the path.
No. Kalibr tracks:
- which model was called
- usage metrics (tokens, audio duration, image count)
- cost and latency
- success/failure and the failure category
Kalibr does not track:
- prompt content
- completion content
Your prompts and completions go directly to providers. Kalibr only sees metadata.
Twelve structured categories: timeout, context_exceeded, tool_error, rate_limited, validation_failed, hallucination_detected, user_unsatisfied, empty_response, malformed_output, auth_error, provider_error, unknown.
Use the FAILURE_CATEGORIES constant from the SDK for client-side validation:
from kalibr import FAILURE_CATEGORIES

router.report(success=False, failure_category="timeout")
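The SDK exports the constant; here is a standalone sketch of the same client-side check, mirroring the twelve categories listed above:

```python
# The twelve categories from the answer above, mirrored locally.
FAILURE_CATEGORIES = frozenset({
    "timeout", "context_exceeded", "tool_error", "rate_limited",
    "validation_failed", "hallucination_detected", "user_unsatisfied",
    "empty_response", "malformed_output", "auth_error", "provider_error",
    "unknown",
})

def check_category(category: str) -> str:
    """Reject unknown categories before sending the report."""
    if category not in FAILURE_CATEGORIES:
        raise ValueError(f"unknown failure_category: {category}")
    return category

print(check_category("timeout"))  # -> timeout
```

Validating locally fails fast instead of waiting for the service to reject a typo.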
Yes. Use update_outcome() to correct outcomes when real-world signals arrive later. Only the fields you explicitly pass are updated; everything else keeps its original value.
from kalibr import update_outcome
update_outcome(trace_id="abc123", goal="resolve_ticket",
success=False, failure_category="user_unsatisfied")
Call get_insights() to get structured diagnostics per goal: health status, failure modes, path comparisons, parameter sensitivity, and actionable signals. Designed for coding agents that need to decide what to improve.
from kalibr import get_insights
insights = get_insights(goal="resolve_ticket")
for signal in insights["goals"][0]["actionable_signals"]:
print(signal["type"], signal["severity"])
When the auto_path_generation feature flag is enabled, yes. A background job runs hourly, identifies parameters at the boundary of explored space, and generates new paths with interpolated values. New paths are evaluated through existing exploration traffic. Underperformers are cleaned up automatically after 30+ samples.
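The exact generation strategy is internal to Kalibr; a toy sketch of interpolating new candidate values between explored parameter settings:

```python
def interpolated_candidates(values):
    """Propose midpoints between neighboring explored values --
    a simplified sketch of interpolation over explored space."""
    explored = sorted(set(values))
    return [round((a + b) / 2, 2) for a, b in zip(explored, explored[1:])]

# Temperatures 0.3 and 0.9 have been explored; 0.6 becomes a candidate.
print(interpolated_candidates([0.3, 0.9]))  # -> [0.6]
```

The real job also prunes: candidates that underperform after 30+ samples are removed automatically.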
Email support@kalibr.systems or check the dashboard.