Short answers to common questions.
Model performance varies by task. Kalibr learns what works for your specific goals, not average performance across other systems.
LangSmith and Langfuse are observability tools. They show you what happened. Kalibr acts on what happened: it changes which model handles the next request. Observability without action is just dashboards.
Create a new goal. Goals are namespaced. extract_company_v1 and extract_company_v2 have separate routing. Old outcomes don't contaminate new criteria.
One HTTP call to decide() before each completion. Typically 10-50ms. If the intelligence service is slow or down, Kalibr falls back to your first path immediately.
Your agent keeps running. Router falls back to the first path in your list. Outcomes aren't recorded until service recovers, but your users don't see errors.
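The fallback pattern can be sketched in plain Python. Note that decide_fn, its signature, and the timeout value here are illustrative stand-ins, not the SDK's real internals:

```python
def decide_with_fallback(decide_fn, paths, timeout_s=0.05):
    """Ask the intelligence service which path to use; on any error
    or timeout, fall back to the first path in the list."""
    try:
        return decide_fn(paths, timeout=timeout_s)
    except Exception:
        return paths[0]

# Simulated outage: decide() raises, so the first path is used.
def broken_decide(paths, timeout):
    raise TimeoutError("intelligence service unreachable")

print(decide_with_fallback(broken_decide, ["gpt-4o", "claude-sonnet-4-20250514"]))
# -> gpt-4o
```

Because the fallback is the first path you listed, put your safest default model first.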
After ~20-50 outcomes per path, Kalibr has enough data to exploit confidently. Before that, expect more exploration.
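Why roughly that many? A back-of-envelope calculation, not Kalibr's actual algorithm: the uncertainty on an observed success rate shrinks with the square root of the sample count.

```python
import math

def stderr(p: float, n: int) -> float:
    """Standard error of an observed success rate p after n outcomes."""
    return math.sqrt(p * (1 - p) / n)

# At 30 outcomes, a 70% success rate is known to within ~8 points --
# enough to separate paths whose true rates differ meaningfully.
print(round(stderr(0.7, 30), 3))  # -> 0.084
```

Below that sample size, the estimates for different paths overlap too much to exploit safely, which is why you see more exploration early on.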
Yes. Install the extra with pip install kalibr[langchain], then call router.as_langchain() to get a LangChain-compatible chat model.
from kalibr import Router

router = Router(goal="summarize", paths=["gpt-4o", "claude-sonnet-4-20250514"])
llm = router.as_langchain()
chain = prompt | llm | parser
LangChain integration is Python-only. For TypeScript, use the Router directly or auto-instrumentation:
import { Router } from '@kalibr/sdk';
const router = new Router({
goal: 'summarize',
paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
});
// Use router.completion() directly
const response = await router.completion(messages);
await router.report(true);
No. Kalibr sees: which model was called, usage metrics (tokens, audio duration, image count), cost, latency, and success/failure. Your actual prompts and responses go directly to the LLM provider.
Declare a success_when predicate when you create the router:

router = Router(
goal="extract",
paths=["gpt-4o"],
success_when=lambda output: len(output) > 0
)
Or validate the output yourself and report the result:

result = json.loads(response.choices[0].message.content)
is_valid = validate_schema(result)
router.report(success=is_valid, reason=None if is_valid else "invalid_schema")
const router = new Router({
goal: 'extract',
paths: ['gpt-4o'],
successWhen: (output) => output.length > 0,
});
const result = JSON.parse(response.choices[0].message.content);
const isValid = validateSchema(result);
await router.report(isValid, isValid ? undefined : 'invalid_schema');
Use force_model to keep the same model across turns:
response1 = router.completion(messages=[...])
model = response1.model
response2 = router.completion(messages=[...], force_model=model)
router.report(success=issue_resolved)
Use forceModel to keep the same model across turns:
const response1 = await router.completion(messages);
const model = response1.model;
const response2 = await router.completion(messages, { forceModel: model });
await router.report(issueResolved);
Router is not thread-safe. Create separate Router instances per thread/task.
# Each thread gets its own router
def worker():
router = Router(goal="extract", paths=[...])
router.completion(...)
router.report(success=True)
Router is not thread-safe. In serverless/edge functions, create a new Router per request. In long-running Node.js apps, create separate instances per async context.
// Each request handler gets its own router
app.post('/api/extract', async (req, res) => {
const router = new Router({ goal: 'extract', paths: [...] });
const response = await router.completion(req.body.messages);
await router.report(true);
res.json(response);
});
Create a new goal with version suffix:
router = Router(goal="extract_company_v2", paths=[...])
const router = new Router({ goal: 'extract_company_v2', paths: [...] });
Yes. Paths can include parameters, so the same model can compete against itself at different settings:

router = Router(
goal="creative_writing",
paths=[
{"model": "gpt-4o", "params": {"temperature": 0.3}},
{"model": "gpt-4o", "params": {"temperature": 0.9}}
]
)
const router = new Router({
goal: 'creative_writing',
paths: [
{ model: 'gpt-4o', params: { temperature: 0.3 } },
{ model: 'gpt-4o', params: { temperature: 0.9 } },
],
});
It's similar in spirit but different in execution. Traditional A/B testing:
- splits traffic on a fixed schedule
- waits for significance, then a human ships the winner
Kalibr:
- shifts traffic continuously as outcomes arrive
- promotes the winner automatically, and keeps re-checking it
Think of it as A/B testing that never ends and deploys itself.
They optimize for metrics they can measure: cost, latency, uptime. Kalibr optimizes for your definition of success.
The litmus test: If a model starts returning syntactically valid but semantically wrong answers for three days, will the router notice?
Other routers: "This model is cheapest and fastest"
Kalibr: "This model actually works for your task"
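The litmus test above hinges on a semantic check feeding the outcome signal. A standalone sketch of that kind of check, the sort of predicate you would wire into success_when or report(); the JSON shape and field name are illustrative:

```python
import json

def extraction_succeeded(raw: str) -> bool:
    """Syntactic validity alone is not success: the payload must also
    pass a semantic check (here, a non-empty company field)."""
    try:
        data = json.loads(raw)        # syntactic check
    except json.JSONDecodeError:
        return False
    return bool(data.get("company"))  # semantic check

print(extraction_succeeded('{"company": ""}'))  # valid JSON, wrong answer -> False
```

A router that only watches cost and latency never runs a check like this, so it cannot notice the three-day failure.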
LangSmith is observability: it shows you what happened. You look at dashboards, notice problems, then manually change your code.
Kalibr is autonomous optimization: it changes what happens next, without human intervention.
LangSmith: "Here's a dashboard showing gpt-4o failed 20% of the time yesterday. You should probably do something about that."
Kalibr: "gpt-4o started failing more. Traffic automatically shifted to Claude. You didn't have to do anything."
Observability tells you there's a problem. Kalibr fixes it.
The SDK falls back to the first path in your list. Your application keeps working; it just loses the optimization benefits until Kalibr recovers.
We designed for this explicitly. Kalibr should never be a single point of failure for your agent.
Kalibr needs enough outcomes to learn. Rough guidelines:
- ~20-50 outcomes per path before Kalibr exploits confidently
- below that, expect more exploration
If you have very low traffic for a goal, consider using get_policy() with a longer time window, or just hardcode the path.
No. Kalibr tracks:
- which model was called
- usage metrics (tokens, audio duration, image count)
- cost and latency
- success/failure and the failure category
Kalibr does not track:
- prompt content
- completion content
Your prompts and completions go directly to providers. Kalibr only sees metadata.
Twelve structured categories: timeout, context_exceeded, tool_error, rate_limited, validation_failed, hallucination_detected, user_unsatisfied, empty_response, malformed_output, auth_error, provider_error, unknown.
Use the FAILURE_CATEGORIES constant from the SDK for client-side validation:
from kalibr import FAILURE_CATEGORIES

router.report(success=False, failure_category="timeout")
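The SDK exports the constant; here is a standalone sketch of the same client-side check, mirroring the twelve categories listed above:

```python
# The twelve categories from the answer above, mirrored locally.
FAILURE_CATEGORIES = frozenset({
    "timeout", "context_exceeded", "tool_error", "rate_limited",
    "validation_failed", "hallucination_detected", "user_unsatisfied",
    "empty_response", "malformed_output", "auth_error", "provider_error",
    "unknown",
})

def check_category(category: str) -> str:
    """Reject unknown categories before sending the report."""
    if category not in FAILURE_CATEGORIES:
        raise ValueError(f"unknown failure_category: {category}")
    return category

print(check_category("timeout"))  # -> timeout
```

Validating locally fails fast instead of waiting for the service to reject a typo.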
Yes. Use update_outcome() to correct outcomes when real-world signals arrive later. Only the fields you explicitly pass are updated; everything else keeps its original value.
from kalibr import update_outcome
update_outcome(trace_id="abc123", goal="resolve_ticket",
success=False, failure_category="user_unsatisfied")
Call get_insights() to get structured diagnostics per goal: health status, failure modes, path comparisons, parameter sensitivity, and actionable signals. Designed for coding agents that need to decide what to improve.
from kalibr import get_insights
insights = get_insights(goal="resolve_ticket")
for signal in insights["goals"][0]["actionable_signals"]:
print(signal["type"], signal["severity"])
When the auto_path_generation feature flag is enabled, yes. A background job runs hourly, identifies parameters at the boundary of explored space, and generates new paths with interpolated values. New paths are evaluated through existing exploration traffic. Underperformers are cleaned up automatically after 30+ samples.
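The exact generation strategy is internal to Kalibr; a toy sketch of interpolating new candidate values between explored parameter settings:

```python
def interpolated_candidates(values):
    """Propose midpoints between neighboring explored values --
    a simplified sketch of interpolation over explored space."""
    explored = sorted(set(values))
    return [round((a + b) / 2, 2) for a, b in zip(explored, explored[1:])]

# Temperatures 0.3 and 0.9 have been explored; 0.6 becomes a candidate.
print(interpolated_candidates([0.3, 0.9]))  # -> [0.6]
```

The real job also prunes: candidates that underperform after 30+ samples are removed automatically.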
Email support@kalibr.systems or check the dashboard.