When an agent is running your production pipeline, Kalibr is not just instrumentation. It becomes part of how the agent decides what to do next. This page describes the execution protocol the agent runs, the goal taxonomy it uses to classify requests, and how to wire this into an OpenClaw workspace.
The agent runs this protocol for every task it handles. Not occasionally. Every task.
```python
from kalibr import get_policy, report_outcome

for task in tasks:
    goal_id = classify(task)  # see taxonomy below
    policy = get_policy(goal=goal_id)
    output = execute(task, model=policy["recommended_model"])
    success, score = evaluate(output, goal_id)
    report_outcome(
        trace_id=policy["trace_id"],
        goal=goal_id,
        success=success,
        score=score,
    )
```

The agent uses this table to classify each atomic task. Input type, output type, and cognitive load determine the `goal_id`. The `goal_id` determines the default path ordering and the success contract.
Default path ordering is the cold-start order, with cheapest capable model first. Thompson Sampling shifts traffic away from this order based on actual outcomes in your production environment. The order is a starting point, not a fixed rule.
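As a sketch of how Thompson Sampling shifts traffic away from the cold-start order, here is a minimal Beta-Bernoulli bandit over one goal's model path. This illustrates the general technique only; the model names and helper functions are illustrative, not Kalibr's internal implementation.

```python
import random

# Cold-start order for a low-load goal: cheapest capable model first.
cold_start_order = ["deepseek", "llama", "mixtral", "gpt-4o-mini"]

# One Beta(successes, failures) posterior per model, seeded at (1, 1).
posteriors = {m: [1, 1] for m in cold_start_order}

def pick_model():
    # Sample a plausible success rate from each posterior; route to the best draw.
    draws = {m: random.betavariate(a, b) for m, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

def record_outcome(model, success):
    # Reported outcomes shift future traffic toward models that actually succeed.
    posteriors[model][0 if success else 1] += 1
```

Early on, draws are noisy and traffic spreads across the path; as outcomes accumulate, the posterior for the best-performing model tightens and wins most draws.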
| goal_id | Input to Output | Load | Default path order | Success contract |
|---|---|---|---|---|
| web_scraping | URL to rows | low | DeepSeek, Llama, Mixtral, gpt-4o-mini | field_completeness >= 0.8, min 1 row |
| data_enrichment | rows to rows | low | DeepSeek, Llama, Qwen, gpt-4o-mini | null_rate_after < null_rate_before |
| lead_scoring | text to score | low | DeepSeek, Llama, Mixtral, gpt-4o-mini | score numeric, in [0, 100] |
| classification | text to label | low | DeepSeek, Llama, Qwen, gpt-4o-mini | label in allowed_labels |
| summarization | text to prose | low | DeepSeek, Llama, Mixtral, claude-haiku | compression ratio 0.05 to 0.4 |
| data_pipeline | data to rows | low | DeepSeek, Llama, Qwen, gpt-4o-mini | rows_out > 0, no exception |
| research | text to synthesis | medium | Llama, DeepSeek, deepseek-r1, claude-sonnet | min 200 chars, no error markers |
| outreach_generation | rows to content | medium | Llama, DeepSeek, Mixtral, claude-sonnet | subject + body present, 50-2000 chars |
| code_generation | any to code | high | Sonnet, GPT-4o, o3-mini, deepseek-r1 | AST parse passes or tests_pass = True |
| code_review | code to prose | high | Sonnet, GPT-4o, deepseek-r1, o3-mini | min 50 chars of structured feedback |
| system_design | any to prose | high | Sonnet, deepseek-r1, GPT-4o, o3-mini | min 200 chars of structured output |
| agent_orchestration | multi to coordinates | high | Sonnet, GPT-4o, deepseek-r1, o3-mini | subtasks_completed = True, no timeout |
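The table above can be read as a lookup from (input type, output type, load) to `goal_id`. A minimal sketch, assuming string-valued task attributes; this `classify` helper is illustrative, not part of the Kalibr SDK:

```python
# Taxonomy from the table: (input, output, load) -> goal_id.
TAXONOMY = {
    ("url", "rows", "low"): "web_scraping",
    ("rows", "rows", "low"): "data_enrichment",
    ("text", "score", "low"): "lead_scoring",
    ("text", "label", "low"): "classification",
    ("text", "prose", "low"): "summarization",
    ("data", "rows", "low"): "data_pipeline",
    ("text", "synthesis", "medium"): "research",
    ("rows", "content", "medium"): "outreach_generation",
    ("any", "code", "high"): "code_generation",
    ("code", "prose", "high"): "code_review",
    ("any", "prose", "high"): "system_design",
    ("multi", "coordinates", "high"): "agent_orchestration",
}

def classify(input_type, output_type, load):
    """Map a task's structure to a goal_id, or None if it carries no routing signal."""
    return TAXONOMY.get((input_type, output_type, load))
```

A `None` result means the task falls outside the taxonomy and should bypass routing entirely.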
Skip routing for conversational replies, status checks, config changes, memory operations, and simple lookups. These carry no signal worth routing. Calling get_policy() and report_outcome() on them adds noise without improving the router.
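A minimal guard for this skip rule might look like the following, assuming a hypothetical keyword heuristic and a `DEFAULT_MODEL` fallback; a real agent would decide from task structure rather than keywords:

```python
DEFAULT_MODEL = "gpt-4o-mini"  # hypothetical fallback for non-routable tasks

# Assumed markers for tasks that carry no routing signal.
NON_ROUTABLE_MARKERS = ("status", "config", "remember that", "thanks")

def should_route(task_text):
    """Return False for tasks that should bypass get_policy()/report_outcome()."""
    lowered = task_text.lower()
    return not any(marker in lowered for marker in NON_ROUTABLE_MARKERS)
```

Non-routable tasks go straight to the default model; neither get_policy() nor report_outcome() is called for them.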
If you are using OpenClaw as your orchestrator, the setup has two parts: the plugin (so Kalibr can see every call) and the L0 routing directive (so the agent uses the execution protocol above at runtime). Both are required. The plugin alone is observability. The directive is what makes it routing.
```shell
openclaw plugins install @kalibr/openclaw
openclaw config set plugins.entries.kalibr.config.apiKey "sk_..."
openclaw config set plugins.entries.kalibr.config.tenantId "user_..."
openclaw config set plugins.entries.kalibr.config.enableRouting true
openclaw gateway restart
```
Paste this block into your AGENTS.md or equivalent workspace context file. It loads every session. The agent reads it at startup and uses it to classify every incoming request, select the routing path, and run the correct eval. Without this block, the plugin instruments calls but the agent routes arbitrarily. With it, the agent is the router.
## Kalibr Routing (L0 -- loads every session)
Step 1 -- Decompose compound requests first.
Before classifying, check: is this one task or multiple?
"Research 20 startups and draft cold emails" = two tasks.
Split, classify, and route each independently.
Never classify a compound request as a single goal.
Step 2 -- Classify by task structure, not topic.
For each atomic task determine:
Input type: URL/data | free text | nothing
Output type: structured rows | prose | code | drafted content
Cognitive load: low | medium | high
Routing table:
| Structure | Load | First model | goal_ids |
|---|---|---|---|
| URL/data -> rows | low | DeepSeek | web_scraping, data_enrichment, lead_scoring, classification, data_pipeline |
| text -> prose | low | DeepSeek | summarization |
| text -> prose/synthesis | medium | Llama | research |
| rows -> drafted content | medium | Llama | outreach_generation |
| anything -> code | high | Sonnet | code_generation, code_review, system_design |
| multi-task -> coordinates | high | Sonnet | agent_orchestration |
Step 3 -- Call classify_and_route() per atomic task.
from kalibr_integration.kalibr_pipeline import classify_and_route
output = classify_and_route("Scrape competitor pricing from these 5 sites")
output = classify_and_route("Draft cold email for Series A founders",
                            context={"system_prompt": "...", "max_tokens": 500})
Goal types and eval contracts:
| goal_id | Load | Eval contract |
|---|---|---|
| web_scraping | low | structural: field_completeness >= 0.8, min 1 row |
| data_enrichment | low | structural: null_rate_after < null_rate_before |
| lead_scoring | low | structural: score numeric, in [0, 100] |
| classification | low | structural: label in allowed_labels |
| summarization | low | structural: compression ratio 0.05 to 0.4 |
| data_pipeline | low | structural: rows_out > 0, no exception |
| research | medium | structural: min 200 chars, no error markers; plus float judge (20% sample, DeepSeek, async) |
| outreach_generation | medium | structural: subject + body present, 50-2000 chars; plus float judge (20% sample, DeepSeek, async) |
| code_generation | high | tests: AST parse passes or tests_pass = True |
| code_review | high | structural: min 50 chars |
| system_design | high | structural: min 200 chars |
| agent_orchestration | high | structural: subtasks_completed = True, no timeout |
Skip routing for: conversational replies, status checks, config changes,
memory ops, simple lookups. No signal worth routing.
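The structural contracts above are cheap, deterministic checks. As a sketch, a few of them might be implemented like this; the function names and return shapes are assumptions, not the Kalibr SDK:

```python
# Illustrative structural evaluators for three of the contracts above.
# Each returns (success, score) as the execution protocol expects.

def eval_lead_scoring(output):
    # Contract: score numeric, in [0, 100].
    try:
        score = float(output)
    except (TypeError, ValueError):
        return False, 0.0
    ok = 0 <= score <= 100
    return ok, score / 100 if ok else 0.0

def eval_summarization(summary, source_text):
    # Contract: compression ratio between 0.05 and 0.4.
    ratio = len(summary) / max(len(source_text), 1)
    return 0.05 <= ratio <= 0.4, ratio

def eval_classification(label, allowed_labels):
    # Contract: label must be one of the allowed labels.
    ok = label in allowed_labels
    return ok, 1.0 if ok else 0.0
```

Because these checks never call a model, they add no latency or cost to the routing loop; only the medium-load goals layer an async sampled judge on top.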