Path 2

Agent as orchestrator

When an agent is running your production pipeline, Kalibr is not just instrumentation. It becomes part of how the agent decides what to do next. This page describes the execution protocol the agent runs, the goal taxonomy it uses to classify requests, and how to wire this into an OpenClaw workspace.

Note: This path is for agents that are already running a pipeline, not for agents writing code. If you want a coding agent to instrument a codebase, that is Path 1.

The execution loop

The agent runs this protocol for every task it handles. Not occasionally. Every task.

| Task | Classify as | Success contract |
|---|---|---|
| "Scrape pricing from these 5 competitor URLs" | web_scraping (URL input, rows output, low load) | field_completeness ≥ 0.8, min 1 row |
| "Draft a cold email for this Series A founder" | outreach_generation (rows input, content output, medium load) | subject + body present, 50 to 2000 chars |
| "Write a Python function to parse this JSON schema" | code_generation (any input, code output, high load) | AST parse passes or tests_pass = True |
Execution protocol
01
Decompose compound requests
Before classifying, check: is this one task or multiple? "Research 20 startups and draft cold emails" is two tasks. Split and classify each independently. Never classify a compound request as a single goal. Signal pollution degrades routing quality over time.
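Decomposition can be as simple as splitting on coordinating conjunctions before classification. A minimal sketch, assuming a heuristic splitter; the `decompose` helper and its conjunction list are illustrative, not part of the Kalibr SDK:

```python
import re

def decompose(request: str) -> list[str]:
    """Split a compound request into atomic tasks on coordinating
    conjunctions ('and then', 'then', 'and'). Heuristic only."""
    parts = re.split(r"\s+(?:and then|then|and)\s+", request)
    return [p.strip() for p in parts if p.strip()]

tasks = decompose("Research 20 startups and draft cold emails")
# Two atomic tasks: ["Research 20 startups", "draft cold emails"]
```

Each element of `tasks` is then classified and routed independently.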
02
Classify to a goal_id
Classify by task structure, not topic. Three dimensions: input type (URL/data, free text, or nothing), output type (structured rows, prose, code, or drafted content), and cognitive load (low, medium, or high). See the taxonomy table below.
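Structure-based classification can be a lookup on the three dimensions. A sketch under stated assumptions: the `(input, output, load)` triples mirror the taxonomy table, but the dictionary and fallback logic here are illustrative, not a Kalibr API:

```python
# Map (input type, output type, cognitive load) -> goal_id.
# Entries mirror the taxonomy table; coverage is partial for brevity.
GOAL_TABLE = {
    ("url", "rows", "low"): "web_scraping",
    ("rows", "rows", "low"): "data_enrichment",
    ("text", "label", "low"): "classification",
    ("text", "prose", "low"): "summarization",
    ("rows", "content", "medium"): "outreach_generation",
    ("any", "code", "high"): "code_generation",
}

def classify(input_type: str, output_type: str, load: str) -> str:
    # Try the exact triple first, then an "any input" entry,
    # then fall back to a general-purpose goal.
    key = (input_type, output_type, load)
    return GOAL_TABLE.get(key) or GOAL_TABLE.get(("any", output_type, load), "research")

print(classify("url", "rows", "low"))  # web_scraping
```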
03
Call get_policy() before executing
This returns the recommended model for the goal and a trace_id. Use the recommended model. Never skip this step and never cache the result across requests.
04
Execute and evaluate
Run the task on the recommended model. Evaluate the output against the success contract for that goal_id. Structural evals fire synchronously. Boolean pass/fail only.
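A structural eval for the web_scraping contract might look like the following sketch. The 0.8 threshold comes from the success contract above; the row shape and helper name are assumptions:

```python
def eval_web_scraping(rows: list[dict], expected_fields: list[str]):
    """Boolean pass/fail against the web_scraping contract:
    field_completeness >= 0.8 and at least 1 row."""
    if not rows:
        return False, 0.0
    filled = sum(
        1 for row in rows for f in expected_fields if row.get(f) not in (None, "")
    )
    completeness = filled / (len(rows) * len(expected_fields))
    return completeness >= 0.8, completeness

ok, score = eval_web_scraping(
    [{"name": "Acme", "price": "$49"}, {"name": "Beta", "price": None}],
    ["name", "price"],
)
# ok is False (completeness 0.75 < 0.8), score is 0.75
```

The boolean feeds `success` and the completeness ratio feeds `score` in the report_outcome() call.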
05
Call report_outcome() every single time
Pass trace_id, goal_id, success (bool), and score (float 0-1). Do this on failure too. Failures are signals. Skipping this step means the router never learns from that run. This is the most commonly missed step.

Code pattern

pipeline.py
from kalibr import get_policy, report_outcome

for task in tasks:
    goal_id        = classify(task)          # see taxonomy below
    policy         = get_policy(goal=goal_id)
    output         = execute(task, model=policy["recommended_model"])
    success, score = evaluate(output, goal_id)
    report_outcome(
        trace_id=policy["trace_id"],
        goal=goal_id,
        success=success,
        score=score
    )

Goal taxonomy

The agent uses this table to classify each atomic task. Input type, output type, and cognitive load determine the goal_id. The goal_id determines the default path ordering and the success contract.

Default path ordering is the cold-start order, with cheapest capable model first. Thompson Sampling shifts traffic away from this order based on actual outcomes in your production environment. The order is a starting point, not a fixed rule.
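The shift away from the cold-start order can be pictured as per-model Beta posteriors sampled on each request. A minimal sketch of Thompson Sampling, assuming a success/failure counter per model; this is illustrative bookkeeping, not the Kalibr internals:

```python
import random

class ThompsonRouter:
    def __init__(self, models):
        # Beta(1, 1) uniform prior per model: [successes + 1, failures + 1]
        self.stats = {m: [1, 1] for m in models}

    def choose(self) -> str:
        # Sample a win rate from each model's posterior; pick the max.
        samples = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, model: str, success: bool) -> None:
        self.stats[model][0 if success else 1] += 1

router = ThompsonRouter(["DeepSeek", "Llama", "Mixtral", "gpt-4o-mini"])
model = router.choose()
router.update(model, success=True)
```

Models that keep passing their success contracts accumulate posterior mass and win the sample more often, so traffic drifts toward them without any hard cutover.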

| goal_id | Input to Output | Load | Default path order | Success contract |
|---|---|---|---|---|
| web_scraping | URL to rows | low | DeepSeek, Llama, Mixtral, gpt-4o-mini | field_completeness >= 0.8, min 1 row |
| data_enrichment | rows to rows | low | DeepSeek, Llama, Qwen, gpt-4o-mini | null_rate_after < null_rate_before |
| lead_scoring | text to score | low | DeepSeek, Llama, Mixtral, gpt-4o-mini | score numeric, in [0, 100] |
| classification | text to label | low | DeepSeek, Llama, Qwen, gpt-4o-mini | label in allowed_labels |
| summarization | text to prose | low | DeepSeek, Llama, Mixtral, claude-haiku | compression ratio 0.05 to 0.4 |
| data_pipeline | data to rows | low | DeepSeek, Llama, Qwen, gpt-4o-mini | rows_out > 0, no exception |
| research | text to synthesis | medium | Llama, DeepSeek, deepseek-r1, claude-sonnet | min 200 chars, no error markers |
| outreach_generation | rows to content | medium | Llama, DeepSeek, Mixtral, claude-sonnet | subject + body present, 50-2000 chars |
| code_generation | any to code | high | Sonnet, GPT-4o, o3-mini, deepseek-r1 | AST parse passes or tests_pass = True |
| code_review | code to prose | high | Sonnet, GPT-4o, deepseek-r1, o3-mini | min 50 chars of structured feedback |
| system_design | any to prose | high | Sonnet, deepseek-r1, GPT-4o, o3-mini | min 200 chars of structured output |
| agent_orchestration | multi to coordinates | high | Sonnet, GPT-4o, deepseek-r1, o3-mini | subtasks_completed = True, no timeout |
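Two of the contracts above can be sketched as structural checks. The thresholds match the table (compression ratio 0.05 to 0.4, score in [0, 100]); the helper names are assumptions:

```python
def eval_summarization(source: str, summary: str):
    """Pass if the summary is 5% to 40% the length of the source."""
    ratio = len(summary) / max(len(source), 1)
    return 0.05 <= ratio <= 0.4, ratio

def eval_lead_scoring(output) -> bool:
    """Pass if the output parses as a number in [0, 100]."""
    try:
        score = float(output)
    except (TypeError, ValueError):
        return False
    return 0 <= score <= 100

ok, ratio = eval_summarization("x" * 1000, "y" * 200)
# ok is True, ratio is 0.2
```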

Skip routing for

Conversational replies, status checks, config changes, memory operations, and simple lookups. These carry no signal worth routing. Calling get_policy() and report_outcome() on these adds noise without improving the router.

OpenClaw setup

If you are using OpenClaw as your orchestrator, the setup has two parts: the plugin (so Kalibr can see every call) and the L0 routing directive (so the agent uses the execution protocol above at runtime). Both are required. The plugin alone is observability. The directive is what makes it routing.

Install the plugin

terminal
openclaw plugins install @kalibr/openclaw
openclaw config set plugins.entries.kalibr.config.apiKey    "sk_..."
openclaw config set plugins.entries.kalibr.config.tenantId  "user_..."
openclaw config set plugins.entries.kalibr.config.enableRouting true
openclaw gateway restart

Add the L0 routing directive

Paste this block into your AGENTS.md or equivalent workspace context file. It loads every session. The agent reads it at startup and uses it to classify every incoming request, select the routing path, and run the correct eval. Without this block, the plugin instruments calls but the agent routes arbitrarily. With it, the agent is the router.

AGENTS.md -- L0 context block
## Kalibr Routing (L0 -- loads every session)

Step 1 -- Decompose compound requests first.
Before classifying, check: is this one task or multiple?
"Research 20 startups and draft cold emails" = two tasks.
Split, classify, and route each independently.
Never classify a compound request as a single goal.

Step 2 -- Classify by task structure, not topic.
For each atomic task determine:
  Input type:     URL/data | free text | nothing
  Output type:    structured rows | prose | code | drafted content
  Cognitive load: low | medium | high

Routing table:
  URL/data   -> rows            low    DeepSeek first   web_scraping, data_enrichment, lead_scoring, classification, data_pipeline
  text       -> prose           low    DeepSeek first   summarization
  text       -> prose/synthesis medium Llama first      research
  rows       -> drafted content medium Llama first      outreach_generation
  anything   -> code            high   Sonnet first     code_generation, code_review, system_design
  multi-task -> coordinates     high   Sonnet first     agent_orchestration

Step 3 -- Call classify_and_route() per atomic task.
from kalibr_integration.kalibr_pipeline import classify_and_route

output = classify_and_route("Scrape competitor pricing from these 5 sites")
output = classify_and_route("Draft cold email for Series A founders",
                             context={"system_prompt": "...", "max_tokens": 500})

Goal types and eval contracts:
web_scraping          low    structural: field_completeness >= 0.8, min 1 row
data_enrichment       low    structural: null_rate_after < null_rate_before
lead_scoring          low    structural: score numeric, in [0, 100]
classification        low    structural: label in allowed_labels
summarization         low    structural: compression ratio 0.05 to 0.4
data_pipeline         low    structural: rows_out > 0, no exception
research              medium structural: min 200 chars, no error markers + float judge (20% sample, DeepSeek, async)
outreach_generation   medium structural: subject + body present, 50-2000 chars + float judge (20% sample, DeepSeek, async)
code_generation       high   tests: AST parse passes or tests_pass = True
code_review           high   structural: min 50 chars of structured feedback
system_design         high   structural: min 200 chars of structured output
agent_orchestration   high   structural: subtasks_completed = True, no timeout

Skip routing for: conversational replies, status checks, config changes,
memory ops, simple lookups. No signal worth routing.

Next

Goal taxonomy reference

How routing works