Path 2

Agent as orchestrator

When an agent is running your production pipeline, Kalibr is not just instrumentation. It becomes part of how the agent decides what to do next. This page describes the execution protocol the agent runs, the goal taxonomy it uses to classify requests, and how to wire this into an OpenClaw workspace.

Note: This path is for agents that are already running a pipeline, not for agents writing code. If you want a coding agent to instrument a codebase, that is Path 1.

The execution loop

The agent runs this protocol for every task it handles. Not occasionally. Every task.

| Task | Classify as | Success contract |
|---|---|---|
| "Scrape pricing from these 5 competitor URLs" | web_scraping (URL input, rows output, low load) | field_completeness ≥ 0.8, min 1 row |
| "Draft a cold email for this Series A founder" | outreach_generation (rows input, content output, medium load) | subject + body present, 50 to 2000 chars |
| "Write a Python function to parse this JSON schema" | code_generation (any input, code output, high load) | AST parse passes or tests_pass = True |
Execution protocol
01
Decompose compound requests
Before classifying, check: is this one task or multiple? "Research 20 startups and draft cold emails" is two tasks. Split and classify each independently. Never classify a compound request as a single goal. Signal pollution degrades routing quality over time.
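Decomposition can be as simple as splitting on coordinating conjunctions before classification. A minimal sketch, assuming a heuristic splitter; the `decompose` helper and its conjunction list are illustrative, not part of the Kalibr SDK:

```python
import re

def decompose(request: str) -> list[str]:
    """Split a compound request into atomic tasks on coordinating
    conjunctions ('and then', 'then', 'and'). Heuristic only."""
    parts = re.split(r"\s+(?:and then|then|and)\s+", request)
    return [p.strip() for p in parts if p.strip()]

tasks = decompose("Research 20 startups and draft cold emails")
# Two atomic tasks: ["Research 20 startups", "draft cold emails"]
```

Each element of `tasks` is then classified and routed independently.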
02
Classify to a goal_id
Classify by task structure, not topic. Three dimensions: input type (URL/data, free text, or nothing), output type (structured rows, prose, code, or drafted content), and cognitive load (low, medium, or high). See the taxonomy table below.
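Structure-based classification can be a lookup on the three dimensions. A sketch under stated assumptions: the `(input, output, load)` triples mirror the taxonomy table, but the dictionary and fallback logic here are illustrative, not a Kalibr API:

```python
# Map (input type, output type, cognitive load) -> goal_id.
# Entries mirror the taxonomy table; coverage is partial for brevity.
GOAL_TABLE = {
    ("url", "rows", "low"): "web_scraping",
    ("rows", "rows", "low"): "data_enrichment",
    ("text", "label", "low"): "classification",
    ("text", "prose", "low"): "summarization",
    ("rows", "content", "medium"): "outreach_generation",
    ("any", "code", "high"): "code_generation",
}

def classify(input_type: str, output_type: str, load: str) -> str:
    # Try the exact triple first, then an "any input" entry,
    # then fall back to a general-purpose goal.
    key = (input_type, output_type, load)
    return GOAL_TABLE.get(key) or GOAL_TABLE.get(("any", output_type, load), "research")

print(classify("url", "rows", "low"))  # web_scraping
```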
03
Call get_policy() before executing
This returns the recommended model for the goal and a trace_id. Use the recommended model. Never skip this step and never cache the result across requests.
04
Execute and evaluate
Run the task on the recommended model. Evaluate the output against the success contract for that goal_id. Structural evals fire synchronously. Boolean pass/fail only.
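A structural eval for the web_scraping contract might look like the following sketch. The 0.8 threshold comes from the success contract above; the row shape and helper name are assumptions:

```python
def eval_web_scraping(rows: list[dict], expected_fields: list[str]):
    """Boolean pass/fail against the web_scraping contract:
    field_completeness >= 0.8 and at least 1 row."""
    if not rows:
        return False, 0.0
    filled = sum(
        1 for row in rows for f in expected_fields if row.get(f) not in (None, "")
    )
    completeness = filled / (len(rows) * len(expected_fields))
    return completeness >= 0.8, completeness

ok, score = eval_web_scraping(
    [{"name": "Acme", "price": "$49"}, {"name": "Beta", "price": None}],
    ["name", "price"],
)
# ok is False (completeness 0.75 < 0.8), score is 0.75
```

The boolean feeds `success` and the completeness ratio feeds `score` in the report_outcome() call.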
05
Call report_outcome() every single time
Pass trace_id, goal_id, success (bool), and score (float 0-1). Do this on failure too. Failures are signals. Skipping this step means the router never learns from that run. This is the most commonly missed step.

Code pattern

pipeline.py
from kalibr import get_policy, report_outcome

for task in tasks:
    goal_id        = classify(task)          # see taxonomy below
    policy         = get_policy(goal=goal_id)
    output         = execute(task, model=policy["recommended_model"])
    success, score = evaluate(output, goal_id)
    report_outcome(
        trace_id=policy["trace_id"],
        goal=goal_id,
        success=success,
        score=score
    )

Goal taxonomy

The agent uses this table to classify each atomic task. Input type, output type, and cognitive load determine the goal_id. The goal_id determines the default path ordering and the success contract.

Default path ordering is the cold-start order, with cheapest capable model first. Thompson Sampling shifts traffic away from this order based on actual outcomes in your production environment. The order is a starting point, not a fixed rule.
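The shift away from the cold-start order can be pictured as per-model Beta posteriors sampled on each request. A minimal sketch of Thompson Sampling, assuming a success/failure counter per model; this is illustrative bookkeeping, not the Kalibr internals:

```python
import random

class ThompsonRouter:
    def __init__(self, models):
        # Beta(1, 1) uniform prior per model: [successes + 1, failures + 1]
        self.stats = {m: [1, 1] for m in models}

    def choose(self) -> str:
        # Sample a win rate from each model's posterior; pick the max.
        samples = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, model: str, success: bool) -> None:
        self.stats[model][0 if success else 1] += 1

router = ThompsonRouter(["DeepSeek", "Llama", "Mixtral", "gpt-4o-mini"])
model = router.choose()
router.update(model, success=True)
```

Models that keep passing their success contracts accumulate posterior mass and win the sample more often, so traffic drifts toward them without any hard cutover.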

| goal_id | Input to Output | Load | Default path order | Success contract |
|---|---|---|---|---|
| web_scraping | URL to rows | low | DeepSeek, Llama, Mixtral, gpt-4o-mini | field_completeness >= 0.8, min 1 row |
| data_enrichment | rows to rows | low | DeepSeek, Llama, Qwen, gpt-4o-mini | null_rate_after < null_rate_before |
| lead_scoring | text to score | low | DeepSeek, Llama, Mixtral, gpt-4o-mini | score numeric, in [0, 100] |
| classification | text to label | low | DeepSeek, Llama, Qwen, gpt-4o-mini | label in allowed_labels |
| summarization | text to prose | low | DeepSeek, Llama, Mixtral, claude-haiku | compression ratio 0.05 to 0.4 |
| data_pipeline | data to rows | low | DeepSeek, Llama, Qwen, gpt-4o-mini | rows_out > 0, no exception |
| research | text to synthesis | medium | Llama, DeepSeek, deepseek-r1, claude-sonnet | min 200 chars, no error markers |
| outreach_generation | rows to content | medium | Llama, DeepSeek, Mixtral, claude-sonnet | subject + body present, 50-2000 chars |
| code_generation | any to code | high | Sonnet, GPT-4o, o3-mini, deepseek-r1 | AST parse passes or tests_pass = True |
| code_review | code to prose | high | Sonnet, GPT-4o, deepseek-r1, o3-mini | min 50 chars of structured feedback |
| system_design | any to prose | high | Sonnet, deepseek-r1, GPT-4o, o3-mini | min 200 chars of structured output |
| agent_orchestration | multi to coordinates | high | Sonnet, GPT-4o, deepseek-r1, o3-mini | subtasks_completed = True, no timeout |
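Two of the contracts above can be sketched as structural checks. The thresholds match the table (compression ratio 0.05 to 0.4, score in [0, 100]); the helper names are assumptions:

```python
def eval_summarization(source: str, summary: str):
    """Pass if the summary is 5% to 40% the length of the source."""
    ratio = len(summary) / max(len(source), 1)
    return 0.05 <= ratio <= 0.4, ratio

def eval_lead_scoring(output) -> bool:
    """Pass if the output parses as a number in [0, 100]."""
    try:
        score = float(output)
    except (TypeError, ValueError):
        return False
    return 0 <= score <= 100

ok, ratio = eval_summarization("x" * 1000, "y" * 200)
# ok is True, ratio is 0.2
```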

Skip routing for

Conversational replies, status checks, config changes, memory operations, and simple lookups. These carry no signal worth routing. Calling get_policy() and report_outcome() on these adds noise without improving the router.

OpenClaw setup

If you are using OpenClaw as your orchestrator, the setup has two parts: the plugin (so Kalibr can see every call) and the L0 routing directive (so the agent uses the execution protocol above at runtime). Both are required. The plugin alone is observability. The directive is what makes it routing.

Install the plugin

terminal
openclaw plugins install @kalibr/openclaw
openclaw config set plugins.entries.kalibr.config.apiKey    "sk_..."
openclaw config set plugins.entries.kalibr.config.tenantId  "user_..."
openclaw config set plugins.entries.kalibr.config.enableRouting true
openclaw gateway restart

Add the L0 routing directive

Paste this block into your AGENTS.md or equivalent workspace context file. It loads every session. The agent reads it at startup and uses it to classify every incoming request, select the routing path, and run the correct eval. Without this block, the plugin instruments calls but the agent routes arbitrarily. With it, the agent is the router.

AGENTS.md -- L0 context block
## Kalibr Routing (L0 -- loads every session)

Step 1 -- Decompose compound requests first.
Before classifying, check: is this one task or multiple?
"Research 20 startups and draft cold emails" = two tasks.
Split, classify, and route each independently.
Never classify a compound request as a single goal.

Step 2 -- Classify by task structure, not topic.
For each atomic task determine:
  Input type:     URL/data | free text | nothing
  Output type:    structured rows | prose | code | drafted content
  Cognitive load: low | medium | high

Routing table:
  URL/data   -> rows            low    DeepSeek first   web_scraping, data_enrichment, lead_scoring, classification, data_pipeline
  text       -> prose           low    DeepSeek first   summarization
  text       -> prose/synthesis medium Llama first      research
  rows       -> drafted content medium Llama first      outreach_generation
  anything   -> code            high   Sonnet first     code_generation, code_review, system_design
  multi-task -> coordinates     high   Sonnet first     agent_orchestration

Step 3 -- Call classify_and_route() per atomic task.
from kalibr_integration.kalibr_pipeline import classify_and_route

output = classify_and_route("Scrape competitor pricing from these 5 sites")
output = classify_and_route("Draft cold email for Series A founders",
                             context={"system_prompt": "...", "max_tokens": 500})

Goal types and eval contracts:
web_scraping          low    structural: field_completeness >= 0.8, min 1 row
data_enrichment       low    structural: null_rate_after < null_rate_before
lead_scoring          low    structural: score numeric, in [0, 100]
classification        low    structural: label in allowed_labels
summarization         low    structural: compression ratio 0.05 to 0.4
data_pipeline         low    structural: rows_out > 0, no exception
research              medium structural: min 200 chars, no error markers + float judge (20% sample, DeepSeek, async)
outreach_generation   medium structural: subject + body present, 50-2000 chars + float judge (20% sample, DeepSeek, async)
code_generation       high   tests: AST parse passes or tests_pass = True
code_review           high   structural: min 50 chars of structured feedback
system_design         high   structural: min 200 chars of structured output
agent_orchestration   high   structural: subtasks_completed = True, no timeout

Skip routing for: conversational replies, status checks, config changes,
memory ops, simple lookups. No signal worth routing.

Next

Goal taxonomy reference

How routing works