System Architecture
How Kalibr works under the hood.
Components
┌─────────────────────────────────────────────────────────────┐
│                      Your Application                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │ Python SDK  │  │   TS SDK    │  │  Framework Integs   │  │
│  │  (kalibr)   │  │(@kalibr/sdk)│  │ (LangChain/CrewAI)  │  │
│  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘  │
└─────────┼────────────────┼────────────────────┼─────────────┘
          │                │                    │
          └────────────────┼────────────────────┘
                           │ HTTPS (NDJSON)
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                       Kalibr Backend                        │
│                   api.kalibr.systems:443                    │
│   ┌─────────────────────────────────────────────────────┐   │
│   │    /api/ingest │ /api/otel/* │ /api/intelligence    │   │
│   └─────────────────────────────────────────────────────┘   │
│                              │                              │
│                              ▼                              │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                     ClickHouse                      │   │
│   │              (traces, outcomes tables)              │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               │ Aggregation
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Intelligence Service                     │
│                 kalibr-intelligence.fly.dev                 │
│   ┌─────────────────────────────────────────────────────┐   │
│   │    Pattern Engine │ Recommender │ Wilson Scoring    │   │
│   └─────────────────────────────────────────────────────┘   │
│                              │                              │
│                              ▼                              │
│          ┌──────────────┐         ┌──────────────┐          │
│          │    Redis     │         │  ClickHouse  │          │
│          │   (cache)    │         │  (patterns)  │          │
│          └──────────────┘         └──────────────┘          │
└─────────────────────────────────────────────────────────────┘
Data Flow
- SDK captures trace — LLM call metadata (model, tokens, cost, latency)
- NDJSON POST to /api/ingest — Batched events sent to backend
- Backend validates & enriches — Adds pricing, validates schema
- ClickHouse storage — Inserted into traces table
- Pattern aggregation — Every 5 minutes, patterns computed
- Intelligence queries — get_policy() queries patterns + outcomes
- Dashboard display — React frontend queries /api/otel/*
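The first two steps can be sketched as follows. This is a minimal illustration of NDJSON batching, not the SDK's actual code: the event field names are borrowed from the traces table columns and are only an assumption about the ingest schema, and authentication headers are omitted because they are not described here.

```python
import json
from typing import Iterable

# Host and path as shown in the component diagram.
INGEST_URL = "https://api.kalibr.systems/api/ingest"


def encode_ndjson(events: Iterable[dict]) -> bytes:
    """Serialize events as newline-delimited JSON: one compact object per line."""
    return "".join(
        json.dumps(event, separators=(",", ":")) + "\n" for event in events
    ).encode("utf-8")


# Hypothetical events; field names mirror the traces table, not a confirmed schema.
events = [
    {"trace_id": "t-1", "span_id": "s-1", "provider": "openai", "model_id": "gpt-4o",
     "input_tokens": 120, "output_tokens": 48, "duration_ms": 850, "status": "success"},
    {"trace_id": "t-1", "span_id": "s-2", "provider": "anthropic", "model_id": "claude-3",
     "input_tokens": 300, "output_tokens": 90, "duration_ms": 1200, "status": "error"},
]

body = encode_ndjson(events)
headers = {"Content-Type": "application/x-ndjson"}

# An HTTP client would then send the batch in a single request, for example:
#   httpx.post(INGEST_URL, content=body, headers=headers)
```

Because each line is a complete JSON object, the backend can validate and reject individual events without discarding the whole batch.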
SDKs
Python SDK (kalibr v1.2.0)
| Auto-instrumentation | OpenAI, Anthropic, Google SDKs |
| OpenTelemetry | OTLP export + local JSONL fallback |
| Intelligence | get_policy(), report_outcome() |
| Dependencies | httpx, opentelemetry, tiktoken |
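The "OTLP export + local JSONL fallback" row describes a common pattern: try the network exporter first, and write spans to a local file if that fails. The sketch below shows the general idea only. The function name, the `send` callable, and the file path are hypothetical; the SDK's real fallback location and retry behavior are not documented here.

```python
import json
from pathlib import Path
from typing import Callable, List


def export_with_fallback(
    spans: List[dict],
    send: Callable[[List[dict]], None],  # hypothetical primary exporter (e.g. OTLP)
    fallback_path: str,
) -> bool:
    """Try the primary exporter; on any failure append the spans to a JSONL file.

    Returns True if the primary export succeeded, False if the fallback was used.
    """
    try:
        send(spans)
        return True
    except Exception:
        # Append mode keeps spans from earlier failures; one JSON object per line.
        with Path(fallback_path).open("a", encoding="utf-8") as f:
            for span in spans:
                f.write(json.dumps(span) + "\n")
        return False
```

Writing one object per line means the fallback file can later be replayed or uploaded line by line.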
TypeScript SDK (@kalibr/sdk v1.0.0)
| Pattern | SpanBuilder (manual span creation) |
| Dependencies | Zero (uses native fetch) |
| Runtimes | Node.js 18+, Edge, Bun |
| Formats | CJS + ESM dual build |
Backend
Stack: FastAPI, ClickHouse (native protocol), Clerk (auth)
Key Routes
| Route | Purpose |
| /api/ingest | Event ingestion (NDJSON/JSON) |
| /api/otel/spans | Query spans with filters |
| /api/otel/metrics | Aggregated metrics |
| /api/intelligence/* | Proxy to intelligence service |
| /api/capsules/* | Cross-service trace propagation |
| /api/runtimes/* | Runtime registry |
Background Jobs
| Daily aggregation | 00:15 UTC — Compute daily summaries |
| Daily export | 00:30 UTC — Export to JSONL |
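For illustration, the next run time of each daily job can be computed from its fixed UTC wall-clock time. This is a sketch only; how the backend actually schedules these jobs is not stated, and the `JOBS` mapping and `next_run` helper are hypothetical names.

```python
from datetime import datetime, time, timedelta, timezone

# Job times taken from the table above; the dictionary itself is illustrative.
JOBS = {
    "daily_aggregation": time(0, 15),
    "daily_export": time(0, 30),
}


def next_run(now: datetime, at: time) -> datetime:
    """Return the next occurrence of a daily UTC time strictly after `now`."""
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), at, tzinfo=timezone.utc)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2024, 1, 1, 0, 20, tzinfo=timezone.utc)
schedule = {name: next_run(now, at) for name, at in JOBS.items()}
# At 00:20 UTC the aggregation has already run today, the export has not.
```

The 15-minute gap leaves the aggregation time to finish before the export starts.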
Intelligence Service
Deployed: kalibr-intelligence.fly.dev (separate microservice)
Features
- Pattern Engine — Aggregates traces into model performance patterns
- Recommender — Wilson scoring for statistical confidence
- Outcome Tracking — Stores success/failure outcomes
- Pareto Frontier — Identifies optimal cost/quality tradeoffs
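To make the last three features concrete, here is a small sketch that scores models with the lower bound of the Wilson score interval and then keeps only those on the cost/quality Pareto frontier. The Wilson formula is standard. The `ModelStats` type, the sample numbers, and the choice of the lower bound as the quality measure are assumptions for illustration, not the service's confirmed implementation.

```python
import math
from dataclasses import dataclass
from typing import List


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval (z = 1.96 is roughly 95%)."""
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


@dataclass(frozen=True)
class ModelStats:  # hypothetical aggregate, one per model
    model_id: str
    successes: int
    total: int
    avg_cost_usd: float


def pareto_frontier(stats: List[ModelStats]) -> List[str]:
    """Keep models that no other model beats on both cost and quality."""
    scored = [(s, wilson_lower_bound(s.successes, s.total)) for s in stats]
    frontier = []
    for s, q in scored:
        dominated = any(
            o.avg_cost_usd <= s.avg_cost_usd and oq >= q
            and (o.avg_cost_usd < s.avg_cost_usd or oq > q)
            for o, oq in scored
        )
        if not dominated:
            frontier.append(s.model_id)
    return frontier


stats = [
    ModelStats("model-a", successes=95, total=100, avg_cost_usd=0.010),
    ModelStats("model-b", successes=9, total=10, avg_cost_usd=0.002),
    ModelStats("model-c", successes=80, total=100, avg_cost_usd=0.012),
]
frontier = pareto_frontier(stats)
```

Here `model-b` has a 90% raw success rate but only ten samples, so its Wilson lower bound falls well below that of `model-a`. It still stays on the frontier because it is the cheapest option, while `model-c` drops out because `model-a` is both cheaper and more reliable.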
API Routes
Base: /api/v1/intelligence
| POST /policy | Goal-based model recommendation |
| POST /recommend | Task-based model recommendation |
| POST /report-outcome | Record outcome feedback |
| GET /patterns/{task_type} | Get aggregated patterns |
| POST /aggregate | Trigger manual aggregation |
Storage
ClickHouse
Column-oriented OLAP database, used here as the time-series store for traces. The backend connects over the native protocol on port 9000.
traces table
CREATE TABLE kalibr.traces (
    event_date Date,
    trace_id String,
    span_id String,
    parent_span_id String,
    tenant_id String,
    ts_start DateTime64(3),
    ts_end DateTime64(3),
    duration_ms UInt32,
    provider String,
    model_id String,
    operation String,
    input_tokens UInt32,
    output_tokens UInt32,
    cost_est_usd Float64,
    status String,
    error_type String,
    error_message String,
    ...
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, ts_start, trace_id);
outcomes table
CREATE TABLE kalibr.outcomes (
    trace_id String,
    tenant_id String,
    goal String,
    success UInt8,
    score Float64,
    failure_reason String,
    metadata String,
    created_at DateTime DEFAULT now()
) ENGINE = MergeTree()
ORDER BY (tenant_id, goal, created_at);
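The two tables share `trace_id` and `tenant_id`, so outcome feedback can be tied back to the model that produced each trace. Below is a minimal in-memory sketch of that join, assuming one outcome per trace. The real aggregation presumably runs inside ClickHouse; the sample rows and the `success_by_model` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical rows using a subset of the columns defined above.
traces = [
    {"trace_id": "t-1", "tenant_id": "acme", "model_id": "model-a"},
    {"trace_id": "t-2", "tenant_id": "acme", "model_id": "model-a"},
    {"trace_id": "t-3", "tenant_id": "acme", "model_id": "model-b"},
    {"trace_id": "t-4", "tenant_id": "other", "model_id": "model-b"},
]
outcomes = [
    {"trace_id": "t-1", "tenant_id": "acme", "goal": "summarize", "success": 1},
    {"trace_id": "t-2", "tenant_id": "acme", "goal": "summarize", "success": 0},
    {"trace_id": "t-3", "tenant_id": "acme", "goal": "summarize", "success": 1},
    {"trace_id": "t-4", "tenant_id": "other", "goal": "summarize", "success": 1},
]


def success_by_model(
    traces: List[dict], outcomes: List[dict], tenant_id: str, goal: str
) -> Dict[str, Dict[str, int]]:
    """Count successes and totals per model for one tenant and goal."""
    # Map each of the tenant's traces to the model that served it.
    model_of = {t["trace_id"]: t["model_id"] for t in traces if t["tenant_id"] == tenant_id}
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: {"successes": 0, "total": 0})
    for o in outcomes:
        if o["tenant_id"] != tenant_id or o["goal"] != goal:
            continue
        model = model_of.get(o["trace_id"])
        if model is None:  # outcome without a matching trace
            continue
        counts[model]["total"] += 1
        counts[model]["successes"] += o["success"]
    return dict(counts)


result = success_by_model(traces, outcomes, tenant_id="acme", goal="summarize")
```

Per-model success and total counts of this shape are the inputs a confidence score such as Wilson scoring needs.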
Deployment
Cloud (Managed)
| Backend | Fly.io |
| Intelligence | Fly.io (separate app) |
| ClickHouse | ClickHouse Cloud |
| Frontend | Vercel |
| Auth | Clerk |
Security
- TLS 1.3 — All connections encrypted
- API Keys — HMAC-verified, stored hashed
- Tenant Isolation — All queries filtered by tenant_id
- Clerk SSO — Dashboard authentication
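As an illustration of the API key and tenant isolation points, the sketch below stores only a keyed hash of each API key, verifies a presented key in constant time, and scopes rows to a single tenant. The document does not say how the HMAC is keyed or which hash function is used, so the server secret, SHA-256, and all function names here are assumptions.

```python
import hashlib
import hmac
import secrets
from typing import List

# Hypothetical server-side secret; in practice it would come from a secret store.
SERVER_SECRET = secrets.token_bytes(32)


def hash_api_key(api_key: str, secret: bytes = SERVER_SECRET) -> str:
    """Return the HMAC-SHA256 of the key; only this digest is stored."""
    return hmac.new(secret, api_key.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_api_key(presented: str, stored_hash: str, secret: bytes = SERVER_SECRET) -> bool:
    """Recompute the digest and compare in constant time to avoid timing leaks."""
    return hmac.compare_digest(hash_api_key(presented, secret), stored_hash)


def tenant_scoped(rows: List[dict], tenant_id: str) -> List[dict]:
    """Apply the tenant filter that every query is expected to carry."""
    return [row for row in rows if row["tenant_id"] == tenant_id]


stored = hash_api_key("sk_example_123")  # hypothetical key format
rows = [
    {"tenant_id": "acme", "trace_id": "t-1"},
    {"tenant_id": "other", "trace_id": "t-2"},
]
visible = tenant_scoped(rows, "acme")
```

Storing a keyed digest rather than the raw key means a leaked database alone is not enough to authenticate as a tenant.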