System Architecture Overview

Kalibr's architecture provides full-stack observability and cost tracking for multi-agent systems. This document outlines the internal components, data flow, and deployment topology.

Component Overview

1. Kalibr SDK

Location: /sdk/python/kalibr/

Purpose: Zero-config instrumentation for AI SDKs (OpenAI, Anthropic, Google)

Key Modules:

  • instrumentation/ — Provider-specific monkey patches
  • simple_tracer.py — Core tracing utilities
  • collector.py — OTel span collector
  • trace_capsule.py — Cross-service context propagation
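
In practice, instrumentation reduces to a single import plus a bootstrap call. The sketch below illustrates the intended flow; kalibr.init() is a hypothetical entry point used for illustration, not necessarily the SDK's documented API.

    # Minimal sketch of zero-config instrumentation.
    # NOTE: kalibr.init() is a hypothetical bootstrap call, assumed to
    # apply the provider monkey patches from instrumentation/.
    import kalibr
    from openai import OpenAI

    kalibr.init()

    client = OpenAI()
    # The patched client emits an OpenTelemetry span for this call and
    # appends it to /tmp/kalibr_otel_spans.jsonl.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}],
    )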

2. Local JSONL Buffer

Location: /tmp/kalibr_otel_spans.jsonl

Purpose: Temporary span storage prior to ingestion

Writes are append-only and lock-free, which keeps buffering fast and provides a safe fallback when ClickHouse is unavailable.
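
A minimal sketch of the write path, assuming illustrative span fields (the actual schema lives in the SDK):

    import json
    import time

    BUFFER_PATH = "/tmp/kalibr_otel_spans.jsonl"

    def buffer_span(span: dict) -> None:
        # Append mode maps to O_APPEND, so each write lands at the current
        # end of file and concurrent producers need no explicit lock.
        with open(BUFFER_PATH, "a") as f:
            f.write(json.dumps(span) + "\n")

    buffer_span({
        "trace_id": "abc123",               # field names are illustrative
        "name": "openai.chat.completions",
        "start_time": time.time_ns(),
        "duration_ms": 412,
    })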

3. OTel Bridge Service

Location: /backend/collectors/otel_bridge.py

Purpose: Syncs JSONL spans into ClickHouse for persistence and analytics

Performance:

  • Sync interval: 1s (configurable)
  • Batch size: 100 spans
  • Throughput: ~79,000 spans/sec
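
A simplified version of the sync loop is sketched below using the clickhouse-driver client; the table and column names are assumptions, and offset tracking and file truncation are omitted for brevity:

    import json
    import time

    from clickhouse_driver import Client

    BATCH_SIZE = 100
    SYNC_INTERVAL_S = 1.0
    client = Client(host="localhost", port=9000)

    def sync_once(path: str = "/tmp/kalibr_otel_spans.jsonl") -> None:
        with open(path) as f:
            spans = [json.loads(line) for line in f if line.strip()]
        for i in range(0, len(spans), BATCH_SIZE):
            batch = spans[i : i + BATCH_SIZE]
            # Illustrative schema; the real one lives in the bridge service.
            client.execute(
                "INSERT INTO otel_spans "
                "(trace_id, name, start_time, duration_ms) VALUES",
                [(s["trace_id"], s["name"], s["start_time"], s["duration_ms"])
                 for s in batch],
            )

    while True:
        sync_once()
        time.sleep(SYNC_INTERVAL_S)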

4. ClickHouse (Primary Data Store)

Ports: 9000 (native), 8123 (HTTP)

Purpose: Analytical storage for high-volume traces

Optimizations:

  • Columnar compression (~10x)
  • Time-based partitioning
  • Index granularity: 8192
  • Query P95 latency: ~230 ms on 25K+ spans
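
Taken together, these settings correspond to DDL along the lines of the sketch below (executed here via the clickhouse-driver client); the table definition is illustrative, not the production schema:

    from clickhouse_driver import Client

    # Shows how daily partitioning and the 8192 index granularity noted
    # above are expressed in ClickHouse DDL.
    Client(host="localhost", port=9000).execute("""
        CREATE TABLE IF NOT EXISTS otel_spans (
            trace_id    String,
            name        String,
            start_time  DateTime64(9),
            duration_ms UInt32
        )
        ENGINE = MergeTree
        PARTITION BY toYYYYMMDD(start_time)
        ORDER BY (start_time, trace_id)
        SETTINGS index_granularity = 8192
    """)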

5. MongoDB (Metadata Store)

Port: 27017

Purpose: Stores runtime, alert, and user metadata

Collections:

  • runtimes — agent/service registration
  • alerts — rule definitions
  • users — dashboard accounts
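
Registering a runtime, for example, is a single document insert; the sketch below uses pymongo with an assumed database name and document shape:

    from datetime import datetime, timezone

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["kalibr"]  # DB name assumed

    # Illustrative document shape for agent/service registration.
    db.runtimes.insert_one({
        "name": "billing-agent",
        "sdk_version": "0.1.0",
        "registered_at": datetime.now(timezone.utc),
    })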

6. FastAPI Backend

Port: 8001

Purpose: Unified REST API for querying traces and metrics

Routers:

  • /api/otel/* — Trace queries + metrics
  • /api/v1/* — Legacy endpoints
  • /api/health — System health check
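
The health check, for instance, reduces to a small route like the sketch below (the response shape is an assumption; the real check presumably also probes ClickHouse and MongoDB):

    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/api/health")
    def health() -> dict:
        # Illustrative response body only.
        return {"status": "ok"}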

7. React Dashboard

Port: 3000

Purpose: Interactive visualization of cost + behavior metrics

Pages:

  • Spans Viewer — Span table + filters
  • Cost Dashboard — Provider and model spend analytics
  • Performance — Duration histograms, latency metrics
  • Comparison — Model vs vendor benchmarking

Data Flow

  1. Application imports kalibr SDK
  2. SDK auto-instruments OpenAI/Anthropic/Google SDKs
  3. Each LLM call creates an OpenTelemetry span
  4. Spans are written to /tmp/kalibr_otel_spans.jsonl
  5. OTel Bridge reads the JSONL file once per second
  6. Bridge batch-inserts spans into ClickHouse
  7. Backend API queries ClickHouse for traces/metrics
  8. Dashboard polls API every 5 seconds for updates
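
Steps 7–8 amount to a simple polling client. A sketch, assuming a hypothetical spans endpoint under the /api/otel/* prefix and a JSON-array response:

    import time

    import requests

    URL = "http://localhost:8001/api/otel/spans"  # path is an assumption

    while True:
        resp = requests.get(URL, params={"limit": 50}, timeout=10)
        resp.raise_for_status()
        spans = resp.json()  # assumed to be a JSON array of spans
        print(f"fetched {len(spans)} spans")
        time.sleep(5)  # matches the dashboard's 5-second poll interval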

Deployment

Development (Docker Compose)

docker compose up -d

Minimum Requirements:

  • CPU: 2–4 cores
  • RAM: 4 GB
  • Disk: 10 GB

Production (Kubernetes)

Kalibr Enterprise provides a fully managed Kubernetes deployment with:

  • Backend API (3 replicas, load-balanced)
  • OTel Bridge (1 replica, high priority)
  • ClickHouse (3 shards, replicated)
  • MongoDB (3 replicas)
  • Ingress routing for /api/* → backend, /* → frontend

Security

  • In Transit: HTTPS/TLS for all APIs, mTLS between internal services
  • At Rest: ClickHouse + Mongo encrypted storage
  • Auth: API keys (X-API-Key) for production, JWT dashboard auth (planned)
  • Isolation: Tenant isolation enforced at query level
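
In practice, an authenticated production request looks like the sketch below; the host and endpoint path are illustrative, while X-API-Key is the header named above:

    import os

    import requests

    resp = requests.get(
        "https://kalibr.example.com/api/otel/spans",  # host/path illustrative
        headers={"X-API-Key": os.environ["KALIBR_API_KEY"]},
        timeout=10,
    )
    resp.raise_for_status()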