Thompson Sampling for LLM Routing: Why Your Model Selection Should Be Probabilistic
Every routing decision you hardcode is based on your intuition at one point in time. Thompson Sampling lets the outcomes decide, and it gets better with every call. Here's how to build it from scratch for LLM model selection.
Why Static Routing Fails
When you write:
```python
if task_type == "synthesis":
    model = "gpt-4o"
else:
    model = "gpt-4o-mini"
```
You're making a bet that your classification will hold, that model performance relationships are stable, and that the routing logic doesn't need to evolve after you deploy. None of those bets age well.
The alternatives most people reach for aren't much better:
- Round-robin: Distributes load but ignores performance. Sends equal traffic to a model that's returning worse results.
- A/B testing: Requires you to define a test, run it, analyze it, redeploy. You're running a static experiment, not adaptive routing.
- Rule-based routing: Better than nothing, but you're still encoding static assumptions.
Thompson Sampling is different. It's a multi-armed bandit algorithm that continuously balances exploration (trying models to gather data) with exploitation (sending traffic to what's working). It updates after every call. It doesn't require a separate analysis phase. And you can implement it in well under a hundred lines of Python.
What Thompson Sampling Is
Plain English first.
Imagine you have three LLMs. You don't know which performs best for your specific tasks. You have some prior belief, but it's uncertain. Thompson Sampling represents that uncertainty explicitly as a probability distribution over each model's "true" success rate.
At each routing decision, it:
- Draws a sample from each model's distribution, a random guess at that model's success rate
- Routes to the model whose sample is highest
- Observes the outcome (success or failure)
- Updates that model's distribution based on what happened
Over time, the distributions get tighter around each model's true performance. Good models accumulate probability mass at high values. Bad models get narrowed toward low values. But the algorithm never stops exploring entirely: there's always some chance of sampling a high value from an underperforming model and giving it another shot.
Now the math.
Thompson Sampling for binary outcomes (success/failure) uses the Beta distribution. For a model with alpha successes and beta failures, the distribution is Beta(alpha, beta).
The Beta distribution is defined on [0, 1] and represents your belief about a probability. Beta(1, 1) is a flat prior: uniform belief that the success rate could be anything. As you observe outcomes:
- Each success: alpha += 1 (distribution shifts toward higher values)
- Each failure: beta += 1 (distribution shifts toward lower values)
The mean of Beta(alpha, beta) is alpha / (alpha + beta), which is your best estimate of the success rate. But the distribution also captures uncertainty: with few observations it's wide and samples are variable; with many observations it's narrow and samples cluster around the true rate.
This is why Thompson Sampling naturally handles the exploration/exploitation tradeoff. Early on, all distributions are wide and uncertain. You explore more because any model might sample high. Later, the distributions tighten around true performance. You exploit the winner more consistently, but occasionally sample high from alternatives and check if anything changed.
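You can make that concrete with a quick standalone NumPy check (illustrative only, not part of the router implementation below): two Beta distributions with the same 0.8 mean but different amounts of evidence behave very differently at sampling time.

```python
import numpy as np

rng = np.random.default_rng(0)

# same estimated success rate (0.8), very different amounts of evidence
early = rng.beta(4, 1, size=10_000)     # 3 successes, 0 failures on a Beta(1, 1) prior
late = rng.beta(400, 100, size=10_000)  # hundreds of observations

print(f"early: mean={early.mean():.3f} std={early.std():.3f}")
print(f"late:  mean={late.mean():.3f} std={late.std():.3f}")
# the wide early distribution often samples far above or below 0.8,
# which is exactly what keeps uncertain arms in the running
```

The under-observed arm wins some routing decisions purely through its spread; as evidence accumulates, its spread collapses and it only wins if its mean genuinely competes.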
From-Scratch Implementation
```python
import json
from dataclasses import dataclass

import numpy as np


@dataclass
class ModelArm:
    """Represents one LLM as a bandit arm with a Beta distribution."""
    name: str
    alpha: float = 1.0  # prior pseudo-count of successes; Beta(1, 1) is a uniform prior
    beta: float = 1.0   # prior pseudo-count of failures

    def sample(self) -> float:
        """Draw a sample from Beta(alpha, beta). Higher = more likely to be selected."""
        return np.random.beta(self.alpha, self.beta)

    def record_success(self) -> None:
        self.alpha += 1

    def record_failure(self) -> None:
        self.beta += 1

    @property
    def estimated_success_rate(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def total_observations(self) -> int:
        # subtract the Beta(1, 1) prior to get the real observation count
        return int(self.alpha + self.beta - 2)

    @property
    def uncertainty(self) -> float:
        """Variance of the Beta distribution; higher means less certain."""
        a, b = self.alpha, self.beta
        return (a * b) / ((a + b) ** 2 * (a + b + 1))


class ThompsonRouter:
    """
    Multi-armed bandit router for LLM model selection.
    Uses Thompson Sampling with a Beta-Bernoulli conjugate model.
    """

    def __init__(self, models: list[str]):
        self.arms: dict[str, ModelArm] = {
            name: ModelArm(name=name) for name in models
        }

    def select_model(self) -> str:
        """Sample from each arm's distribution and select the highest."""
        samples = {name: arm.sample() for name, arm in self.arms.items()}
        return max(samples, key=samples.get)

    def record_outcome(self, model: str, success: bool) -> None:
        """Update the selected model's distribution based on the outcome."""
        if model not in self.arms:
            raise ValueError(f"Unknown model: {model}")
        if success:
            self.arms[model].record_success()
        else:
            self.arms[model].record_failure()

    def get_stats(self) -> dict:
        return {
            name: {
                "estimated_success_rate": round(arm.estimated_success_rate, 4),
                "observations": arm.total_observations,
                "uncertainty": round(arm.uncertainty, 6),
                "alpha": arm.alpha,
                "beta": arm.beta,
            }
            for name, arm in self.arms.items()
        }

    def save_state(self, path: str) -> None:
        state = {
            name: {"alpha": arm.alpha, "beta": arm.beta}
            for name, arm in self.arms.items()
        }
        with open(path, "w") as f:
            json.dump(state, f)

    def load_state(self, path: str) -> None:
        with open(path) as f:
            state = json.load(f)
        for name, params in state.items():
            if name in self.arms:
                self.arms[name].alpha = params["alpha"]
                self.arms[name].beta = params["beta"]
```
Wire it into an LLM caller:
```python
import json
import time

import openai

client = openai.OpenAI()


def call_with_thompson_routing(
    prompt: str,
    router: ThompsonRouter,
    success_fn=None,  # optional: custom success evaluation
) -> tuple[str, str, bool]:
    """
    Make an LLM call with Thompson Sampling for model selection.
    Returns (response_text, model_used, was_success).
    """
    model = router.select_model()
    start = time.time()
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            timeout=30,
        )
        latency_ms = (time.time() - start) * 1000
        content = response.choices[0].message.content or ""
        # default success definition: a non-empty response in reasonable time
        if success_fn is not None:
            success = success_fn(content, latency_ms)
        else:
            success = latency_ms < 10000 and len(content) > 0
        router.record_outcome(model, success)
        return content, model, success
    except Exception:
        router.record_outcome(model, False)
        raise


# example: 3-model router
router = ThompsonRouter(["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"])

# simulate traffic and watch routing evolve
for i in range(20):
    response, model, success = call_with_thompson_routing(
        prompt="Summarize this paragraph in one sentence: ...",
        router=router,
    )
    print(f"Call {i+1}: routed to {model} | success={success}")

# inspect learned distributions
print(json.dumps(router.get_stats(), indent=2))
```
Simulating the Algorithm to Build Intuition
You don't need to make real API calls to see how Thompson Sampling behaves. This simulation shows convergence in action:
```python
import numpy as np


def simulate_thompson_routing(
    true_success_rates: dict[str, float],
    n_trials: int = 500,
    seed: int = 42,
) -> ThompsonRouter:
    """
    Simulate Thompson Sampling routing without actual LLM calls.

    true_success_rates: ground-truth success rate per model
    (unknown to the algorithm).
    """
    np.random.seed(seed)
    router = ThompsonRouter(list(true_success_rates.keys()))
    selection_counts = {m: 0 for m in true_success_rates}

    for _ in range(n_trials):
        model = router.select_model()
        selection_counts[model] += 1
        # simulate an outcome based on the true success rate
        success = np.random.random() < true_success_rates[model]
        router.record_outcome(model, success)

    # compute final selection fractions
    total = sum(selection_counts.values())
    selection_fracs = {m: c / total for m, c in selection_counts.items()}
    print(f"\nAfter {n_trials} trials:")
    print(f"True success rates: {true_success_rates}")
    print(f"Selection fractions: {selection_fracs}")
    print(f"Learned estimates: {router.get_stats()}")
    return router


# scenario: gpt-4o outperforms mini on complex tasks
router = simulate_thompson_routing({
    "gpt-4o": 0.85,
    "gpt-4o-mini": 0.65,
    "gpt-4-turbo": 0.78,
})
```
Run this and you'll see selection fractions converge toward the best model while maintaining exploration. With true rates of 0.85, 0.65, and 0.78, Thompson Sampling will route roughly 70-80% of traffic to gpt-4o after 500 trials, but keep trying the others enough to detect if their rates change.
That last point matters: if gpt-4o-mini's success rate jumps from 0.65 to 0.90 (say, after an OpenAI update), Thompson Sampling will notice as new outcomes accumulate and reallocate traffic, typically within a few hundred calls if the arm hasn't amassed too much history. A hardcoded router never would.
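One caveat worth knowing: a plain Beta-Bernoulli update never forgets, so an arm with thousands of accumulated observations adapts slowly when its true rate shifts. A common remedy is discounted Thompson Sampling, which decays old counts toward the prior so recent outcomes dominate. A minimal sketch (the `discounted_update` helper and its `gamma` parameter are illustrative, not part of the router above):

```python
import random


def discounted_update(alpha: float, beta: float, success: bool,
                      gamma: float = 0.99) -> tuple[float, float]:
    """Decay old evidence toward the Beta(1, 1) prior, then add the new
    observation. Effective memory is roughly 1 / (1 - gamma) calls."""
    alpha = 1.0 + gamma * (alpha - 1.0)
    beta = 1.0 + gamma * (beta - 1.0)
    if success:
        alpha += 1.0
    else:
        beta += 1.0
    return alpha, beta


# an arm with heavy history at a ~65% success rate...
alpha, beta = 651.0, 351.0

# ...whose true success rate suddenly jumps to 90%
random.seed(0)
for _ in range(500):
    alpha, beta = discounted_update(alpha, beta, random.random() < 0.9)

# the estimate is now dominated by recent outcomes
print(f"estimated rate after shift: {alpha / (alpha + beta):.2f}")
```

The trade-off: a smaller `gamma` adapts faster but keeps the distributions wider, so you pay for responsiveness with extra exploration.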
Defining "Success" for LLM Routing
The algorithm is only as good as your success signal. Some options:
```python
# latency-based: penalize slow responses
def latency_success(content: str, latency_ms: float) -> bool:
    return latency_ms < 5000 and len(content.strip()) > 10


# output quality: check for expected structure
def structured_output_success(content: str, latency_ms: float) -> bool:
    import json
    try:
        parsed = json.loads(content)
        return isinstance(parsed, dict) and len(parsed) > 0
    except json.JSONDecodeError:
        return False


# downstream task outcome: did the agent actually succeed?
# this is the gold standard but requires more instrumentation
def task_outcome_success(content: str, latency_ms: float) -> bool:
    # called with the result of the downstream task, not just the LLM response
    # e.g., did the search query this model generated return results?
    # did the code this model wrote pass tests?
    pass  # wire to your actual task outcome
```
Downstream task outcome is the best signal. If your agent extracts a search query and then runs a search, record whether that search returned useful results, and attribute it back to the model that generated the query. That's the outcome that actually matters.
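Wiring that up means the success signal arrives after the LLM call returns, so you need to remember which model produced which artifact and close the loop later. A minimal sketch of deferred attribution (the `DeferredOutcomes` class and call IDs are illustrative, not part of the router above):

```python
import uuid
from typing import Callable


class DeferredOutcomes:
    """Track which model produced each artifact so a later downstream
    signal (search returned results, tests passed) can be attributed
    back to the right arm."""

    def __init__(self) -> None:
        self.pending: dict[str, str] = {}  # call_id -> model name

    def register(self, model: str) -> str:
        """Call right after routing; store the returned id with the artifact."""
        call_id = str(uuid.uuid4())
        self.pending[call_id] = model
        return call_id

    def resolve(self, call_id: str, success: bool,
                record: Callable[[str, bool], None]) -> None:
        """Call once the downstream result is known."""
        model = self.pending.pop(call_id)
        record(model, success)


# usage: pass ThompsonRouter.record_outcome as the record callback
outcomes = DeferredOutcomes()
call_id = outcomes.register("gpt-4o-mini")
# ... later, after running the search the model's query produced ...
search_results = ["hit-1", "hit-2"]  # stand-in for real results
outcomes.resolve(call_id, success=len(search_results) > 0,
                 record=lambda model, ok: print(f"{model}: success={ok}"))
```

In production you'd persist the pending map (and expire entries that never resolve) so outcomes survive restarts, but the attribution logic stays this simple.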
Kalibr's Production Version
The implementation above will get you very far. What it doesn't have: persistence across process restarts, multi-tenant outcome pooling (so your router benefits from outcomes across your whole fleet, not just your single instance), and automatic exploration rate tuning as distributions converge.
```python
import kalibr
import openai

kalibr.init()
client = openai.OpenAI()

# kalibr runs Thompson Sampling for you, with persistent state
# and outcome data pooled across your deployment
policy = kalibr.get_policy(task_context={
    "task_type": "synthesis",
    "quality_priority": 0.8,
})

response = client.chat.completions.create(
    model=policy.recommended_model,
    messages=[{"role": "user", "content": prompt}],
)

# report the outcome back to update the distributions
kalibr.record_outcome(
    policy_id=policy.id,
    success=True,
    latency_ms=response_latency,
)
```
The policy.recommended_model field is Thompson Sampling's output for your task context. The record_outcome call feeds back into the Beta distribution update. Every call makes the next routing decision slightly better.
Key Points
Thompson Sampling for LLM routing is not a complicated idea:
- Represent each model's performance as a Beta distribution
- Sample from each distribution at decision time
- Route to the highest sample
- Update based on outcome
The algorithm's power comes from two properties: it naturally handles exploration/exploitation without any tuning, and it continuously adapts to changing performance without requiring you to notice and redeploy. That's exactly what you want from a model selection layer in a production agent system.
The hardcoded routing decisions you make today will be wrong eventually. Thompson Sampling is wrong too, but it corrects itself.
Kalibr keeps complex AI agents running without human intervention.