April 2026

Thompson Sampling for LLM Routing: Why Your Model Selection Should Be Probabilistic

Every routing decision you hardcode is based on your intuition at one point in time. Thompson Sampling lets the outcomes decide, and it gets better with every call. Here's how to build it from scratch for LLM model selection.

Why Static Routing Fails

When you write:

if task_type == "synthesis":
    model = "gpt-4o"
else:
    model = "gpt-4o-mini"

You're making a bet that your classification will hold, that model performance relationships are stable, and that the routing logic doesn't need to evolve after you deploy. None of those bets age well.

The alternatives most people reach for aren't much better.

Thompson Sampling is different. It's a multi-armed bandit algorithm that continuously balances exploration (trying models to gather data) with exploitation (sending traffic to what's working). It updates after every call. It doesn't require a separate analysis phase. And you can implement it in about 50 lines of Python.


What Thompson Sampling Is

Plain English first.

Imagine you have three LLMs. You don't know which performs best for your specific tasks. You have some prior belief, but it's uncertain. Thompson Sampling represents that uncertainty explicitly as a probability distribution over each model's "true" success rate.

At each routing decision, it:

  1. Draws a sample from each model's distribution, a random guess at that model's success rate
  2. Routes to the model whose sample is highest
  3. Observes the outcome (success or failure)
  4. Updates that model's distribution based on what happened

Over time, the distributions get tighter around each model's true performance. Good models accumulate probability mass at high values. Bad models get narrowed toward low values. But the algorithm never stops exploring entirely: there's always some chance of sampling a high value from an underperforming model and giving it another shot.
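
Here's that loop as a few lines of numpy, with made-up belief states standing in for real traffic:

import numpy as np

rng = np.random.default_rng(0)

# illustrative beliefs: (alpha, beta) counts for three hypothetical models
arms = {"model-a": (9, 3), "model-b": (4, 8), "model-c": (2, 2)}

# 1. draw one sample per model
samples = {name: rng.beta(a, b) for name, (a, b) in arms.items()}

# 2. route to the highest sample
choice = max(samples, key=samples.get)

# 3-4. observe the outcome and update that model's counts
succeeded = True  # pretend the call worked
a, b = arms[choice]
arms[choice] = (a + 1, b) if succeeded else (a, b + 1)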

Now the math.

Thompson Sampling for binary outcomes (success/failure) uses the Beta distribution. Starting from a Beta(1, 1) prior, a model with s observed successes and f observed failures has the posterior Beta(1 + s, 1 + f). In the code below, alpha and beta track these two parameters directly.

The Beta distribution is defined on [0, 1] and represents your belief about a probability. Beta(1, 1) is a flat prior: uniform belief that the success rate could be anything. As you observe outcomes, each success increments alpha and each failure increments beta.

The mean of Beta(alpha, beta) is alpha / (alpha + beta), your best estimate of the success rate. But the distribution also captures uncertainty: with few observations, it's wide and samples are variable. With many observations, it's narrow and samples cluster around the true rate.

This is why Thompson Sampling naturally handles the exploration/exploitation tradeoff. Early on, all distributions are wide and uncertain. You explore more because any model might sample high. Later, the distributions tighten around true performance. You exploit the winner more consistently, but occasionally sample high from alternatives and check if anything changed.
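
You can watch the variance shrink directly. Both distributions below have the same mean (about 0.75), but the one backed by 100 observations produces far steadier samples; the counts are illustrative:

import numpy as np

rng = np.random.default_rng(0)

# same ~0.75 estimated success rate, very different certainty
early = rng.beta(3, 1, size=10_000)    # 2 successes, 0 failures on a Beta(1, 1) prior
late = rng.beta(76, 26, size=10_000)   # 75 successes, 25 failures

print(f"early: std={early.std():.3f}")  # roughly 0.19: samples spread across [0, 1]
print(f"late:  std={late.std():.3f}")   # roughly 0.04: samples cluster near 0.75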


From-Scratch Implementation

import json
from dataclasses import dataclass

import numpy as np

@dataclass
class ModelArm:
    """Represents one LLM as a bandit arm with Beta distribution."""
    name: str
    alpha: float = 1.0  # prior pseudo-count of successes; Beta(1, 1) is a flat prior
    beta: float = 1.0   # prior pseudo-count of failures; Beta(1, 1) is a flat prior
    
    def sample(self) -> float:
        """Draw a sample from Beta(alpha, beta). Higher = more likely to be selected."""
        return np.random.beta(self.alpha, self.beta)
    
    def record_success(self):
        self.alpha += 1
    
    def record_failure(self):
        self.beta += 1
    
    @property
    def estimated_success_rate(self) -> float:
        return self.alpha / (self.alpha + self.beta)
    
    @property
    def total_observations(self) -> int:
        # subtract the prior (1, 1) to get real observation count
        return int(self.alpha + self.beta - 2)
    
    @property
    def uncertainty(self) -> float:
        """Variance of the Beta distribution, higher means less certain."""
        a, b = self.alpha, self.beta
        return (a * b) / ((a + b) ** 2 * (a + b + 1))


class ThompsonRouter:
    """
    Multi-armed bandit router for LLM model selection.
    Uses Thompson Sampling with Beta-Bernoulli conjugate model.
    """
    
    def __init__(self, models: list[str]):
        self.arms: dict[str, ModelArm] = {
            name: ModelArm(name=name) for name in models
        }
    
    def select_model(self) -> str:
        """Sample from each arm's distribution and select the highest."""
        samples = {name: arm.sample() for name, arm in self.arms.items()}
        return max(samples, key=samples.get)
    
    def record_outcome(self, model: str, success: bool):
        """Update the selected model's distribution based on outcome."""
        if model not in self.arms:
            raise ValueError(f"Unknown model: {model}")
        
        if success:
            self.arms[model].record_success()
        else:
            self.arms[model].record_failure()
    
    def get_stats(self) -> dict:
        return {
            name: {
                "estimated_success_rate": round(arm.estimated_success_rate, 4),
                "observations": arm.total_observations,
                "uncertainty": round(arm.uncertainty, 6),
                "alpha": arm.alpha,
                "beta": arm.beta
            }
            for name, arm in self.arms.items()
        }
    
    def save_state(self, path: str):
        state = {
            name: {"alpha": arm.alpha, "beta": arm.beta}
            for name, arm in self.arms.items()
        }
        with open(path, "w") as f:
            json.dump(state, f)
    
    def load_state(self, path: str):
        with open(path) as f:
            state = json.load(f)
        for name, params in state.items():
            if name in self.arms:
                self.arms[name].alpha = params["alpha"]
                self.arms[name].beta = params["beta"]

Wire it into an LLM caller:

import openai
import time
from typing import Callable, Optional

client = openai.OpenAI()

def call_with_thompson_routing(
    prompt: str,
    router: ThompsonRouter,
    success_fn: Optional[Callable[[str, float], bool]] = None  # optional: custom success evaluation
) -> tuple[str, str, bool]:
    """
    Make an LLM call with Thompson Sampling for model selection.
    Returns (response_text, model_used, was_success)
    """
    model = router.select_model()
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            timeout=30
        )
        latency_ms = (time.time() - start) * 1000
        content = response.choices[0].message.content or ""  # content can be None
        
        # default success definition: got a response in reasonable time
        if success_fn is not None:
            success = success_fn(content, latency_ms)
        else:
            success = latency_ms < 10000 and len(content) > 0
        
        router.record_outcome(model, success)
        return content, model, success
        
    except Exception:
        # a timeout or API error counts as a failure for this arm
        router.record_outcome(model, False)
        raise


# example: 3-model router
router = ThompsonRouter(["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"])

# simulate traffic and watch routing evolve
for i in range(20):
    response, model, success = call_with_thompson_routing(
        prompt="Summarize this paragraph in one sentence: ...",
        router=router
    )
    print(f"Call {i+1}: routed to {model} | success={success}")

# inspect learned distributions
import json
print(json.dumps(router.get_stats(), indent=2))

Simulating the Algorithm to Build Intuition

You don't need to make real API calls to see how Thompson Sampling behaves. This simulation shows convergence in action:

import numpy as np

def simulate_thompson_routing(
    true_success_rates: dict[str, float],
    n_trials: int = 500,
    seed: int = 42
) -> dict:
    """
    Simulate Thompson Sampling routing without actual LLM calls.
    true_success_rates: ground truth success rates per model (unknown to algorithm)
    """
    np.random.seed(seed)
    router = ThompsonRouter(list(true_success_rates.keys()))
    
    selection_counts = {m: 0 for m in true_success_rates}
    
    for trial in range(n_trials):
        model = router.select_model()
        selection_counts[model] += 1
        
        # simulate outcome based on true success rate
        success = np.random.random() < true_success_rates[model]
        router.record_outcome(model, success)
    
    # compute final selection fractions
    total = sum(selection_counts.values())
    selection_fracs = {m: c/total for m, c in selection_counts.items()}
    
    print(f"\nAfter {n_trials} trials:")
    print(f"True success rates: {true_success_rates}")
    print(f"Selection fractions: {selection_fracs}")
    print(f"Learned estimates: {router.get_stats()}")
    
    return router

# scenario: gpt-4o outperforms mini on complex tasks
router = simulate_thompson_routing({
    "gpt-4o": 0.85,
    "gpt-4o-mini": 0.65,
    "gpt-4-turbo": 0.78
})

Run this and you'll see selection fractions converge toward the best model while maintaining exploration. With true rates of 0.85, 0.65, and 0.78, Thompson Sampling will route roughly 70-80% of traffic to gpt-4o after 500 trials, but keep trying the others enough to detect if their rates change.

That last point matters: if gpt-4o-mini's success rate jumps from 0.65 to 0.90 (maybe after an OpenAI update), Thompson Sampling will notice the new successes and start reallocating traffic, though the more history an arm has accumulated, the slower the shift. A hardcoded router never adapts at all.
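
If you expect that kind of drift, one common variant is discounted Thompson Sampling: shrink both counts toward the prior on every update so old evidence gradually fades and the distributions never fully collapse. A minimal sketch on top of the ModelArm class above; the 0.99 decay factor is an arbitrary assumption you would tune for your traffic volume:

from dataclasses import dataclass

@dataclass
class DecayedModelArm(ModelArm):
    """ModelArm that forgets old evidence, for non-stationary success rates."""
    decay: float = 0.99  # per-update discount; closer to 1.0 means longer memory

    def _decay_counts(self):
        # shrink both counts toward the Beta(1, 1) prior before adding new evidence
        self.alpha = 1.0 + (self.alpha - 1.0) * self.decay
        self.beta = 1.0 + (self.beta - 1.0) * self.decay

    def record_success(self):
        self._decay_counts()
        super().record_success()

    def record_failure(self):
        self._decay_counts()
        super().record_failure()

To use it, have ThompsonRouter construct DecayedModelArm instead of ModelArm, or take the arm class as a constructor argument.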


Defining "Success" for LLM Routing

The algorithm is only as good as your success signal. Some options:

# latency-based: penalize slow responses
def latency_success(content: str, latency_ms: float) -> bool:
    return latency_ms < 5000 and len(content.strip()) > 10

# output quality: check for expected structure
def structured_output_success(content: str, latency_ms: float) -> bool:
    import json
    try:
        parsed = json.loads(content)
        return isinstance(parsed, dict) and len(parsed) > 0
    except json.JSONDecodeError:
        return False

# downstream task outcome: did the agent actually succeed?
# this is the gold standard but requires more instrumentation
def task_outcome_success(content: str, latency_ms: float) -> bool:
    # called with result of downstream task, not just LLM response
    # e.g., did the search query this model generated return results?
    # did the code this model wrote pass tests?
    pass  # wire to your actual task outcome

Downstream task outcome is the best signal. If your agent extracts a search query and then runs a search, record whether that search returned useful results, and attribute it back to the model that generated the query. That's the outcome that actually matters.
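
Here's a sketch of that attribution loop; generate_search_query, run_search, and results_are_useful are hypothetical stand-ins for your own LLM call, search step, and relevance check:

# hypothetical end-to-end attribution: the arm that generated the query
# gets credit (or blame) for the downstream search outcome
model = router.select_model()
query = generate_search_query(prompt, model=model)   # your LLM call (hypothetical helper)

results = run_search(query)                          # downstream step (hypothetical)
success = results_are_useful(results)                # relevance check (hypothetical)

# attribute the outcome that actually matters back to the routing decision
router.record_outcome(model, success)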


Kalibr's Production Version

The implementation above will get you very far. What it doesn't have: automatic persistence across process restarts (you have to call save_state and load_state yourself), multi-tenant outcome pooling (so your router benefits from outcomes across your whole fleet, not just your single instance), and automatic exploration rate tuning as distributions converge.

import kalibr
import openai
import time

kalibr.init()
client = openai.OpenAI()

prompt = "Summarize this paragraph in one sentence: ..."

# kalibr runs Thompson Sampling for you, with persistent state 
# and outcome data pooled across your deployment
policy = kalibr.get_policy(task_context={
    "task_type": "synthesis",
    "quality_priority": 0.8
})

start = time.time()
response = client.chat.completions.create(
    model=policy.recommended_model,
    messages=[{"role": "user", "content": prompt}]
)
response_latency = (time.time() - start) * 1000  # latency in ms, reported below

# report outcome back to update the distributions
kalibr.record_outcome(
    policy_id=policy.id,
    success=True,
    latency_ms=response_latency
)

The policy.recommended_model field is Thompson Sampling's output for your task context. The record_outcome call feeds back into the Beta distribution update. Every call makes the next routing decision slightly better.


Key Points

Thompson Sampling for LLM routing is not a complicated idea:

  1. Represent each model's performance as a Beta distribution
  2. Sample from each distribution at decision time
  3. Route to the highest sample
  4. Update based on outcome

The algorithm's power comes from two properties: it naturally handles exploration/exploitation without any tuning, and it continuously adapts to changing performance without requiring you to notice and redeploy. That's exactly what you want from a model selection layer in a production agent system.

The hardcoded routing decisions you make today will be wrong eventually. Thompson Sampling is wrong too, but it corrects itself.

Kalibr keeps complex AI agents running without human intervention.

Get started free