How to Stop Using GPT-4o for Every Request (Without Writing Routing Rules)
Every request hitting your expensive model is a choice you made once and never revisited. You set model="gpt-4o", it worked, you shipped, and now that decision is baked into code that no one wants to touch. Meanwhile, a meaningful fraction of your traffic is classification, extraction, or yes/no tasks that a $0.15/1M token model handles identically.
This post explains how to stop using gpt-4o for every request, automatically, without maintaining routing rules.
The Problem with Static Rules
The obvious fix is to add routing logic:
def get_model(task_type: str) -> str:
    if task_type in ("classify", "extract", "label"):
        return "gpt-4o-mini"
    return "gpt-4o"

response = client.chat.completions.create(
    model=get_model(task_type),
    messages=[{"role": "user", "content": prompt}]
)
This works on day one. By month three, you have four problems:
Problem 1: Rules are written at deploy time. You define what counts as "simple" based on what you know today. Traffic patterns shift. New task types emerge. The rules do not update themselves.
Problem 2: Rules are binary. A task either hits mini or gpt-4o. There is no exploration. If mini is performing well on tasks you assumed needed gpt-4o, you never find out.
Problem 3: Rule maintenance is invisible work. No one owns the routing rules. They accumulate as conditions, exceptions, and edge cases. Six months later, no one wants to change them.
Problem 4: Rules cannot learn from outcomes. The rule does not know whether the response was good. It just maps task type to model. That mapping is never validated against actual quality.
What Outcome-Based Routing Looks Like
Instead of rules, outcome-based routing works like this:
- You define what counts as success (response quality, user action, downstream metric)
- The router tries different models, observing which ones succeed on which requests
- The router shifts allocation toward models that work, away from models that fail
- The system self-corrects as traffic changes
The algorithm underneath this is Thompson Sampling, a Bayesian bandit approach that maintains a probability distribution over each model's success rate and samples from it when making routing decisions. More on that below.
Before: Static Routing
Here is a typical static routing setup:
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

SIMPLE_TASK_TYPES = {"classify", "extract", "label", "yes_no", "summarize_short"}

def process_request(prompt: str, task_type: str) -> str:
    # Written once. Never updated. Trust the original developer's intuition.
    if task_type in SIMPLE_TASK_TYPES:
        model = "gpt-4o-mini"
    else:
        model = "gpt-4o"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    return response.choices[0].message.content

# Works. Does not learn. Goes stale.
result = process_request(
    "Is this review positive or negative? 'The product works but shipping was slow.'",
    task_type="classify"
)
This is fine code. The issue is not the code. It is the assumption that the routing logic is correct and will stay correct. It probably will not.
After: Outcome-Based Routing with Kalibr
Kalibr replaces the static routing logic with a Thompson Sampling router. You define the models and a success condition. The router observes outcomes and adjusts.
import kalibr  # Must be the first import
from kalibr import Router

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini"},
        {"model": "openai/gpt-4o"},
    ],
    success_when="len(response.choices[0].message.content.strip()) > 0 and response.choices[0].finish_reason == 'stop'",
    goal_id="sentiment_classification"
)

def process_request(prompt: str) -> str:
    # No task_type needed. No routing rules. The router decides.
    response = router.completion(
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

result = process_request(
    "Is this review positive or negative? 'The product works but shipping was slow.'"
)
The success_when string is evaluated against the response. If the response is a non-empty completion, it counts as a success. Over time, the router learns which model succeeds more often on which type of content.
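Conceptually, evaluating a success_when string amounts to running a Python expression with the response object in scope. The sketch below is purely illustrative, not Kalibr's actual evaluation code, and the helper name is made up:

# Illustrative only: roughly what evaluating a success_when string amounts to.
# evaluate_success is a made-up helper, not part of Kalibr's API.
def evaluate_success(success_when: str, response) -> bool:
    # Expose only the response object (and len) to the expression.
    scope = {"__builtins__": {}, "response": response, "len": len}
    return bool(eval(success_when, scope))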
For more specific quality signals, report outcomes explicitly:
from kalibr import Router, Outcome

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini", "weight": 0.7},
        {"model": "openai/gpt-4o", "weight": 0.3},
    ],
    goal_id="support_ticket_triage"
)

def triage_ticket(ticket_text: str) -> dict:
    response, request_id = router.completion(
        messages=[
            {"role": "system", "content": "Classify this support ticket as: billing, technical, account, or other. One word."},
            {"role": "user", "content": ticket_text}
        ],
        return_request_id=True
    )
    label = response.choices[0].message.content.strip().lower()

    # Validate output quality: did the model return one of the allowed labels?
    valid_labels = {"billing", "technical", "account", "other"}
    success = label in valid_labels

    router.report_outcome(
        request_id=request_id,
        outcome=Outcome.SUCCESS if success else Outcome.FAILURE
    )
    return {"label": label, "request_id": request_id, "valid": success}

result = triage_ticket("I was charged twice this month for my subscription.")
print(result)
Here the success condition is domain-specific: did the model return a valid label? If gpt-4o-mini consistently returns valid labels, the router shifts more traffic to it. If it starts failing on certain ticket types, the router shifts back toward gpt-4o.
How Thompson Sampling Works (in Plain Terms)
Thompson Sampling maintains a Beta distribution for each model. A Beta distribution has two parameters: alpha (successes) and beta (failures). At the start, every model gets the same prior, say alpha=1 and beta=1, which is the uniform distribution: the router has no opinion yet about any model's success rate.
Every time the router makes a call and you report the outcome, it updates the chosen model's alpha (on success) or beta (on failure). When choosing which model to use next, the router samples a random number from each model's Beta distribution and picks the model whose sample is higher.
The effect: models with more successes tend to sample higher, so they get more traffic. But because this is probabilistic sampling, not a hard cutoff, the losing model still gets occasional traffic. This is exploration: the router keeps checking whether the underdog has improved, and whether the current winner is still winning on new traffic patterns.
The practical consequence: Thompson Sampling routes most traffic to the best-performing model while continuously verifying that it is still the best. When your traffic mix changes, say, a new integration sends a batch of complex synthesis tasks, the router detects the performance shift and rebalances. No one has to update a config file.
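The mechanics fit in a few lines of Python. The sketch below is a minimal, self-contained version of the algorithm, not Kalibr's implementation; the class and method names are made up for illustration:

import random

# Minimal Thompson Sampling sketch. Illustrative only, not Kalibr's code.
class ThompsonRouter:
    def __init__(self, models):
        # Every model starts at the uniform prior: alpha=1, beta=1.
        self.stats = {m: {"alpha": 1, "beta": 1} for m in models}

    def choose(self) -> str:
        # Draw one sample from each model's Beta distribution and
        # route to the model whose sample came out highest.
        samples = {
            m: random.betavariate(s["alpha"], s["beta"])
            for m, s in self.stats.items()
        }
        return max(samples, key=samples.get)

    def report(self, model: str, success: bool) -> None:
        # Successes bump alpha, failures bump beta; future draws shift accordingly.
        if success:
            self.stats[model]["alpha"] += 1
        else:
            self.stats[model]["beta"] += 1

router = ThompsonRouter(["gpt-4o-mini", "gpt-4o"])
model = router.choose()              # probabilistic pick, not a fixed rule
router.report(model, success=True)   # reported outcomes reshape future picks

Because choose() samples rather than taking the model with the best average, a model with a worse observed record still wins occasionally, which is exactly the exploration described above.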
When to Use Each Approach
If your task types are fixed and simple, static rules work. They require no ongoing infrastructure and are easy to reason about. The risk is staleness, but for narrow use cases that risk is low.
If you have multiple task types, evolving traffic, or care about long-term cost efficiency, outcome-based routing is the right default. You set it up once and it maintains itself. The only ongoing work is defining what counts as success, which is domain knowledge you already have.
The routing rules approach and the outcome-based approach are not mutually exclusive. You can seed the router with weights that reflect your existing rules, then let Thompson Sampling refine from there:
router = Router(
    paths=[
        # Reflect prior belief that mini handles ~70% of workload
        {"model": "openai/gpt-4o-mini", "weight": 0.7},
        {"model": "openai/gpt-4o", "weight": 0.3},
    ],
    goal_id="mixed_workload",
    success_when="response.choices[0].finish_reason == 'stop'"
)
The weights are initial priors. Thompson Sampling updates them from there.
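One natural way to read those weights is as pseudo-observations that shape each model's starting Beta distribution. The mapping below is an assumption for illustration, not Kalibr's documented behavior:

# Hypothetical: weights read as pseudo-observations seeding each Beta prior.
# The constant and the mapping are assumptions, not Kalibr's documented behavior.
PSEUDO_OBSERVATIONS = 10

def seed_prior(weight: float) -> dict:
    # weight=0.7 yields alpha=8, beta=4 on top of the uniform prior's 1+1:
    # a head start for mini, not a lock-in, since real outcomes keep accumulating.
    return {
        "alpha": 1 + weight * PSEUDO_OBSERVATIONS,
        "beta": 1 + (1 - weight) * PSEUDO_OBSERVATIONS,
    }

print(seed_prior(0.7))  # {'alpha': 8.0, 'beta': 4.0}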
Summary
Getting gpt-4o out of every request path does not require a rules engine. It requires a router that observes outcomes. The key insight is that "which model is best for this request?" is an empirical question, not one you can answer definitively at deploy time. Your traffic will change. Your task mix will evolve. A static config is a snapshot of a belief you held once.
Outcome-based routing with Thompson Sampling turns routing into a live experiment. It shifts toward what works, away from what does not, and reports nothing to you because there is nothing to report. It is just working.
The setup cost is lower than writing and maintaining routing rules. The ongoing cost is near zero. The savings compound as your request volume grows.
Kalibr keeps complex AI agents running without human intervention.
Get started free