How to Cut Your LLM Bill in Half Without Touching Your Agent's Quality
GPT-4o for every call is expensive. GPT-4o-mini for every call degrades quality. The right answer is to route each call dynamically based on what the task actually needs, and it's not hard to wire up.
The Cost Problem in Numbers
Let's be concrete. As of early 2026, approximate token pricing:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
A typical agent call breaks down like this:
- Classification / routing decision: ~200 tokens in, ~50 tokens out
- Simple extraction / formatting: ~500 tokens in, ~200 tokens out
- Complex synthesis / reasoning: ~2,000 tokens in, ~800 tokens out
If your agent makes 10,000 calls per day, split evenly among those three types:
All GPT-4o:
- Classification: 3,333 calls x (200 in + 50 out) = ~833K tokens = $1.67 input + $1.67 output = $3.33
- Extraction: 3,333 calls x (500 in + 200 out) = ~2.3M tokens = $4.17 input + $6.67 output = $10.83
- Synthesis: 3,333 calls x (2,000 in + 800 out) = ~9.3M tokens = $16.67 input + $26.67 output = $43.33
- Daily total: ~$57.49
Routed dynamically (classification and extraction on mini, synthesis on 4o):
- Classification: ~$0.20 (mini pricing: $0.10 input + $0.10 output)
- Extraction: ~$0.65 (mini pricing: $0.25 input + $0.40 output)
- Synthesis: ~$43.33 (4o, unchanged)
- Daily total: ~$44.18
That's a 23% reduction just by routing two task types to mini, and the absolute savings scale linearly with call volume. And if your workload has more classification than synthesis, which most production agents do, you'll see 40-60% reductions.
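If you want to sanity-check the estimate against your own traffic mix, the whole calculation fits in a few lines. A standalone sketch: PRICES and daily_cost are illustrative names, with the per-1M rates taken from the table above.

```python
# rough daily-cost estimator; prices are USD per 1M tokens, from the table above
PRICES = {
    "gpt-4o": {"in": 2.50, "out": 10.00},
    "gpt-4o-mini": {"in": 0.15, "out": 0.60},
}

def daily_cost(call_mix: list[tuple[str, int, int, int]]) -> float:
    """call_mix entries: (model, calls_per_day, input_tokens_per_call, output_tokens_per_call)"""
    total = 0.0
    for model, calls, tok_in, tok_out in call_mix:
        p = PRICES[model]
        total += calls * (tok_in * p["in"] + tok_out * p["out"]) / 1_000_000
    return total

all_4o = [("gpt-4o", 3333, 200, 50), ("gpt-4o", 3333, 500, 200), ("gpt-4o", 3333, 2000, 800)]
routed = [("gpt-4o-mini", 3333, 200, 50), ("gpt-4o-mini", 3333, 500, 200), ("gpt-4o", 3333, 2000, 800)]
print(f"all gpt-4o: ${daily_cost(all_4o):.2f}/day, routed: ${daily_cost(routed):.2f}/day")
# roughly $57.49/day vs $44.18/day with this mix
```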
Classify Tasks Before You Route
The simplest version: add a task type tag to every call and route based on it.
from enum import Enum
import openai
client = openai.OpenAI()
class TaskComplexity(Enum):
SIMPLE = "simple" # classification, extraction, formatting
MODERATE = "moderate" # summarization, translation, structured output
COMPLEX = "complex" # reasoning, synthesis, code generation, analysis
MODEL_MAP = {
TaskComplexity.SIMPLE: "gpt-4o-mini",
TaskComplexity.MODERATE: "gpt-4o-mini",
TaskComplexity.COMPLEX: "gpt-4o",
}
def call_model(
prompt: str,
complexity: TaskComplexity,
system_prompt: str = "You are a helpful assistant."
) -> tuple[str, str, int]:
"""Returns (response, model_used, total_tokens)"""
model = MODEL_MAP[complexity]
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
]
)
total_tokens = response.usage.total_tokens
content = response.choices[0].message.content
return content, model, total_tokens
Explicit tagging works well when you control the call sites. Your extraction functions always pass TaskComplexity.SIMPLE. Your synthesis functions always pass TaskComplexity.COMPLEX. You get predictable routing with zero inference overhead.
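For instance, call sites might look like the two hypothetical wrappers below; each one hard-codes the complexity it already knows it needs and delegates to the call_model helper above.

```python
# hypothetical call sites: the complexity tag lives where the task is defined
def extract_invoice_fields(document_text: str) -> str:
    response, _, _ = call_model(
        f"Extract the invoice number, date, and total from:\n{document_text}",
        complexity=TaskComplexity.SIMPLE,  # extraction always routes to mini
    )
    return response

def write_quarterly_analysis(metrics_summary: str) -> str:
    response, _, _ = call_model(
        f"Analyze these metrics and explain the key drivers:\n{metrics_summary}",
        complexity=TaskComplexity.COMPLEX,  # synthesis always routes to gpt-4o
    )
    return response
```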
Auto-Classification When You Don't Control Every Call Site
For agents with a unified call interface, you can auto-classify based on prompt characteristics:
from dataclasses import dataclass
@dataclass
class RoutingDecision:
complexity: TaskComplexity
model: str
reasoning: str
def classify_prompt(prompt: str, context: dict | None = None) -> RoutingDecision:
"""
Classify prompt complexity without an LLM call.
Returns routing decision with reasoning.
"""
prompt_lower = prompt.lower()
word_count = len(prompt.split())
# signals that suggest simple tasks
simple_signals = [
"extract", "parse", "format", "convert", "list",
"true or false", "yes or no", "classify as", "categorize"
]
# signals that suggest complex tasks
complex_signals = [
"analyze", "synthesize", "design", "architect", "reason",
"explain why", "compare and contrast", "what would happen if",
"write code", "debug", "refactor"
]
simple_score = sum(1 for s in simple_signals if s in prompt_lower)
complex_score = sum(1 for s in complex_signals if s in prompt_lower)
# long prompts with complex signals = complex task
if complex_score >= 2 or (word_count > 500 and complex_score >= 1):
return RoutingDecision(
complexity=TaskComplexity.COMPLEX,
model="gpt-4o",
reasoning=f"complex_signals={complex_score}, words={word_count}"
)
# short prompts with simple signals = simple task
if simple_score >= 1 and word_count < 200 and complex_score == 0:
return RoutingDecision(
complexity=TaskComplexity.SIMPLE,
model="gpt-4o-mini",
reasoning=f"simple_signals={simple_score}, words={word_count}"
)
# default to moderate
return RoutingDecision(
complexity=TaskComplexity.MODERATE,
model="gpt-4o-mini",
reasoning=f"default moderate, words={word_count}"
)
def smart_call(prompt: str) -> tuple[str, RoutingDecision]:
decision = classify_prompt(prompt)
response = client.chat.completions.create(
model=decision.model,
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content, decision
# example usage
result, decision = smart_call("Extract all email addresses from this text: ...")
print(f"Used {decision.model} ({decision.reasoning})")
result, decision = smart_call(
"Analyze the architectural tradeoffs between event-driven and request-response "
"patterns for a distributed agent system handling variable workloads..."
)
print(f"Used {decision.model} ({decision.reasoning})")
Budget-Constrained Routing
To enforce a cost ceiling, track spend against a daily limit and tighten the routing threshold as you approach it:
import time
from dataclasses import dataclass, field
@dataclass
class BudgetTracker:
daily_limit_usd: float
spent_usd: float = 0.0
call_count: int = 0
reset_at: float = field(default_factory=lambda: time.time() + 86400)
    # approximate per-token cost, applying the input rate to all tokens (undercounts output-heavy calls)
COST_PER_TOKEN = {
"gpt-4o": 0.0000025, # $2.50 per 1M input
"gpt-4o-mini": 0.00000015, # $0.15 per 1M input
}
def record_call(self, model: str, tokens: int):
if time.time() > self.reset_at:
self.spent_usd = 0.0
self.call_count = 0
self.reset_at = time.time() + 86400
cost = self.COST_PER_TOKEN.get(model, 0.0000025) * tokens
self.spent_usd += cost
self.call_count += 1
def budget_remaining_fraction(self) -> float:
return max(0.0, 1.0 - (self.spent_usd / self.daily_limit_usd))
def select_model_for_complexity(self, complexity: TaskComplexity) -> str:
remaining = self.budget_remaining_fraction()
if remaining < 0.1:
# under 10% budget left, route everything to mini
return "gpt-4o-mini"
if remaining < 0.3:
# under 30% budget, only route COMPLEX to 4o
return "gpt-4o" if complexity == TaskComplexity.COMPLEX else "gpt-4o-mini"
# normal routing
return MODEL_MAP[complexity]
# wire it together
budget = BudgetTracker(daily_limit_usd=50.0)
def budget_aware_call(prompt: str) -> tuple[str, str, float]:
"""Returns (response, model_used, estimated_cost)"""
decision = classify_prompt(prompt)
model = budget.select_model_for_complexity(decision.complexity)
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
tokens = response.usage.total_tokens
budget.record_call(model, tokens)
estimated_cost = budget.COST_PER_TOKEN[model] * tokens
print(
f"Budget: ${budget.spent_usd:.4f} / ${budget.daily_limit_usd} "
f"({budget.budget_remaining_fraction():.0%} remaining)"
)
return response.choices[0].message.content, model, estimated_cost
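One caveat on the tracker: COST_PER_TOKEN applies the input rate to every token, so output-heavy calls are undercounted (output tokens cost 4x more on both models). The response's usage object separates the two, so a tighter per-call figure is easy to compute. A sketch, with rates from the pricing table and a hypothetical exact_call_cost helper:

```python
# per-1M-token prices split by direction, from the pricing table above
PRICE_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def exact_call_cost(model: str, usage) -> float:
    """Cost from the response's usage object instead of a blended per-token rate."""
    rates = PRICE_PER_1M[model]
    return (
        usage.prompt_tokens * rates["input"]
        + usage.completion_tokens * rates["output"]
    ) / 1_000_000

# inside budget_aware_call you could record this instead:
#   cost = exact_call_cost(model, response.usage)
```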
Why Static Routing Decisions Go Stale
Here's the problem with any routing logic you write at deploy time: it encodes your best guess about task distributions right now.
Six months from now:
- Your agent's workload shifts. What was 30% synthesis is now 60%.
- A new model releases at a lower price point. You have to manually update.
- Mini gets noticeably better at a subset of your tasks. You don't know until you run an eval.
The classification rules in the code above are also guesses. "extract" in the prompt is a decent signal for simple tasks, except when someone asks the agent to "extract insights from this 50-page research report and synthesize them with our existing strategy docs," which is definitely not a simple task.
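You can reproduce that failure with classify_prompt as written. If the report itself arrives through retrieval or an attachment rather than inline prompt text, the word-count check never fires and the prompt falls through to the default bucket:

```python
decision = classify_prompt(
    "Extract insights from this 50-page research report and "
    "synthesize them with our existing strategy docs."
)
print(decision.model, decision.reasoning)
# gpt-4o-mini default moderate, words=15
# one simple signal, one complex signal, short prompt: a genuinely hard
# synthesis task quietly lands on mini
```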
Static routing is better than no routing. But it drifts.
What Adaptive Routing Looks Like
Kalibr solves this by tracking outcomes per task type and updating routing weights based on what's actually working:
import kalibr
kalibr.init()
# kalibr's router classifies your task, selects a model based on
# live outcome data, and updates weights after each call
policy = kalibr.get_policy(task_context={
"type": "synthesis",
"input_length": len(prompt),
"priority": "quality" # or "cost" or "speed"
})
# use the recommended model
response = client.chat.completions.create(
model=policy.recommended_model,
messages=[{"role": "user", "content": prompt}]
)
The difference from the static classifier: Kalibr has outcome data from your actual traffic. If gpt-4o-mini starts underperforming on your specific synthesis tasks (measured by downstream task success, not just latency), routing weight shifts back to gpt-4o automatically. You don't have to redeploy to fix a routing regression.
Shipping Order
- Add explicit complexity tags to your call sites. It takes an afternoon and immediately reduces spend on classification and extraction tasks.
- Wire in the `BudgetTracker` so you don't get surprised at the end of the month.
- Instrument your calls with outcome data. Even a simple `success: bool` field alongside each call gives you the data you need to evaluate whether your routing is working (see the sketch after this list).
- Once you have outcome data, let it drive routing weights. That's when the system improves itself.
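Here is a minimal sketch of that instrumentation, assuming an in-memory list and hypothetical CallOutcome / log_outcome names; swap in whatever storage you already have.

```python
import time
from dataclasses import dataclass

@dataclass
class CallOutcome:
    task_type: str      # e.g. "classification", "extraction", "synthesis"
    model: str
    total_tokens: int
    success: bool       # however your agent defines it: parse succeeded, user accepted, etc.
    timestamp: float

outcomes: list[CallOutcome] = []

def log_outcome(task_type: str, model: str, total_tokens: int, success: bool) -> None:
    outcomes.append(CallOutcome(task_type, model, total_tokens, success, time.time()))

def success_rate(task_type: str, model: str) -> float | None:
    """Per-model success rate for a task type: the number that tells you whether routing is working."""
    rows = [o for o in outcomes if o.task_type == task_type and o.model == model]
    return sum(o.success for o in rows) / len(rows) if rows else None
```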
The token cost math is not subtle. Routing classification calls to mini and synthesis calls to gpt-4o is probably the highest-ROI thing you can do in an afternoon for an agent in production.
Kalibr keeps complex AI agents running without human intervention.