April 2026

How to Cut Your LLM Bill in Half Without Touching Your Agent's Quality

GPT-4o for every call is expensive. GPT-4o-mini for every call degrades quality. The right answer routes dynamically based on what each task actually needs, and it's not hard to wire up.

The Cost Problem in Numbers

Let's be concrete. As of early 2026, approximate token pricing:

Model          Input (per 1M tokens)    Output (per 1M tokens)
gpt-4o         $2.50                    $10.00
gpt-4o-mini    $0.15                    $0.60

A typical agent workload mixes three kinds of calls: classification, extraction, and synthesis. Suppose your agent makes 10,000 calls per day, split evenly between those three types. Run everything on GPT-4o and you pay the premium rate on every call, including the trivial ones. Route dynamically instead (classification and extraction on mini, synthesis on 4o) and the bill drops by roughly 23%, just from moving two task types to mini. At higher call volumes, the gap compounds. And if your workload has more classification than synthesis, which most production agents do, you'll see 40-60% reductions.
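The arithmetic is easy to sanity-check yourself. The per-call token counts below are assumptions for illustration, not measurements; plug in your own numbers from response.usage. With these assumed counts, the savings land a little under the ~23% figure cited above:

```python
PRICES = {  # $ per 1M tokens: (input, output)
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}
TOKENS = {  # assumed (input, output) tokens per call, by task type
    "classification": (500, 100),
    "extraction": (1000, 200),
    "synthesis": (4000, 1200),
}
CALLS_PER_DAY = 10_000  # split evenly across the three types

def cost_per_call(model: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

def daily_cost(routing: dict[str, str]) -> float:
    """Total daily spend given a task-type -> model assignment."""
    per_type = CALLS_PER_DAY / len(TOKENS)
    return sum(
        per_type * cost_per_call(routing[task], *TOKENS[task])
        for task in TOKENS
    )

all_4o = daily_cost({t: "gpt-4o" for t in TOKENS})
routed = daily_cost({"classification": "gpt-4o-mini",
                     "extraction": "gpt-4o-mini",
                     "synthesis": "gpt-4o"})
print(f"all gpt-4o: ${all_4o:.2f}/day, routed: ${routed:.2f}/day, "
      f"savings: {1 - routed / all_4o:.0%}")
```

Because synthesis calls carry most of the tokens, the savings come almost entirely from the two cheap task types; the heavier your classification share, the bigger the win.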


Classify Tasks Before You Route

The simplest version: add a task type tag to every call and route based on it.

from enum import Enum
import openai

client = openai.OpenAI()

class TaskComplexity(Enum):
    SIMPLE = "simple"      # classification, extraction, formatting
    MODERATE = "moderate"  # summarization, translation, structured output
    COMPLEX = "complex"    # reasoning, synthesis, code generation, analysis

MODEL_MAP = {
    TaskComplexity.SIMPLE: "gpt-4o-mini",
    TaskComplexity.MODERATE: "gpt-4o-mini",
    TaskComplexity.COMPLEX: "gpt-4o",
}

def call_model(
    prompt: str,
    complexity: TaskComplexity,
    system_prompt: str = "You are a helpful assistant."
) -> tuple[str, str, int]:
    """Returns (response, model_used, total_tokens)"""
    model = MODEL_MAP[complexity]
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    
    total_tokens = response.usage.total_tokens
    content = response.choices[0].message.content
    
    return content, model, total_tokens

Explicit tagging works well when you control the call sites. Your extraction functions always pass TaskComplexity.SIMPLE. Your synthesis functions always pass TaskComplexity.COMPLEX. You get predictable routing with zero inference overhead.
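In practice, call-site tagging just means each function declares its complexity once at the definition. The wrapper names below (extract_emails and friends) are illustrative, and the model lookup is shown without the network call so the routing itself stays visible:

```python
from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

MODEL_MAP = {
    TaskComplexity.SIMPLE: "gpt-4o-mini",
    TaskComplexity.MODERATE: "gpt-4o-mini",
    TaskComplexity.COMPLEX: "gpt-4o",
}

# each call site owns its tag; names here are hypothetical examples
CALL_SITE_COMPLEXITY = {
    "extract_emails": TaskComplexity.SIMPLE,
    "summarize_thread": TaskComplexity.MODERATE,
    "plan_refactor": TaskComplexity.COMPLEX,
}

routing = {fn: MODEL_MAP[tag] for fn, tag in CALL_SITE_COMPLEXITY.items()}
print(routing)
```

The routing table is fully determined at deploy time, which is exactly the strength (predictable) and the weakness (static) discussed below.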


Auto-Classification When You Don't Control Every Call Site

For agents with a unified call interface, you can auto-classify based on prompt characteristics:

from dataclasses import dataclass

@dataclass
class RoutingDecision:
    complexity: TaskComplexity
    model: str
    reasoning: str

def classify_prompt(prompt: str, context: dict | None = None) -> RoutingDecision:
    """
    Classify prompt complexity without an LLM call.
    Returns routing decision with reasoning.
    """
    prompt_lower = prompt.lower()
    word_count = len(prompt.split())
    
    # signals that suggest simple tasks
    simple_signals = [
        "extract", "parse", "format", "convert", "list",
        "true or false", "yes or no", "classify as", "categorize"
    ]
    
    # signals that suggest complex tasks
    complex_signals = [
        "analyze", "synthesize", "design", "architect", "reason",
        "explain why", "compare and contrast", "what would happen if",
        "write code", "debug", "refactor"
    ]
    
    simple_score = sum(1 for s in simple_signals if s in prompt_lower)
    complex_score = sum(1 for s in complex_signals if s in prompt_lower)
    
    # long prompts with complex signals = complex task
    if complex_score >= 2 or (word_count > 500 and complex_score >= 1):
        return RoutingDecision(
            complexity=TaskComplexity.COMPLEX,
            model="gpt-4o",
            reasoning=f"complex_signals={complex_score}, words={word_count}"
        )
    
    # short prompts with simple signals = simple task
    if simple_score >= 1 and word_count < 200 and complex_score == 0:
        return RoutingDecision(
            complexity=TaskComplexity.SIMPLE,
            model="gpt-4o-mini",
            reasoning=f"simple_signals={simple_score}, words={word_count}"
        )
    
    # default to moderate
    return RoutingDecision(
        complexity=TaskComplexity.MODERATE,
        model="gpt-4o-mini",
        reasoning=f"default moderate, words={word_count}"
    )

def smart_call(prompt: str) -> tuple[str, RoutingDecision]:
    decision = classify_prompt(prompt)
    
    response = client.chat.completions.create(
        model=decision.model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content, decision

# example usage
result, decision = smart_call("Extract all email addresses from this text: ...")
print(f"Used {decision.model} ({decision.reasoning})")

result, decision = smart_call(
    "Analyze the architectural tradeoffs between event-driven and request-response "
    "patterns for a distributed agent system handling variable workloads..."
)
print(f"Used {decision.model} ({decision.reasoning})")

Budget-Constrained Routing

For cost ceiling enforcement, track spend per session and tighten the routing threshold as you approach limits:

import time
from dataclasses import dataclass, field

@dataclass  
class BudgetTracker:
    daily_limit_usd: float
    spent_usd: float = 0.0
    call_count: int = 0
    reset_at: float = field(default_factory=lambda: time.time() + 86400)
    
    # approximate cost per token (input cost dominates)
    COST_PER_TOKEN = {
        "gpt-4o": 0.0000025,       # $2.50 per 1M input
        "gpt-4o-mini": 0.00000015,  # $0.15 per 1M input
    }
    
    def record_call(self, model: str, tokens: int):
        if time.time() > self.reset_at:
            self.spent_usd = 0.0
            self.call_count = 0
            self.reset_at = time.time() + 86400
        
        cost = self.COST_PER_TOKEN.get(model, 0.0000025) * tokens
        self.spent_usd += cost
        self.call_count += 1
    
    def budget_remaining_fraction(self) -> float:
        return max(0.0, 1.0 - (self.spent_usd / self.daily_limit_usd))
    
    def select_model_for_complexity(self, complexity: TaskComplexity) -> str:
        remaining = self.budget_remaining_fraction()
        
        if remaining < 0.1:
            # under 10% budget left, route everything to mini
            return "gpt-4o-mini"
        
        if remaining < 0.3:
            # under 30% budget, only COMPLEX gets 4o (matches MODEL_MAP today,
            # but stays correct if MODERATE is ever upgraded to gpt-4o)
            return "gpt-4o" if complexity == TaskComplexity.COMPLEX else "gpt-4o-mini"
        
        # normal routing
        return MODEL_MAP[complexity]


# wire it together
budget = BudgetTracker(daily_limit_usd=50.0)

def budget_aware_call(prompt: str) -> tuple[str, str, float]:
    """Returns (response, model_used, estimated_cost)"""
    decision = classify_prompt(prompt)
    model = budget.select_model_for_complexity(decision.complexity)
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    tokens = response.usage.total_tokens
    budget.record_call(model, tokens)
    
    estimated_cost = budget.COST_PER_TOKEN[model] * tokens
    
    print(
        f"Budget: ${budget.spent_usd:.4f} / ${budget.daily_limit_usd} "
        f"({budget.budget_remaining_fraction():.0%} remaining)"
    )
    
    return response.choices[0].message.content, model, estimated_cost
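The tightening behavior is easiest to see in isolation. The helper below condenses select_model_for_complexity from BudgetTracker above into a pure function of the remaining-budget fraction:

```python
def route(remaining: float, complexity: str) -> str:
    """Condensed version of BudgetTracker.select_model_for_complexity."""
    if remaining < 0.1:
        return "gpt-4o-mini"  # hard floor: everything to mini
    if remaining < 0.3:
        return "gpt-4o" if complexity == "complex" else "gpt-4o-mini"
    # normal MODEL_MAP routing
    return {"simple": "gpt-4o-mini",
            "moderate": "gpt-4o-mini",
            "complex": "gpt-4o"}[complexity]

for remaining in (0.8, 0.25, 0.05):
    print(f"{remaining:.0%} budget left, complex task -> {route(remaining, 'complex')}")
# complex tasks keep gpt-4o until the final 10% of budget
```

Complex tasks hold on to gpt-4o as long as possible; only the last slice of budget forces a quality tradeoff.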

Why Static Routing Decisions Go Stale

Here's the problem with any routing logic you write at deploy time: it encodes your best guess about task distributions right now.

Six months from now:

The classification rules in the code above are also guesses. "extract" in the prompt is a decent signal for simple tasks, except when someone asks the agent to "extract insights from this 50-page research report and synthesize them with our existing strategy docs," which is definitely not a simple task.

Static routing is better than no routing. But it drifts.
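You can watch the misclassification happen. Running the signal-counting logic from classify_prompt above against that exact prompt (condensed here to just the scores):

```python
simple_signals = ["extract", "parse", "format", "convert", "list",
                  "true or false", "yes or no", "classify as", "categorize"]
complex_signals = ["analyze", "synthesize", "design", "architect", "reason",
                   "explain why", "compare and contrast", "what would happen if",
                   "write code", "debug", "refactor"]

prompt = ("Extract insights from this 50-page research report and "
          "synthesize them with our existing strategy docs.")
p = prompt.lower()
simple_score = sum(s in p for s in simple_signals)    # "extract" matches -> 1
complex_score = sum(s in p for s in complex_signals)  # "synthesize" matches -> 1

# classify_prompt needs complex_score >= 2 (or a 500+ word prompt) to pick
# gpt-4o, so this short prompt falls through to the MODERATE default
model = "gpt-4o" if complex_score >= 2 else "gpt-4o-mini"
print(model)  # gpt-4o-mini
```

A genuinely hard synthesis task gets routed to mini because it happened to use the word "extract" and only one complex signal. No threshold tweak fixes this class of error; the rules can only encode what you anticipated.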


What Adaptive Routing Looks Like

Kalibr solves this by tracking outcomes per task type and updating routing weights based on what's actually working:

import kalibr

kalibr.init()

# kalibr's router classifies your task, selects a model based on 
# live outcome data, and updates weights after each call
policy = kalibr.get_policy(task_context={
    "type": "synthesis",
    "input_length": len(prompt),
    "priority": "quality"  # or "cost" or "speed"
})

# use the recommended model
response = client.chat.completions.create(
    model=policy.recommended_model,
    messages=[{"role": "user", "content": prompt}]
)

The difference from the static classifier: Kalibr has outcome data from your actual traffic. If gpt-4o-mini starts underperforming on your specific synthesis tasks (measured by downstream task success, not just latency), routing weight shifts back to gpt-4o automatically. You don't have to redeploy to fix a routing regression.
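Kalibr's internals aren't shown here, but as a mental model, outcome-weighted routing behaves like a tiny bandit-style router. Everything below is an illustrative sketch, not Kalibr's API or implementation:

```python
import random
from collections import defaultdict

class OutcomeRouter:
    """Toy outcome-weighted router: shift traffic toward whichever
    model is actually succeeding on each task type."""

    def __init__(self, models: list[str], explore: float = 0.1):
        self.models = models
        self.explore = explore  # fraction of traffic kept for exploration
        # optimistic prior so new (task, model) pairs get tried
        self.stats = defaultdict(lambda: {"success": 1, "total": 2})

    def pick(self, task_type: str) -> str:
        if random.random() < self.explore:
            return random.choice(self.models)
        return max(self.models, key=lambda m: self._rate(task_type, m))

    def record(self, task_type: str, model: str, success: bool) -> None:
        s = self.stats[(task_type, model)]
        s["total"] += 1
        s["success"] += int(success)

    def _rate(self, task_type: str, model: str) -> float:
        s = self.stats[(task_type, model)]
        return s["success"] / s["total"]

router = OutcomeRouter(["gpt-4o-mini", "gpt-4o"], explore=0.0)
for _ in range(5):
    router.record("synthesis", "gpt-4o-mini", success=False)
    router.record("synthesis", "gpt-4o", success=True)
print(router.pick("synthesis"))  # gpt-4o
```

After a handful of failed synthesis calls on mini, traffic shifts back to gpt-4o with no redeploy, which is the property that matters.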


Shipping Order

  1. Add explicit complexity tags to your call sites. It takes an afternoon and immediately reduces spend on classification and extraction tasks.
  2. Wire in the BudgetTracker so you don't get surprised at the end of the month.
  3. Instrument your calls with outcome data. Even a simple success: bool field alongside each call gives you the data you need to evaluate whether your routing is working.
  4. Once you have outcome data, let it drive routing weights. That's when the system improves itself.
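Step 3 can start as small as one JSONL line per call. The log_outcome and success_rate helpers below are hypothetical, not part of any SDK, but they're enough to tell you whether a model is holding up on a task type:

```python
import json
import time

def log_outcome(path: str, model: str, task_type: str,
                tokens: int, success: bool) -> None:
    """Append one outcome record per call; enough to evaluate routing later."""
    record = {"ts": time.time(), "model": model, "task_type": task_type,
              "tokens": tokens, "success": success}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def success_rate(path: str, model: str) -> float:
    """Fraction of logged calls for `model` that succeeded."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    hits = [r for r in rows if r["model"] == model]
    return sum(r["success"] for r in hits) / len(hits) if hits else 0.0
```

A daily success_rate check per (model, task_type) pair is the simplest possible version of the outcome data that step 4 turns into routing weights.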

The token cost math is not subtle. Routing classification calls to mini and synthesis calls to gpt-4o is probably the highest-ROI thing you can do in an afternoon for an agent in production.

Kalibr keeps complex AI agents running without human intervention.

Get started free