Routing Python LLM Requests to Cheaper Models Based on Task Complexity
Not all LLM requests are equal, but most systems treat them as if they are. A classification task and a multi-document synthesis task both hit the same client.chat.completions.create() call with the same model, yet the classification task needs a fraction of the capability you are paying for.
This post shows how to route Python LLM requests to cheaper models based on task complexity: first by building a complete router from scratch, then by replacing the boilerplate with Kalibr.
Complexity Signals That Actually Matter
Before writing any routing code, you need to know what makes a task complex. There are four signals that are reliable and easy to extract without an extra API call:
Input length. Longer inputs generally require more synthesis. A 50-token prompt asking for a label is different from a 3,000-token prompt asking for an analysis.
Task type. Extraction and classification tasks have bounded outputs and require matching, not reasoning. Generation tasks (writing, synthesis, analysis) require more capability.
Structured output requirements. If you need JSON with a specific schema, nested fields, or constrained types, failure modes are more expensive. A cheap model that returns malformed JSON costs you a retry.
Presence of multi-step reasoning cues. Phrases like "compare and contrast," "explain why," and "what are the tradeoffs" signal that the model needs to hold multiple threads simultaneously.
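All four signals can be computed locally, without an extra model call. For the length signal, the router below uses word count as a cheap proxy; if you want exact token counts instead, tiktoken measures them offline. A minimal sketch, assuming a recent version of tiktoken (o200k_base is the encoding used by the gpt-4o family):
import tiktoken
# Exact token count, computed locally; no API call required.
_ENC = tiktoken.get_encoding("o200k_base")
def count_tokens(text: str) -> int:
    return len(_ENC.encode(text))
print(count_tokens("Extract the company name from: 'I work at Acme Corp'"))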
Building a TaskComplexityRouter from Scratch
Here is a complete implementation with complexity scoring, model selection, and outcome tracking:
from __future__ import annotations
import time
from dataclasses import dataclass
from typing import Any
from openai import OpenAI
client = OpenAI()
@dataclass
class ModelStats:
model: str
successes: int = 0
failures: int = 0
total_cost_usd: float = 0.0
@property
def success_rate(self) -> float:
total = self.successes + self.failures
if total == 0:
return 0.5 # Neutral prior
return self.successes / total
# Cost in USD per 1M tokens: (input rate, output rate)
MODEL_COSTS = {
"gpt-4o-mini": (0.15, 0.60),
"gpt-4o": (2.50, 10.00),
}
def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
input_rate, output_rate = MODEL_COSTS.get(model, (2.50, 10.00))
return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
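# Example: 1,000 input tokens + 200 output tokens on gpt-4o-mini costs
# (1000 * 0.15 + 200 * 0.60) / 1_000_000 = $0.00027; the same call on gpt-4o
# costs (1000 * 2.50 + 200 * 10.00) / 1_000_000 = $0.0045, roughly 17x more.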
class TaskComplexityRouter:
"""Routes LLM requests to cheaper models based on task complexity."""
# Complexity score thresholds
SIMPLE_THRESHOLD = 3
COMPLEX_THRESHOLD = 6
def __init__(self):
self.stats: dict[str, ModelStats] = {
model: ModelStats(model=model) for model in MODEL_COSTS
}
self._history: list[dict] = []
def score_complexity(self, prompt: str, task_type: str = "general") -> int:
"""Returns a complexity score from 0 (very simple) to 10 (very complex)."""
score = 0
words = prompt.split()
# Input length signal (0-3 points)
if len(words) < 50:
score += 0
elif len(words) < 200:
score += 1
elif len(words) < 500:
score += 2
else:
score += 3
# Task type signal (0-3 points)
simple_types = {"classification", "extraction", "label", "yes_no", "entity_extraction"}
complex_types = {"synthesis", "analysis", "generation", "reasoning", "comparison"}
if task_type in simple_types:
score += 0
elif task_type in complex_types:
score += 3
else:
score += 1 # Unknown task type gets middle score
# Structured output complexity (0-2 points)
structured_keywords = ["json", "schema", "nested", "array of", "list of objects"]
if any(kw in prompt.lower() for kw in structured_keywords):
score += 2
# Multi-step reasoning cues (0-2 points)
reasoning_cues = [
"compare", "contrast", "tradeoffs", "explain why", "analyze",
"evaluate", "what are the implications", "step by step",
"pros and cons", "synthesize"
]
cue_count = sum(1 for cue in reasoning_cues if cue in prompt.lower())
score += min(2, cue_count)
return min(10, score)
def select_model(self, complexity_score: int) -> str:
"""Select model based on complexity score."""
if complexity_score <= self.SIMPLE_THRESHOLD:
return "gpt-4o-mini"
elif complexity_score >= self.COMPLEX_THRESHOLD:
return "gpt-4o"
else:
# Middle range: use success rate data to decide
mini_stats = self.stats["gpt-4o-mini"]
gpt4o_stats = self.stats["gpt-4o"]
            # If mini has a good overall track record, prefer it for the middle range
if mini_stats.success_rate >= 0.85 and mini_stats.successes >= 10:
return "gpt-4o-mini"
return "gpt-4o"
def complete(
self,
prompt: str,
task_type: str = "general",
system_prompt: str | None = None,
**kwargs: Any,
) -> dict:
"""Route and execute completion, returning response + metadata."""
complexity = self.score_complexity(prompt, task_type)
model = self.select_model(complexity)
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": prompt})
start = time.perf_counter()
response = client.chat.completions.create(
model=model,
messages=messages,
**kwargs
)
latency_ms = (time.perf_counter() - start) * 1000
usage = response.usage
cost = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)
record = {
"model": model,
"complexity_score": complexity,
"task_type": task_type,
"input_tokens": usage.prompt_tokens,
"output_tokens": usage.completion_tokens,
"cost_usd": cost,
"latency_ms": latency_ms,
}
self._history.append(record)
return {
"response": response,
"metadata": record,
}
def report_outcome(self, model: str, success: bool) -> None:
"""Update model stats based on observed outcome."""
stats = self.stats[model]
if success:
stats.successes += 1
else:
stats.failures += 1
def cost_report(self) -> dict:
"""Summarize cost and routing decisions."""
total_cost = sum(r["cost_usd"] for r in self._history)
mini_calls = sum(1 for r in self._history if r["model"] == "gpt-4o-mini")
gpt4o_calls = sum(1 for r in self._history if r["model"] == "gpt-4o")
return {
"total_cost_usd": round(total_cost, 6),
"total_requests": len(self._history),
"gpt4o_mini_calls": mini_calls,
"gpt4o_calls": gpt4o_calls,
"mini_fraction": mini_calls / len(self._history) if self._history else 0,
}
Using it:
router = TaskComplexityRouter()
# Simple extraction task
result = router.complete(
prompt="Extract the company name from: 'I work at Acme Corp in San Francisco'",
task_type="extraction",
system_prompt="Extract the company name. Return only the name, nothing else."
)
print(result["metadata"]["model"]) # gpt-4o-mini
print(result["metadata"]["complexity_score"]) # 0-2
# Complex analysis task
result2 = router.complete(
prompt=(
"Analyze the competitive dynamics between cloud providers in the AI inference market. "
"Compare AWS, GCP, and Azure on latency, cost, and developer experience. "
"What are the tradeoffs a startup should consider when choosing a provider?"
),
task_type="analysis",
)
print(result2["metadata"]["model"]) # gpt-4o
print(result2["metadata"]["complexity_score"]) # 7-9
# Report outcomes
router.report_outcome(result["metadata"]["model"], success=True)
router.report_outcome(result2["metadata"]["model"], success=True)
print(router.cost_report())
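In the example above, success=True is hardcoded. In practice the outcome should come from a check on the output; here is a minimal sketch for a JSON extraction task (validate_json is an illustrative helper, not part of the router):
import json
def validate_json(output: str) -> bool:
    """Treat the call as a success only if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
json_result = router.complete(
    prompt="Return a JSON object with keys 'company' and 'city' from: 'I work at Acme Corp in San Francisco'",
    task_type="extraction",
)
content = json_result["response"].choices[0].message.content
router.report_outcome(json_result["metadata"]["model"], success=validate_json(content))
Real validators usually need to strip markdown fences before parsing, but the shape is the same: check the output, then report the outcome.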
This router works. It handles the core complexity signals, tracks costs, and adapts model selection based on success rates. The weakness: it is static routing logic decorated with stats. The success rates inform future middle-range decisions, but the thresholds and signals are hardcoded.
Replacing the Boilerplate with Kalibr
The TaskComplexityRouter above is roughly 140 lines of code that you now own, test, and maintain. Kalibr handles the routing and outcome tracking with a few lines:
import kalibr # Must be first import
from kalibr import Router
router = Router(
paths=[
{"model": "openai/gpt-4o-mini", "weight": 0.7},
{"model": "openai/gpt-4o", "weight": 0.3},
],
success_when="response.choices[0].finish_reason == 'stop'",
goal_id="task_complexity_routing"
)
def complete(prompt: str, system_prompt: str | None = None) -> str:
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": prompt})
response = router.completion(messages=messages)
return response.choices[0].message.content
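Calling it looks the same as before; model selection and outcome tracking happen inside router.completion:
label = complete(
    "Extract the company name from: 'I work at Acme Corp in San Francisco'",
    system_prompt="Return only the company name, nothing else.",
)
print(label) # Acme Corp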
For explicit outcome reporting:
import re

from kalibr import Router, Outcome
router = Router(
paths=[
{"model": "openai/gpt-4o-mini", "weight": 0.7},
{"model": "openai/gpt-4o", "weight": 0.3},
],
goal_id="task_complexity_routing_v2"
)
def complete_with_feedback(prompt: str, validator=None) -> dict:
response, request_id = router.completion(
messages=[{"role": "user", "content": prompt}],
return_request_id=True
)
content = response.choices[0].message.content
# Validate output if validator provided
success = True
if validator is not None:
success = validator(content)
router.report_outcome(
request_id=request_id,
outcome=Outcome.SUCCESS if success else Outcome.FAILURE
)
return {"content": content, "request_id": request_id, "success": success}
# For an extraction task, validate that the extracted value looks reasonable
def is_valid_company_name(output: str) -> bool:
    # Non-empty, reasonably short, and free of bracket or brace characters
return (
len(output.strip()) > 0
and len(output.strip()) < 100
and not re.search(r"[<>{}\[\]]", output)
)
result = complete_with_feedback(
"Extract the company name from: 'I work at Stripe in Dublin'",
validator=is_valid_company_name
)
print(result["content"]) # Stripe
What Kalibr's Thompson Sampling Adds
The handwritten TaskComplexityRouter uses deterministic thresholds. Kalibr uses Thompson Sampling, which is probabilistic. The difference matters at scale:
- Thompson Sampling keeps exploring. Even when gpt-4o-mini is winning 90% of the time, it still occasionally routes to gpt-4o. This matters when your traffic mix shifts: a new integration starts sending harder tasks, and the router needs to detect that gpt-4o-mini is now failing more often.
- Thompson Sampling is self-correcting. You do not need to tune thresholds. You report outcomes, and the allocation adjusts automatically.
- Thompson Sampling handles uncertainty well. When you have few data points, it stays cautious. As data accumulates, confidence increases and routing becomes more decisive.
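For intuition, here is what Thompson Sampling over two models looks like in isolation. This is a standalone illustration of the technique, not Kalibr's internals: each model keeps a Beta(successes + 1, failures + 1) posterior, every request samples from both posteriors, and the higher draw wins.
import random
stats = {"gpt-4o-mini": {"s": 0, "f": 0}, "gpt-4o": {"s": 0, "f": 0}}
def pick_model() -> str:
    # Sample each model's posterior; the higher draw gets this request.
    draws = {
        model: random.betavariate(c["s"] + 1, c["f"] + 1)
        for model, c in stats.items()
    }
    return max(draws, key=draws.get)
def report(model: str, success: bool) -> None:
    stats[model]["s" if success else "f"] += 1
A model with a strong record wins most draws but not all of them, so the router keeps collecting fresh evidence about the alternative, which is exactly the exploration behavior described above.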
Summary
Routing Python LLM requests to cheaper models based on task complexity is not complicated: the signals are observable and the code is straightforward. The TaskComplexityRouter class above handles the core problem.
The argument for Kalibr is not that it does something different. It is that it removes the code you have to own and adds the feedback loop that static routing lacks. The complexity signals you would hardcode into rules become implicit in the outcome data. The router observes what works, not what you thought would work.
Start with the heuristics. Move to outcome-based routing when the rules become a maintenance problem.
Kalibr keeps complex AI agents running without human intervention.
Get started free