April 2026

Routing Python LLM Requests to Cheaper Models Based on Task Complexity

Not all LLM requests are equal, but most systems treat them as if they are. A classification task and a multi-document synthesis task both hit the same client.chat.completions.create() call with the same model. One of them needs only a fraction of the capability you are paying for.

This post shows how to route Python LLM requests to cheaper models based on task complexity: first by building a complete router from scratch, then by replacing the boilerplate with Kalibr.

Complexity Signals That Actually Matter

Before writing any routing code, you need to know what makes a task complex. There are four signals that are reliable and easy to extract without an extra API call:

Input length. Longer inputs generally require more synthesis. A 50-token prompt asking for a label is different from a 3,000-token prompt asking for an analysis.

Task type. Extraction and classification tasks have bounded outputs and require matching, not reasoning. Generation tasks (writing, synthesis, analysis) require more capability.

Structured output requirements. If you need JSON with a specific schema, nested fields, or constrained types, failure modes are more expensive. A cheap model that returns malformed JSON costs you a retry.

Presence of multi-step reasoning cues. Phrases like "compare and contrast," "explain why," "what are the tradeoffs" signal that the model needs to hold multiple threads simultaneously.

Building a TaskComplexityRouter from Scratch

Here is a complete implementation with complexity scoring, model selection, and outcome tracking:

from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any

from openai import OpenAI

client = OpenAI()


@dataclass
class ModelStats:
    model: str
    successes: int = 0
    failures: int = 0
    total_cost_usd: float = 0.0

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        if total == 0:
            return 0.5  # Neutral prior
        return self.successes / total


# Token costs per 1M tokens (input, output)
MODEL_COSTS = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o":      (2.50, 10.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_rate, output_rate = MODEL_COSTS.get(model, (2.50, 10.00))
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


class TaskComplexityRouter:
    """Routes LLM requests to cheaper models based on task complexity."""

    # Complexity score thresholds
    SIMPLE_THRESHOLD = 3
    COMPLEX_THRESHOLD = 6

    def __init__(self):
        self.stats: dict[str, ModelStats] = {
            model: ModelStats(model=model) for model in MODEL_COSTS
        }
        self._history: list[dict] = []

    def score_complexity(self, prompt: str, task_type: str = "general") -> int:
        """Returns a complexity score from 0 (very simple) to 10 (very complex)."""
        score = 0
        words = prompt.split()

        # Input length signal (0-3 points)
        if len(words) < 50:
            score += 0
        elif len(words) < 200:
            score += 1
        elif len(words) < 500:
            score += 2
        else:
            score += 3

        # Task type signal (0-3 points)
        simple_types = {"classification", "extraction", "label", "yes_no", "entity_extraction"}
        complex_types = {"synthesis", "analysis", "generation", "reasoning", "comparison"}

        if task_type in simple_types:
            score += 0
        elif task_type in complex_types:
            score += 3
        else:
            score += 1  # Unknown task type gets middle score

        # Structured output complexity (0-2 points)
        structured_keywords = ["json", "schema", "nested", "array of", "list of objects"]
        if any(kw in prompt.lower() for kw in structured_keywords):
            score += 2

        # Multi-step reasoning cues (0-2 points)
        reasoning_cues = [
            "compare", "contrast", "tradeoffs", "explain why", "analyze",
            "evaluate", "what are the implications", "step by step",
            "pros and cons", "synthesize"
        ]
        cue_count = sum(1 for cue in reasoning_cues if cue in prompt.lower())
        score += min(2, cue_count)

        return min(10, score)

    def select_model(self, complexity_score: int) -> str:
        """Select model based on complexity score."""
        if complexity_score <= self.SIMPLE_THRESHOLD:
            return "gpt-4o-mini"
        elif complexity_score >= self.COMPLEX_THRESHOLD:
            return "gpt-4o"
        else:
            # Middle range: use success rate data to decide
            mini_stats = self.stats["gpt-4o-mini"]

            # If mini has a good track record on medium complexity, prefer it
            if mini_stats.success_rate >= 0.85 and mini_stats.successes >= 10:
                return "gpt-4o-mini"
            return "gpt-4o"

    def complete(
        self,
        prompt: str,
        task_type: str = "general",
        system_prompt: str | None = None,
        **kwargs: Any,
    ) -> dict:
        """Route and execute completion, returning response + metadata."""
        complexity = self.score_complexity(prompt, task_type)
        model = self.select_model(complexity)

        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": prompt})

        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
        latency_ms = (time.perf_counter() - start) * 1000

        usage = response.usage
        cost = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)

        record = {
            "model": model,
            "complexity_score": complexity,
            "task_type": task_type,
            "input_tokens": usage.prompt_tokens,
            "output_tokens": usage.completion_tokens,
            "cost_usd": cost,
            "latency_ms": latency_ms,
        }
        self._history.append(record)

        return {
            "response": response,
            "metadata": record,
        }

    def report_outcome(self, model: str, success: bool) -> None:
        """Update model stats based on observed outcome."""
        stats = self.stats[model]
        if success:
            stats.successes += 1
        else:
            stats.failures += 1

    def cost_report(self) -> dict:
        """Summarize cost and routing decisions."""
        total_cost = sum(r["cost_usd"] for r in self._history)
        mini_calls = sum(1 for r in self._history if r["model"] == "gpt-4o-mini")
        gpt4o_calls = sum(1 for r in self._history if r["model"] == "gpt-4o")

        return {
            "total_cost_usd": round(total_cost, 6),
            "total_requests": len(self._history),
            "gpt4o_mini_calls": mini_calls,
            "gpt4o_calls": gpt4o_calls,
            "mini_fraction": mini_calls / len(self._history) if self._history else 0,
        }

Using it:

router = TaskComplexityRouter()

# Simple extraction task
result = router.complete(
    prompt="Extract the company name from: 'I work at Acme Corp in San Francisco'",
    task_type="extraction",
    system_prompt="Extract the company name. Return only the name, nothing else."
)
print(result["metadata"]["model"])  # gpt-4o-mini
print(result["metadata"]["complexity_score"])  # 0

# Complex synthesis task
result2 = router.complete(
    prompt=(
        "Analyze the competitive dynamics between cloud providers in the AI inference market. "
        "Compare AWS, GCP, and Azure on latency, cost, and developer experience. "
        "What are the tradeoffs a startup should consider when choosing a provider?"
    ),
    task_type="analysis",
)
print(result2["metadata"]["model"])  # gpt-4o
print(result2["metadata"]["complexity_score"])  # 5 (middle range; routed by success-rate data)

# Report outcomes
router.report_outcome(result["metadata"]["model"], success=True)
router.report_outcome(result2["metadata"]["model"], success=True)

print(router.cost_report())

This router works. It handles the core complexity signals, tracks costs, and adapts model selection based on success rates. The weakness: it is static routing logic decorated with stats. The success rates inform future middle-range decisions, but the thresholds and signals are hardcoded.
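One pattern worth layering on top addresses the structured-output failure mode mentioned earlier: try the cheap model first and escalate to the expensive one only when validation fails. A minimal sketch, where call_model is a hypothetical stand-in for whatever client call your router makes:

```python
import json
from typing import Callable


def complete_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],  # stand-in: (model, prompt) -> raw text
    cheap: str = "gpt-4o-mini",
    strong: str = "gpt-4o",
) -> dict:
    """Try the cheap model first; escalate only if its JSON fails to parse."""
    raw = call_model(cheap, prompt)
    try:
        return {"model": cheap, "data": json.loads(raw)}
    except json.JSONDecodeError:
        # Malformed output costs one cheap retry, not a silent bad result.
        raw = call_model(strong, prompt)
        return {"model": strong, "data": json.loads(raw)}


# Demo with a stub that simulates the cheap model returning broken JSON.
def stub(model: str, prompt: str) -> str:
    return '{"company": "Acme Corp"}' if model == "gpt-4o" else '{"company": '


result = complete_with_escalation("Extract the company as JSON.", stub)
print(result["model"])  # gpt-4o
```

This keeps the happy path cheap: well-formed output from the small model never touches the expensive one.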

Replacing the Boilerplate with Kalibr

The TaskComplexityRouter above is roughly 150 lines of code that you now own, test, and maintain. Kalibr handles the routing and outcome tracking with a few lines:

import kalibr  # Must be first import
from kalibr import Router

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini", "weight": 0.7},
        {"model": "openai/gpt-4o",      "weight": 0.3},
    ],
    success_when="response.choices[0].finish_reason == 'stop'",
    goal_id="task_complexity_routing"
)


def complete(prompt: str, system_prompt: str | None = None) -> str:
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    response = router.completion(messages=messages)
    return response.choices[0].message.content

For explicit outcome reporting:

import re

from kalibr import Router, Outcome

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini", "weight": 0.7},
        {"model": "openai/gpt-4o",      "weight": 0.3},
    ],
    goal_id="task_complexity_routing_v2"
)


def complete_with_feedback(prompt: str, validator=None) -> dict:
    response, request_id = router.completion(
        messages=[{"role": "user", "content": prompt}],
        return_request_id=True
    )

    content = response.choices[0].message.content

    # Validate output if validator provided
    success = True
    if validator is not None:
        success = validator(content)

    router.report_outcome(
        request_id=request_id,
        outcome=Outcome.SUCCESS if success else Outcome.FAILURE
    )

    return {"content": content, "request_id": request_id, "success": success}


# For an extraction task, validate the extracted value looks reasonable
def is_valid_company_name(output: str) -> bool:
    # Non-empty, no special chars, reasonable length
    return (
        len(output.strip()) > 0
        and len(output.strip()) < 100
        and not re.search(r"[<>{}\[\]]", output)
    )


result = complete_with_feedback(
    "Extract the company name from: 'I work at Stripe in Dublin'",
    validator=is_valid_company_name
)
print(result["content"])  # Stripe

What Kalibr's Thompson Sampling Adds

The handwritten TaskComplexityRouter uses deterministic thresholds: the same score always picks the same model. Thompson Sampling is probabilistic: it maintains a success-rate distribution per model, samples from each distribution on every request, and routes to whichever sample wins. Early on, that means occasional exploration of both models; as outcome data accumulates, traffic concentrates on whichever model actually succeeds at your observed task mix. The difference matters at scale: deterministic thresholds lock in your initial guesses, while sampling keeps adapting as the task distribution shifts.
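To make that concrete, here is an illustrative sketch of Beta-Bernoulli Thompson Sampling: the general technique, not Kalibr's actual internals. Each model keeps a Beta posterior over its success rate; each request samples from every posterior and routes to the highest draw:

```python
import random


class ThompsonRouter:
    def __init__(self, models: list[str]):
        # Beta(1, 1) is a uniform prior over each model's success rate.
        self.alpha = {m: 1.0 for m in models}
        self.beta = {m: 1.0 for m in models}

    def select(self) -> str:
        # Sample a plausible success rate for each model, pick the winner.
        draws = {
            m: random.betavariate(self.alpha[m], self.beta[m])
            for m in self.alpha
        }
        return max(draws, key=draws.get)

    def report(self, model: str, success: bool) -> None:
        # Bayesian update: successes raise alpha, failures raise beta.
        if success:
            self.alpha[model] += 1.0
        else:
            self.beta[model] += 1.0


# Simulate: suppose mini succeeds 90% of the time and gpt-4o 97%.
random.seed(0)
router = ThompsonRouter(["gpt-4o-mini", "gpt-4o"])
true_rates = {"gpt-4o-mini": 0.90, "gpt-4o": 0.97}
picks = {"gpt-4o-mini": 0, "gpt-4o": 0}
for _ in range(2000):
    model = router.select()
    picks[model] += 1
    router.report(model, random.random() < true_rates[model])

print(picks)  # traffic concentrates on the model with the higher true rate
```

Notice there are no hand-tuned thresholds anywhere: the exploration/exploitation balance falls out of the posterior widths themselves.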

Summary

Routing Python LLM requests to cheaper models based on task complexity is not complicated: the signals are observable and the code is straightforward. The TaskComplexityRouter class above handles the core problem.

The argument for Kalibr is not that it does something different. It is that it removes the code you have to own and adds the feedback loop that static routing lacks. The complexity signals you would hardcode into rules become implicit in the outcome data. The router observes what works, not what you thought would work.

Start with the heuristics. Move to outcome-based routing when the rules become a maintenance problem.

Kalibr keeps complex AI agents running without human intervention.

Get started free