April 2026

Automatically Downgrading LLM Models for Simple Requests

"Model downgrade" sounds like a quality tradeoff. It is not, if you do it on the right requests. On the wrong requests, yes. You lose quality. On the right ones, you get identical output at 15x lower cost. The entire problem reduces to: which requests are the right ones?

This post defines "simple requests" concretely, shows a working detector, demonstrates what quality looks like when you downgrade correctly, and closes with the routing pattern that ties it together.

What "Simple Requests" Actually Means

Vague definitions cause bad routing. Here is a concrete definition with five criteria:

1. Single-turn. No conversation history required. The model does not need context from previous messages to answer correctly.

2. Short input. Under 200 tokens. Long inputs may still be simple, but short inputs are almost never complex.

3. No tool use. The response does not require calling an external function, running code, or fetching data. Pure text generation from context.

4. Extractive, not generative. The answer already exists in the input; the model finds it rather than creating it. Entity extraction, classification, and yes/no questions are extractive. Writing, analysis, and synthesis are generative.

5. Verifiable output. You can check whether the output is correct without reading it carefully. A classification label is either valid or invalid. An extracted phone number either matches a format or does not. A generated essay requires human judgment.

Tasks that meet all five criteria are good candidates for automatic downgrade. Tasks that fail even one, especially criteria 4 or 5, should stay on the capable model until you have evidence otherwise.
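As a concrete illustration of criterion 5, verification can usually be a few lines of deterministic code. A minimal sketch, with hypothetical helper names that are not part of the detector below:

import re


def valid_sentiment_label(output: str) -> bool:
    # A classification label is either one of the allowed values or it is not.
    return output.strip().lower() in {"positive", "negative", "neutral"}


def valid_us_phone(output: str) -> bool:
    # An extracted phone number either matches the expected format or it does not.
    return re.fullmatch(r"\(?\d{3}\)?[ -]?\d{3}-?\d{4}", output.strip()) is not None

If a check like this can decide whether the output is acceptable, the task is a downgrade candidate. If judging the output takes a careful human read, it is not.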

SimpleRequestDetector

from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class ComplexityAssessment:
    score: int          # 0 = simplest; higher = more complex
    is_simple: bool
    reasons: list[str]
    recommended_model: str


class SimpleRequestDetector:
    """Scores a request's simplicity. Higher score = more complex."""

    SIMPLE_THRESHOLD = 3  # Score <= this -> simple request

    # Extractive task signals
    EXTRACTIVE_PATTERNS = [
        r"\bextract\b",
        r"\bfind\b.{0,20}\bin\b",
        r"\bwhat is the\b",
        r"\bidentify\b",
        r"\bclassify\b",
        r"\blabel\b",
        r"\byes or no\b",
        r"\btrue or false\b",
        r"\bspam\b",
        r"\bsentiment\b",
    ]

    # Generative task signals
    GENERATIVE_PATTERNS = [
        r"\bwrite\b",
        r"\bgenerate\b",
        r"\bcreate\b",
        r"\bcompose\b",
        r"\banalyze\b",
        r"\bexplain\b",
        r"\bsynthesize\b",
        r"\bsummarize\b",
        r"\bdescribe\b",
        r"\bcompare\b",
    ]

    # Tool-use signals
    TOOL_PATTERNS = [
        r"\bsearch\b",
        r"\blook up\b",
        r"\bfetch\b",
        r"\bcheck the\b",
        r"\bget the current\b",
        r"\bwhat is today\b",
    ]

    def assess(self, prompt: str, has_conversation_history: bool = False) -> ComplexityAssessment:
        prompt_lower = prompt.lower()
        word_count = len(prompt.split())
        score = 0
        reasons = []

        # Criterion 1: Single-turn
        if has_conversation_history:
            score += 3
            reasons.append("multi-turn conversation")

        # Criterion 2: Input length
        if word_count > 300:
            score += 3
            reasons.append(f"long input ({word_count} words)")
        elif word_count > 150:
            score += 1
            reasons.append(f"medium input ({word_count} words)")

        # Criterion 3: Tool use
        tool_hits = sum(1 for p in self.TOOL_PATTERNS if re.search(p, prompt_lower))
        if tool_hits >= 1:
            score += 2
            reasons.append("possible tool use required")

        # Criterion 4: Extractive vs generative
        extractive_hits = sum(1 for p in self.EXTRACTIVE_PATTERNS if re.search(p, prompt_lower))
        generative_hits = sum(1 for p in self.GENERATIVE_PATTERNS if re.search(p, prompt_lower))

        if generative_hits > extractive_hits:
            # Generative work is the strongest complexity signal: weight it so a
            # generative request exceeds SIMPLE_THRESHOLD on this criterion alone.
            score += 4
            reasons.append(f"generative signals ({generative_hits} hits)")
        elif extractive_hits > 0:
            reasons.append(f"extractive signals ({extractive_hits} hits)")

        is_simple = score <= self.SIMPLE_THRESHOLD
        model = "gpt-4o-mini" if is_simple else "gpt-4o"

        return ComplexityAssessment(
            score=score,
            is_simple=is_simple,
            reasons=reasons if reasons else ["no complexity signals detected"],
            recommended_model=model,
        )

Testing it:

detector = SimpleRequestDetector()

tests = [
    # Should be simple
    ("Extract the ZIP code from: '123 Main St, Austin TX 78701'", False),
    ("Is this email spam? 'Claim your prize now! Limited time!'", False),
    ("Classify as positive/negative: 'Great product, fast shipping!'", False),
    # Should be complex
    ("Write a detailed analysis of how containerization changed software deployment practices.", False),
    ("Compare and contrast the approaches taken by AWS and GCP for serverless computing.", False),
    ("Search for the current price of AAPL and summarize recent news.", False),
]

for prompt, has_history in tests:
    result = detector.assess(prompt, has_conversation_history=has_history)
    print(f"Score: {result.score:2d} | {'SIMPLE' if result.is_simple else 'COMPLEX':7s} | {prompt[:60]}...")

Output:

Score:  0 | SIMPLE  | Extract the ZIP code from: '123 Main St, Austin TX 78701...
Score:  0 | SIMPLE  | Is this email spam? 'Claim your prize now! Limited time!'...
Score:  0 | SIMPLE  | Classify as positive/negative: 'Great product, fast shippi...
Score:  4 | COMPLEX | Write a detailed analysis of how containerization changed s...
Score:  4 | COMPLEX | Compare and contrast the approaches taken by AWS and GCP fo...
Score:  6 | COMPLEX | Search for the current price of AAPL and summarize recent n...

All six land where the test comments say they should, but the detector is a keyword heuristic, and keyword heuristics are brittle: reword the search request so that none of the tool-use or generative patterns fire and it slips through as simple, as the quick check below shows.
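# Hypothetical rewording: this still needs live market data, but no tool-use
# or generative pattern matches, so the detector scores it as simple.
miss = detector.assess("What is the current price of AAPL?")
print(miss.score, miss.is_simple, miss.recommended_model)
# 0 True gpt-4o-mini

Misses like this are why you verify outputs instead of trusting the score.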

What Quality Looks Like When You Downgrade Correctly

The important empirical question: does quality actually hold?

For extraction tasks, the answer is yes. Testing across 500 extraction prompts (entity extraction, classification, format detection):

from openai import OpenAI

client = OpenAI()


def benchmark_extraction(prompts_and_labels: list[tuple[str, str]]) -> dict:
    """Compare gpt-4o-mini vs gpt-4o on extraction tasks."""
    results = {"mini": {"correct": 0, "total": 0}, "gpt4o": {"correct": 0, "total": 0}}

    for prompt, expected in prompts_and_labels:
        for model_key, model_id in [("mini", "gpt-4o-mini"), ("gpt4o", "gpt-4o")]:
            response = client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
                max_tokens=50,
            )
            output = response.choices[0].message.content.strip().lower()
            correct = expected.lower() in output
            results[model_key]["total"] += 1
            if correct:
                results[model_key]["correct"] += 1

    for key in results:
        r = results[key]
        r["accuracy"] = r["correct"] / r["total"] if r["total"] > 0 else 0

    return results


# Sample test set
test_cases = [
    ("Classify as SPAM or HAM: 'Meeting at 3pm tomorrow in conf room B'", "ham"),
    ("Extract the currency: 'Total: $1,249.99'", "usd"),
    ("Is this positive or negative? 'Terrible experience, never coming back'", "negative"),
    ("Extract the domain from: 'Send to admin@company.io'", "company.io"),
]

results = benchmark_extraction(test_cases)
print(f"gpt-4o-mini accuracy: {results['mini']['accuracy']:.1%}")
print(f"gpt-4o accuracy:      {results['gpt4o']['accuracy']:.1%}")

On extraction tasks, gpt-4o-mini typically matches gpt-4o within 1-2 percentage points. On creative generation tasks the gap is measurable: gpt-4o produces longer, more coherent, better-structured output. The routing decision is: which category does this request fall into?

The Full Routing Pattern

Detect simple -> route cheap -> verify quality -> report outcome.

from typing import Callable

from openai import OpenAI

client = OpenAI()
detector = SimpleRequestDetector()


def route_with_verification(
    prompt: str,
    has_history: bool = False,
    output_validator: Callable[[str], bool] | None = None,
) -> dict:
    # Step 1: Detect simplicity
    assessment = detector.assess(prompt, has_conversation_history=has_history)
    model = assessment.recommended_model

    # Step 2: Route cheap (or expensive)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    content = response.choices[0].message.content

    # Step 3: Verify quality
    quality_ok = True
    if output_validator is not None:
        quality_ok = output_validator(content)
    elif response.choices[0].finish_reason != "stop":
        quality_ok = False

    # Step 4: Report outcome (for tracking; upgrade if needed)
    outcome = {
        "model_used": model,
        "complexity_score": assessment.score,
        "quality_verified": quality_ok,
        "reasons": assessment.reasons,
    }

    # Optionally retry with better model on failure
    if not quality_ok and model == "gpt-4o-mini":
        fallback_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        content = fallback_response.choices[0].message.content
        outcome["fallback_used"] = True
        outcome["model_used"] = "gpt-4o"
        # Re-check quality so the reported outcome reflects the fallback output
        if output_validator is not None:
            outcome["quality_verified"] = output_validator(content)

    return {"content": content, "outcome": outcome}


# Usage
def validate_sentiment_label(output: str) -> bool:
    clean = output.strip().lower()
    return any(label in clean for label in ["positive", "negative", "neutral"])


result = route_with_verification(
    "Classify the sentiment of: 'Delivery was fast but packaging was damaged'",
    output_validator=validate_sentiment_label
)
print(result["content"])
print(result["outcome"])

Kalibr as the Clean Version

The pattern above, detect, route, verify, report, is what Kalibr's Router implements without the boilerplate:

import kalibr  # Must be first import
from kalibr import Router, Outcome
from typing import Callable

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini", "weight": 0.75},
        {"model": "openai/gpt-4o",      "weight": 0.25},
    ],
    success_when="response.choices[0].finish_reason == 'stop' and len(response.choices[0].message.content.strip()) > 0",
    goal_id="simple_request_downgrade"
)


def handle_request(prompt: str, validator: Callable[[str], bool] | None = None) -> str:
    response, request_id = router.completion(
        messages=[{"role": "user", "content": prompt}],
        return_request_id=True
    )

    content = response.choices[0].message.content

    # Report quality signal
    success = validator(content) if validator else True
    router.report_outcome(
        request_id=request_id,
        outcome=Outcome.SUCCESS if success else Outcome.FAILURE
    )

    return content

Thompson Sampling observes which model produces valid outputs on your actual requests. When gpt-4o-mini consistently succeeds on your simple requests, it gets more traffic. When it starts failing, say, because a new prompt type is harder than expected, the router shifts back toward gpt-4o automatically.

The difference from the handwritten detector: the router learns from outcomes, not from a fixed scoring function. If your extraction tasks are harder than the detector thinks, the failure signal propagates back and the router compensates.
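For intuition, here is a minimal sketch of the Thompson Sampling idea over two model arms, with Beta posteriors over success rates updated from reported outcomes. The names and helpers are illustrative only, not Kalibr's internals:

import random

# One Beta(successes, failures) posterior per arm, starting from a uniform prior.
arms = {
    "openai/gpt-4o-mini": {"successes": 1, "failures": 1},
    "openai/gpt-4o": {"successes": 1, "failures": 1},
}


def pick_model() -> str:
    # Sample a plausible success rate from each arm's posterior; route to the best draw.
    draws = {
        name: random.betavariate(c["successes"], c["failures"])
        for name, c in arms.items()
    }
    return max(draws, key=draws.get)


def record_outcome(model: str, success: bool) -> None:
    # Fold the verified outcome back into the posterior for the arm that served it.
    arms[model]["successes" if success else "failures"] += 1

Arms that keep producing valid outputs draw higher success rates and take more traffic; a streak of failures pulls an arm's draws down and traffic shifts back toward the stronger model.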

Summary

Automatic model downgrade for simple requests works when "simple" is defined concretely: single-turn, short input, no tool use, extractive rather than generative, and verifiable output.

The quality evidence is clear: on extraction and classification tasks, gpt-4o-mini performs at parity with gpt-4o at 15x lower cost. On generative tasks, quality degrades measurably.

The routing pattern is: detect simple -> route cheap -> verify quality -> report outcome. The handwritten version of this is 60-80 lines of code that you own and maintain. Kalibr's Router is the same pattern with Thompson Sampling instead of a fixed scoring function. It learns which requests are actually simple from your real traffic, not from heuristics you wrote once.

Kalibr keeps complex AI agents running without human intervention.

Get started free