OvertimeLabs.ai
AI architecture · 7 min read · 30 May 2026

Model routing: stop sending every request to your biggest model

TL;DR

Most production LLM traffic is simple enough for a small, cheap model; only a minority genuinely needs a frontier model. Route requests by rules, a lightweight classifier, or a cheap-first cascade that escalates only on low confidence or failed validation. Done with proper eval gates and per-task quality-per-dollar measurement, this typically cuts spend several-fold while holding answer quality.

If you send every request to your largest model, you are almost certainly overpaying by a wide margin. In most production workloads the bulk of traffic — classification, extraction, short rewrites, routine Q&A — is comfortably handled by a small model costing a fraction as much per token. The job of model routing is to send each request to the cheapest model that still meets your quality bar, and to prove that bar is being met rather than assume it.

The economics are stark. Frontier models such as Claude Opus or GPT-class flagships are commonly priced an order of magnitude or more above their small siblings (Claude Haiku, GPT mini-tier, Gemini Flash). If 80% of your traffic can move from a flagship to a model that costs roughly a tenth as much, your blended cost per request falls to about 28% of the all-flagship baseline (0.8 × 0.1 + 0.2 × 1.0 = 0.28), roughly a 3.6x reduction, even though the hard 20% still hits the expensive model. That is the entire pitch, and it holds up as long as routing never silently makes answers worse.
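The arithmetic behind that claim fits in a few lines. This is a minimal sketch using the illustrative 80% / one-tenth figures from the paragraph above; the function names and the normalised prices (flagship = 1.0) are my assumptions, not real list prices. It also covers the cascade case, where an escalated request pays for both calls.

```python
def routed_cost(small_share: float, small_price: float, large_price: float = 1.0) -> float:
    """Up-front routing: each request is billed by exactly one model."""
    return small_share * small_price + (1.0 - small_share) * large_price


def cascade_cost(escalation_rate: float, small_price: float, large_price: float = 1.0) -> float:
    """Cheap-first cascade: every request pays the small model,
    and escalated requests pay the large model on top."""
    return small_price + escalation_rate * large_price


# Illustrative inputs only: 80% of traffic is easy, small model costs a tenth.
routed = routed_cost(small_share=0.8, small_price=0.1)
cascaded = cascade_cost(escalation_rate=0.2, small_price=0.1)

print(f"routed:  {routed:.2f} of baseline ({1 / routed:.1f}x cheaper)")
print(f"cascade: {cascaded:.2f} of baseline ({1 / cascaded:.1f}x cheaper)")
```

The cascade comes out slightly more expensive than perfect up-front routing because the cheap call is never free, which is why the escalation rate is the number to watch.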

Which requests actually need a frontier model?

Start by profiling traffic, not by guessing. In my experience three buckets emerge.

The first is genuinely trivial work: intent classification, language detection, field extraction from a known schema, sentiment, short summarisation. A small model handles these at near-flagship accuracy and a fraction of the cost.

The second is the messy middle: multi-step reasoning, code with subtle edge cases, long-context synthesis, ambiguous instructions. This is where the big model earns its price.

The third is everything that looks hard but isn't — long prompts that are mostly boilerplate, or "complex" tasks that are really a chain of simple ones. Decomposing these into small steps often lets a cheap model do most of the work, with the expensive model reserved for the one genuinely hard sub-task.

You cannot route until you know roughly what proportion of your traffic falls where. Sample a few hundred real requests, label them, and you have your routing budget.
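Turning a labelled sample into a traffic mix is a counting exercise. A minimal sketch, assuming each sampled request has been hand-labelled with one of three bucket names; the labels and the ten-item sample are made up for illustration.

```python
from collections import Counter


def traffic_mix(labels: list[str]) -> dict[str, float]:
    """Share of sampled requests that fall into each difficulty bucket."""
    if not labels:
        raise ValueError("need at least one labelled request")
    counts = Counter(labels)
    return {bucket: n / len(labels) for bucket, n in counts.items()}


# Hypothetical labels; a real sample would be a few hundred requests.
sample = ["trivial"] * 6 + ["hard"] * 2 + ["decomposable"] * 2
mix = traffic_mix(sample)

for bucket, share in sorted(mix.items(), key=lambda item: -item[1]):
    print(f"{bucket:<13} {share:.0%}")
```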

What are the routing strategies, and when does each fit?

There are three practical approaches, in ascending order of sophistication and effort.

Heuristic / rules. Route on cheap signals you already have: prompt length in tokens, the endpoint or feature that produced the request, presence of code, user tier, whether tools are required. Rules are transparent, free to run, and easy to debug. They are also brittle: they encode your assumptions, not the request's actual difficulty. Start here; most teams get 60–70% of the benefit from a dozen rules.
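A rules router can be as plain as an ordered list of checks. The sketch below is one possible shape, not a recommended rule set: the `Request` fields, the feature names, the code markers, and the 8,000-character cut-off are all assumptions you would replace with signals from your own traffic profile.

```python
from dataclasses import dataclass


@dataclass
class Request:
    feature: str              # endpoint or product feature that produced the request
    prompt: str
    needs_tools: bool = False


# Hypothetical feature names, for illustration only.
CHEAP_FEATURES = {"intent_classification", "language_detection"}
CODE_MARKERS = ("def ", "class ", "#include", "function ")
LONG_PROMPT_CHARS = 8000      # assumed cut-off; tune against real traffic


def rule_tier(req: Request) -> str:
    """Pick a tier from cheap signals only; the first matching rule wins."""
    if req.feature in CHEAP_FEATURES:
        return "small"
    if req.needs_tools:
        return "large"
    if any(marker in req.prompt for marker in CODE_MARKERS):
        return "large"
    if len(req.prompt) > LONG_PROMPT_CHARS:
        return "large"
    return "small"


print(rule_tier(Request("intent_classification", "Where is my order?")))
print(rule_tier(Request("chat", "def parse(x): ...  why does this fail?")))
```

Keeping the rules in one ordered function makes the brittleness visible: every line is an assumption you can test against the labelled sample.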

A small classifier. Train or prompt a tiny model to predict difficulty (or the right tier) before the main call. This catches cases rules miss — a short prompt that is conceptually hard, a long one that is trivial. The classifier itself must be cheap relative to the savings, so a small model or even an embedding-plus-logistic-regression layer is the right tool, not another flagship call.
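To show how little machinery the classifier needs, here is a dependency-free logistic regression trained with plain stochastic gradient descent. It is a toy under stated assumptions: the two-number feature vectors stand in for real embeddings, the four training rows are invented, and in practice you would more likely reach for a library implementation than hand-roll the training loop.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def p_hard(x: list[float], weights: list[float], bias: float) -> float:
    """Predicted probability that a request needs the large tier."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(features: list[list[float]], labels: list[int],
          epochs: int = 500, lr: float = 0.5) -> tuple[list[float], float]:
    """Fit logistic regression with per-example gradient steps."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = p_hard(x, weights, bias) - y      # gradient of log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Invented stand-ins for embeddings: 0 = easy request, 1 = hard request.
X = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [0.8, 0.9]]
y = [0, 0, 1, 1]
weights, bias = train(X, y)

tier = "large" if p_hard([0.85, 0.95], weights, bias) >= 0.5 else "small"
print(tier)
```

The 0.5 cut-off is itself a routing decision: lowering it sends more traffic to the large tier and trades cost for safety.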

A cascade. Send everything to the cheap model first. Accept its answer only if it passes a check — a confidence signal, structured-output validation, or a fast verifier — and escalate to the bigger model only on failure. This is the most robust pattern because it routes on the actual output, not a prediction of difficulty. The cost is added latency on the escalation path and the need for a reliable check.

In practice I mix these: rules as a coarse first filter, a cascade for the ambiguous remainder.

What does the routing logic actually look like?

Here is the shape of a cheap-first cascade with a validation gate. The principle that matters: escalation is driven by a check on the output, not faith.

def route(request) -> Response:
    # 1. Cheap rules first: obvious cases skip the cascade and stay on the small tier
    force_small = request.feature in CHEAP_FEATURES or request.token_count < 200

    # 2. Try the cheap model (the cascade always starts cheap)
    resp = call(MODELS["small"], request)

    # 3. Quality gate (structural + confidence); rule-matched requests bypass it
    if force_small or passes_gate(resp, request):
        return resp

    # 4. Escalate only on failure; log first so a failed large call is still recorded
    log_escalation(request, reason="gate_failed")
    return call(MODELS["large"], request)
 
 
def passes_gate(resp, request) -> bool:
    # Structured-output validation: must parse against the expected schema
    if request.expects_json and not valid_schema(resp.text, request.schema):
        return False
    # Confidence / self-report threshold (calibrated against evals, not trusted raw)
    if resp.logprob_confidence is not None and resp.logprob_confidence < 0.75:
        return False
    # Cheap refusal / empty-answer detection
    if is_refusal_or_empty(resp.text):
        return False
    return True

The gate is the load-bearing part. A cascade without a real check is just a cheap model pretending to be a good one.

How do I stop routing from silently degrading answers?

This is the failure mode that kills routing projects: cost drops, nobody notices quality slipped, and three weeks later support tickets spike. Three guardrails prevent it.

An offline eval gate. Maintain a labelled eval set per task type — a few hundred examples with known-good outputs or a graded rubric. Before any routing change ships, run both the small and large models against it and compare. If the small model is within an acceptable delta on that task, it earns the traffic; if not, it doesn't. This is non-negotiable and it is the single highest-leverage piece of the whole exercise.
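In code, the offline gate reduces to comparing two accuracies against an agreed tolerance. A minimal sketch, assuming exact-match grading and a hypothetical `max_delta` of two percentage points; a rubric-graded task would swap the equality check for a scoring function.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy against known-good outputs."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def small_model_earns_traffic(small_preds: list[str], large_preds: list[str],
                              gold: list[str], max_delta: float = 0.02) -> bool:
    """True if the small model is within `max_delta` accuracy of the large one."""
    gap = accuracy(large_preds, gold) - accuracy(small_preds, gold)
    return gap <= max_delta


# Invented ten-example eval set for one task type.
gold = ["refund", "billing", "refund", "other", "billing",
        "refund", "other", "billing", "refund", "other"]
large_preds = list(gold)                    # large model gets all ten right
small_preds = list(gold[:9]) + ["refund"]   # small model misses the last one

print(small_model_earns_traffic(small_preds, large_preds, gold))
```

With these made-up numbers the gap is ten points, so the small model does not earn this task's traffic at the default tolerance.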

Confidence thresholds, calibrated. Raw model self-confidence is unreliable. Calibrate the threshold against your eval set — find the confidence level below which the small model's accuracy actually falls off, and escalate there.
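One simple way to calibrate is to scan candidate cut-offs on the eval set and keep the lowest one whose accepted answers still meet a target accuracy. This is a sketch of that idea, not the only calibration method; the `(confidence, was_correct)` records and the 90% target are illustrative assumptions.

```python
from typing import Optional


def calibrate_threshold(records: list[tuple[float, bool]],
                        target_accuracy: float) -> Optional[float]:
    """Lowest confidence cut-off at which answers with confidence >= cut-off
    reach `target_accuracy` on the eval set; None if no cut-off does."""
    for threshold in sorted({conf for conf, _ in records}):
        accepted = [ok for conf, ok in records if conf >= threshold]
        if sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    return None


# Invented eval results for the small model: (confidence, was the answer correct?)
records = [(0.55, False), (0.60, True), (0.70, False),
           (0.80, True), (0.90, True), (0.95, True)]

threshold = calibrate_threshold(records, target_accuracy=0.9)
print(threshold)
```

On this toy data the low-confidence answers are the wrong ones, so the scan settles on the first cut-off that excludes them. Real data is noisier, which is why the eval set needs to be large enough for the accepted slice to be meaningful.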

Structured-output validation. When you expect JSON, a tool call, or a constrained format, validate it. A schema parse failure is a free, deterministic escalation signal — no model judgement required.
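A deterministic check of this kind needs nothing beyond the standard library. The sketch below is a deliberately small stand-in for full schema validation: it assumes the expected output is a JSON object and checks only that required keys exist with the right Python types. The function name mirrors the helper used in the routing sketch earlier, but this body is my own illustration.

```python
import json


def valid_schema(text: str, required: dict[str, type]) -> bool:
    """True if `text` parses as a JSON object containing every required key
    with a value of the expected type."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(key in data and isinstance(data[key], expected)
               for key, expected in required.items())


# Hypothetical schema for an intent-classification response.
schema = {"intent": str, "score": float}

print(valid_schema('{"intent": "refund", "score": 0.9}', schema))  # well-formed
print(valid_schema('{"intent": "refund"', schema))                 # truncated JSON
```

For nested or stricter formats a dedicated schema validator is the better tool; the point here is only that a parse failure costs nothing to detect.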

Then keep measuring in production: track the escalation rate, review a sample of cheap-model answers with a human or an LLM judge, and alert if either the escalation rate or the judge score drifts.
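A drift alert on escalation rate can start as a rolling window compared against the rate you measured offline. A minimal sketch: the class name, the window size, the 20% baseline, and the five-point tolerance are all assumptions to tune, and a real deployment would emit a metric rather than return a boolean.

```python
from collections import deque


class EscalationMonitor:
    """Rolling escalation rate over the last `window` requests."""

    def __init__(self, window: int = 1000, baseline: float = 0.20,
                 tolerance: float = 0.05) -> None:
        self.events: deque = deque(maxlen=window)
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, escalated: bool) -> None:
        self.events.append(escalated)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifted(self) -> bool:
        """Only alert once the window is full, to avoid noise at start-up."""
        window_full = len(self.events) == self.events.maxlen
        return window_full and abs(self.rate - self.baseline) > self.tolerance


monitor = EscalationMonitor(window=10)
for escalated in [True, False] * 5:      # made-up traffic: half the requests escalate
    monitor.record(escalated)

print(monitor.rate, monitor.drifted())
```

The check is two-sided on purpose: a falling escalation rate can mean the gate has stopped catching bad answers, which is the silent failure this section is about.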

How do I mix providers, and what is each good at?

Routing across providers buys you both cost-efficiency and resilience, at the price of more integration surface. Rough, current-as-of-writing characterisations:

  • Anthropic Claude — strong on long-context reasoning, code, and instruction-following; Haiku is a capable cheap tier, Sonnet a strong mid, Opus the heavyweight.
  • OpenAI GPT — broad capability and a wide tier ladder from mini to flagship; strong tool-use ecosystem.
  • Google Gemini — very large context windows and competitive Flash-tier pricing for high-volume cheap work.
  • Groq — not a model maker but an inference provider running open-weight models at very high tokens-per-second; the right home for latency-sensitive cheap-tier traffic.
  • Open-weight models (Llama, Qwen, Mistral) — self-hosted or via a provider, they win when data residency, fixed cost at scale, or fine-tuning control matter more than absolute peak quality.

The trap is coupling your routing logic to one provider's SDK. Put a thin abstraction over the call so a "tier" maps to a (provider, model) pair you can change in config, not in code.
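One way to keep that abstraction thin is a tier table plus one adapter function per provider. In this sketch the provider and model names are placeholders, and the adapters are stubs standing in for whatever SDK calls you actually use; nothing here reflects a real vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Target:
    provider: str
    model: str


# Placeholder identifiers; in practice this table lives in config.
TIERS: Dict[str, Target] = {
    "small": Target("provider_a", "small-model-v1"),
    "large": Target("provider_b", "large-model-v1"),
}

# Each adapter hides one provider's SDK behind the same (model, prompt) -> text shape.
ProviderCall = Callable[[str, str], str]


def make_dispatcher(tiers: Dict[str, Target],
                    clients: Dict[str, ProviderCall]) -> Callable[[str, str], str]:
    def dispatch(tier: str, prompt: str) -> str:
        target = tiers[tier]
        return clients[target.provider](target.model, prompt)
    return dispatch


# Stub adapters so the sketch runs without any network access.
clients: Dict[str, ProviderCall] = {
    "provider_a": lambda model, prompt: f"[a/{model}] {prompt}",
    "provider_b": lambda model, prompt: f"[b/{model}] {prompt}",
}

dispatch = make_dispatcher(TIERS, clients)
print(dispatch("small", "classify this ticket"))
```

Moving the small tier to another provider then means editing one row of `TIERS`, with routing code untouched.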

How do I handle provider outages and rate limits?

Routing and failover share the same dispatch layer, so build them together. Wrap each tier with retry-with-backoff on 429/5xx, and define a fallback chain per tier: if the primary small model rate-limits, fall through to an equivalent small model on another provider before degrading the user experience. Keep the fallbacks quality-equivalent — failing over from a flagship to a tiny model during an outage is how you get the silent-degradation problem back. Track per-provider error rates so a struggling provider sheds traffic automatically rather than after a human notices.
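The retry-then-fall-through behaviour described above can be sketched as two nested loops. Assumptions to flag: `RetryableError` is a stand-in for however your client surfaces a 429 or 5xx, each entry in the chain is a hypothetical callable for one quality-equivalent provider, and the delay values are arbitrary.

```python
import random
import time
from typing import Callable, Optional, Sequence


class RetryableError(Exception):
    """Stand-in for a provider's 429 or 5xx response."""


def call_with_fallback(chain: Sequence[Callable[[str], str]], prompt: str,
                       retries: int = 2, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each provider in order, retrying with jittered exponential backoff
    before falling through to the next one."""
    last_error: Optional[Exception] = None
    for call in chain:
        for attempt in range(retries + 1):
            try:
                return call(prompt)
            except RetryableError as exc:
                last_error = exc
                if attempt < retries:
                    # Delay doubles each attempt; jitter spreads out retry storms.
                    sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    raise RuntimeError("all providers in the fallback chain failed") from last_error


def rate_limited(prompt: str) -> str:
    raise RetryableError("429 from primary")


def healthy(prompt: str) -> str:
    return f"answer to: {prompt}"


# sleep is injected as a no-op so the example runs instantly.
print(call_with_fallback([rate_limited, healthy], "hello", sleep=lambda s: None))
```

Only errors marked retryable trigger the fallback; anything else propagates immediately, so a malformed request does not get replayed across every provider.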

What should I actually optimise for?

Not raw token price — quality-per-dollar on your tasks. A model that is half the price but needs two retries and escalates 40% of the time may cost more, all-in, than the obvious choice. The right metric is something like accuracy (on your eval set) per dollar of total spend, computed per task type, including escalation and retry overhead. Optimise that and the routing falls out of the numbers rather than out of vibes.
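A per-task version of that metric might look like the sketch below. The field names and every number are invented to illustrate the comparison; the only thing taken from the text is the definition, accuracy divided by total spend including retries and escalations.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    requests: int
    correct: int              # correct final answers on the eval set
    first_call_cost: float    # dollars spent on initial calls
    retry_cost: float         # dollars spent on retries
    escalation_cost: float    # dollars spent on escalated calls

    @property
    def total_cost(self) -> float:
        return self.first_call_cost + self.retry_cost + self.escalation_cost

    @property
    def accuracy(self) -> float:
        return self.correct / self.requests

    @property
    def accuracy_per_dollar(self) -> float:
        return self.accuracy / self.total_cost


# Made-up numbers for one task type: the cheap option retries and escalates a lot.
cheap = TaskStats(requests=1000, correct=930,
                  first_call_cost=1.0, retry_cost=2.0, escalation_cost=4.0)
mid_tier = TaskStats(requests=1000, correct=950,
                     first_call_cost=5.0, retry_cost=0.0, escalation_cost=0.0)

for name, stats in (("cheap", cheap), ("mid-tier", mid_tier)):
    print(f"{name:<9} cost=${stats.total_cost:.2f} "
          f"accuracy={stats.accuracy:.1%} per-dollar={stats.accuracy_per_dollar:.3f}")
```

In this invented case the model with the lower sticker price loses once overhead is counted, which is the scenario the paragraph above warns about.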

When is this worth it — and when not?

Model routing pays off when you have meaningful volume (the engineering cost amortises over millions of requests, not thousands), a real spread of difficulty in your traffic, and tasks you can actually evaluate. At those volumes a several-fold reduction in spend is typical and the eval/guardrail investment is cheap by comparison.

It is not worth it when volume is low — at a few thousand requests a month, just use a good mid-tier model and spend your time elsewhere. It is also a poor fit when every request is genuinely hard (routing has nothing to route to) or when you have no way to measure quality, because then you are flying blind and the silent-degradation risk outweighs the savings. Build the eval harness first; if you can't, don't route.

FAQ

How much can model routing actually save?

It depends on your traffic mix, but if around 80% of requests can move to a small model costing roughly a tenth as much per token, blended cost typically falls several-fold. Frame any figure as illustrative until you've measured your own escalation rate and retries.

Won't a cheaper model just give worse answers?

Only if you route blind. With an offline eval gate per task type, calibrated confidence thresholds, and structured-output validation, a request reaches the cheap model only when it's proven adequate there. The cascade escalates automatically when the gate fails.

Should I use rules, a classifier, or a cascade?

Start with rules — a dozen get you most of the benefit. Add a small classifier to catch difficulty that rules miss, and use a cascade (cheap first, escalate on failed validation) for ambiguous traffic, since it routes on the actual output rather than a prediction.
