Self-hosting open models vs an API: where the cost actually crosses over
TL;DR
Self-hosting an open-weight model wins on cost only at high, steady throughput, or when data residency, fine-tuning, or controlled latency force your hand. The per-token API price hides the real comparison: a self-hosted GPU bills you for idle time, ops, evals, and upgrades whether the tokens flow or not. For most teams the answer is a hybrid: self-host the cheap high-volume path, call an API for the hard tail.
The honest answer: self-hosting an open-weight model beats a commercial API on cost only when your throughput is high and steady, or when data residency, fine-tuning, or tight latency control take the decision out of your hands. The per-token API price you see on a pricing page is the wrong number to compare against, because a self-hosted GPU charges you for every idle second, plus ops time, evals, and upgrades. Below is how I actually model the crossover before recommending one to a client.
Why isn't the API per-token price the real comparison?
A commercial API gives you marginal pricing: you pay per token in and per token out, and nothing when you're idle. That property is worth a lot. It means your cost line tracks demand exactly, with no committed spend.
A self-hosted model inverts this. You rent or own a GPU, and it bills the same whether it's saturated or sat at 4% utilisation overnight. So the comparison isn't "API token price vs open-model token price" — it's "API token price vs the fully-loaded cost of a GPU divided by the tokens you actually push through it." Utilisation is the whole game.
The fully-loaded cost of self-hosting includes:
- Compute — GPU rental (an H100 is typically in the region of $2–4/hr on-demand, less on committed or spot) or amortised purchase plus power and rack.
- Idle capacity — the gap between provisioned and used. A box sized for peak sits mostly idle off-peak.
- Autoscaling overhead — cold starts on large weights are slow, so you either keep warm headroom (idle cost) or accept latency spikes.
- Ops and on-call — someone owns serving stack upgrades, driver/CUDA breakage, OOMs, and 3am pages.
- Eval and upgrades — re-benchmarking each time you change model, quantisation, or serving engine.
None of that appears on an API invoice. All of it appears on yours.
Where does self-hosting actually win?
Four situations, and they're fairly specific:
- High, steady throughput. If you can keep a GPU genuinely busy — say 60%+ utilisation around the clock — the per-token economics swing hard in your favour, because you're amortising fixed cost across a large, predictable token volume.
- Data residency / VPC requirements. When the data legally or contractually cannot leave your boundary, "cheaper" stops being the question. Self-hosting (or a model running inside your own cloud VPC) is the only option that keeps prompts and completions off a third party entirely.
- Fine-tuned or specialised models. If a 7B–13B model fine-tuned on your domain matches a frontier model on your task, you're now running a small, cheap model at scale instead of paying frontier prices per call.
- Controlled latency. Dedicated capacity means no noisy-neighbour variance and no provider-side queueing. If your p99 latency budget is tight and non-negotiable, owning the serving path is sometimes the only way to hit it.
Where does the API win?
- Spiky or low volume. If your traffic is bursty or you're doing thousands rather than millions of calls a day, you'll never keep a GPU busy enough to beat marginal API pricing. You'd be paying for idle silicon.
- Frontier quality without the ops burden. The best closed models still lead on the hardest reasoning and long-context tasks, and you get them with zero serving stack to maintain.
- Early-stage / changing requirements. Before you know your real traffic shape and quality bar, committing to GPU infrastructure is premature optimisation.
What does the crossover look like in numbers?
Illustrative, and I'd insist you redo this with your own traffic — but here's the shape. Assume a workload of mixed input/output tokens, and compare a commercial API against a single self-hosted GPU node serving an open-weight model. Self-hosted unit cost is (monthly node cost) ÷ (tokens served that month), so it falls as utilisation rises.
| Scenario | Monthly tokens | API cost (in the region of) | Self-host node cost | Self-host effective rate |
|---|---|---|---|---|
| Low / spiky | 50M | ~$250 | ~$2,200 (1× GPU, ~10% util) | terrible — API wins clearly |
| Medium | 500M | ~$2,500 | ~$2,200 (1× GPU, ~45% util) | roughly level |
| High, steady | 5B | ~$25,000 | ~$4,400 (2× GPU, ~70% util) | self-host wins by ~5x |
The pattern is the point, not the digits: at low volume the GPU's fixed cost dominates and the API is far cheaper; somewhere in the mid-hundreds-of-millions of tokens the lines cross; at billions of steady tokens self-hosting pulls clearly ahead. Add ops headcount (realistically a fraction of an engineer's time, but not zero) and the crossover moves right.
A back-of-envelope I actually use:
# Self-hosted effective $/1M tokens
node_hourly = 3.50 # GPU on-demand, USD/hr
hours_per_month = 730
util = 0.45 # fraction of capacity actually used
throughput_tps = 2500 # tokens/sec at full load for this model+GPU
tokens_per_month = throughput_tps * util * hours_per_month * 3600
node_monthly = node_hourly * hours_per_month
cost_per_1m = node_monthly / (tokens_per_month / 1_000_000)
# Compare cost_per_1m against the API's blended $/1M.
# Then add ~20-40% for ops, eval, idle headroom, autoscale slack.
That ops/idle uplift is the number people forget. A serving rate that looks competitive on paper often isn't once you load the human cost.
What does the data-residency angle change?
It changes the question from cost to feasibility. For clients with trade-secret-grade data or strict regulatory boundaries, the prompts themselves are the asset you're protecting. Running an open-weight model inside your own VPC — or against a cloud-hosted model that contractually stays within your tenancy and region — means sensitive context never transits a third-party endpoint.
Here, the comparison isn't "is self-hosting cheaper", it's "what's the cheapest way to satisfy the constraint". Sometimes that's self-hosting; sometimes it's a managed model offered inside your cloud's VPC so you skip the serving ops while keeping data in-boundary. I'd cost both before assuming you must own GPUs.
What's the hybrid pattern I usually land on?
Route by difficulty. Send the high-volume, well-understood requests to a self-hosted (often fine-tuned) small model where you've already paid for the capacity, and escalate only the hard tail to a frontier API.
def route(request):
if request.is_sensitive:
return self_hosted(request) # never leaves the VPC
if classifier.is_easy(request):
return self_hosted(request) # cheap, high volume, GPU you already own
return frontier_api(request) # hard tail only — pay for quality
In practice the easy path is the bulk of traffic, so you keep your GPU busy (good utilisation) and only pay frontier rates on the small fraction that genuinely needs it. You also get a graceful fallback: if the self-hosted node is unhealthy, spill to the API.
When is self-hosting worth it — and when isn't it?
Worth it when: throughput is high and steady, data must stay in-boundary, a fine-tuned small model hits your quality bar, or you need latency you can guarantee. Not worth it when: traffic is spiky or low, you need frontier quality on hard tasks, you don't yet know your traffic shape, or you lack the appetite to own a serving stack and its on-call.
The trade-off underneath all of it is fixed versus marginal cost. APIs convert inference into pure marginal spend; self-hosting converts it into a fixed asset you must keep busy. Beneath the crossover volume, marginal wins. Above it, fixed wins. The hybrid splits the difference deliberately — and for most teams that's where I'd start, then push more traffic onto self-hosted capacity as your volume and confidence grow.
Related service
AI systems architecture & LLM integration