Cutting p95 latency without new hardware
TL;DR
You can usually cut p95 tail latency by 40% or more without adding hardware. Measure with distributed tracing, relieve the database with the right indexes and key-set pagination, cap concurrency and add circuit breakers, then shed load gracefully. On one checkout API this took p95 from 1,500ms to 230ms with zero new instances.
Most "we need a bigger instance" conversations are premature. Tail latency is almost always a query plan, an index, a pagination pattern or an unbounded-concurrency problem, and all four are cheaper to fix than to scale around. Here's the playbook I used to take a checkout API from a 1,500ms p95 to under its 250ms target at 500 RPS, on the same hardware.
What are we actually trying to hit?
The objective on this job was concrete: bring checkout API p95 from 1,500ms to 250ms under 500 RPS, without adding hardware.
Why p95 and not the average: a p95 above 300ms correlated with a 12% uplift in cart abandonment across a 4.2M-session sample. The tail is the revenue, so the tail is the target.
Why it matters in general:
- User perception is driven by tail latency, not the average.
- Cost: hardware scaling is 5–15× more expensive than query and index fixes.
- Ops: bursting to mask a slow path just hides the bug and adds noisy-neighbour risk.
Step 1 — Measure (15 minutes)
You cannot fix what you cannot see. Add distributed tracing and look at the waterfall before touching anything.
# FastAPI example
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()  # or your existing application object
FastAPIInstrumentor.instrument_app(app, excluded_urls="health")  # skip health checks

Export spans to Grafana Tempo, then plot p50, p95 and p99 for each span. Quick heuristic: if p99 > 2 × p95 or p95 > 3 × p50, you have jitter bombs, meaning a few requests are hitting a slow path rather than everything slowing down uniformly.
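To make the heuristic concrete, here is a minimal sketch that applies it to a list of span durations. The function name `tail_report` and the idea of pulling raw durations out of your tracing backend are assumptions for illustration; in practice you would more likely read the percentiles straight from your dashboards.

```python
from statistics import quantiles


def tail_report(durations_ms):
    """Compute p50/p95/p99 and apply the spiky-tail heuristic.

    Hypothetical helper: durations_ms is a list of span durations
    in milliseconds, however you export them.
    """
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    # Heuristic from the text: p99 > 2 x p95 or p95 > 3 x p50
    spiky = p99 > 2 * p95 or p95 > 3 * p50
    return {"p50": p50, "p95": p95, "p99": p99, "spiky_tail": spiky}


# 94 fast requests and 6 that hit a slow path
sample = [100.0] * 94 + [2000.0] * 6
report = tail_report(sample)
```

If `spiky_tail` comes back true, look at the slowest traces individually before changing anything that affects every request.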
Step 2 — Relieve the database (1–2 hours)
This is where most of the win lives. The three highest-yield moves:
| Symptom | Action | p95 before | p95 after |
|---|---|---|---|
| Seq scan on (customer_id, created_at) | CREATE INDEX idx_orders_cust_created ON orders(customer_id, created_at DESC); | 890ms | 320ms |
| Large OFFSET usage | Switch to key-set pagination (WHERE id < :cursor) | 650ms | 220ms |
| Hot lookup (SELECT * FROM settings WHERE tenant_id = ?) | Redis cache (EX 10) | 2ms | 350µs |
The goal is for hot queries to be served entirely from memory: EXPLAIN (ANALYZE, BUFFERS) should report shared-buffer hits and zero shared reads.
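The key-set pagination row in the table is the one people most often ask me to spell out. Below is a minimal sketch using an in-memory SQLite database so it runs anywhere; the `orders` columns, the `fetch_page` helper and the page size are assumptions for illustration, not the schema from this job. The same `WHERE id < :cursor ORDER BY id DESC LIMIT n` shape applies in PostgreSQL.

```python
import sqlite3


def fetch_page(conn, cursor_id=None, page_size=3):
    """Return (rows, next_cursor), newest first, without using OFFSET."""
    if cursor_id is None:
        # First page: no cursor yet, start from the newest row
        rows = conn.execute(
            "SELECT id, total FROM orders ORDER BY id DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    else:
        # Later pages: seek past the last id already seen
        rows = conn.execute(
            "SELECT id, total FROM orders WHERE id < ? ORDER BY id DESC LIMIT ?",
            (cursor_id, page_size),
        ).fetchall()
    # A short page means there is nothing left to fetch
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders (id, total) VALUES (?, ?)",
    [(i, i * 10.0) for i in range(1, 8)],
)

page1, cursor1 = fetch_page(conn)
page2, cursor2 = fetch_page(conn, cursor1)
page3, cursor3 = fetch_page(conn, cursor2)
```

Because each page starts with an index seek on `id`, page 500 costs about the same as page 1, whereas OFFSET has to walk and discard every skipped row.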
Step 3 — Control concurrency (30 minutes)
Unbounded concurrency turns a small slowdown into a queue. Cap it.
# PostgreSQL configuration
max_connections = 64
statement_timeout = 700ms
shared_preload_libraries = 'pg_stat_statements'

Pool size ≈ vCPU × 2 (usually 32–64). Set a statement_timeout so a single slow query can't hold a connection forever. Then guard downstream calls. A tight timeout plus bounded retries with backoff is the minimum; a circuit breaker goes on top:
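import httpx
import tenacity

# A 0.5s client timeout stops a slow billing service from holding the request open
client = httpx.Client(timeout=0.5)

# Retry at most 3 times, backing off exponentially between 0.1s and 2s
@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=1, min=0.1, max=2),
)
def call_billing_service(invoice_id):
    try:
        response = client.get(f"https://billing/api/v1/invoice/{invoice_id}")
        response.raise_for_status()
        return response.json()
    except httpx.TimeoutException as exc:
        # Chain the original exception so the root cause stays visible
        raise TimeoutError("Billing service timeout") from exc

Finally, push anything that doesn't need to happen in the request, such as image resizing, PDF generation and emails, onto a queue (RQ / Celery / SQS).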
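Timeouts and retries bound a single call; a circuit breaker goes one step further and stops calling a dependency that keeps failing. Here is a minimal, single-threaded sketch of that idea. The class name, thresholds and injectable clock are all assumptions for illustration, and a production service would more likely use a maintained library and add locking.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to stay open
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like `breaker.call(call_billing_service, invoice_id)`, with one breaker instance per downstream dependency so that a failing billing service does not trip calls to anything else.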
Step 4 — Back-pressure and load shedding (15 minutes)
When you're past capacity, fail fast. Respond with 429/503 early — never after the system has already melted down — and let clients back off exponentially. A token bucket per IP is enough to stop one caller taking down the tail for everyone.
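For the token bucket, a minimal in-process sketch looks something like the following. The class name, the rate and burst values, and keeping state in a plain dict are assumptions for illustration; with several API instances you would likely keep the buckets in a shared store and evict idle IPs.

```python
import time


class TokenBucketLimiter:
    """Hypothetical per-IP token bucket: refill at `rate`/s, hold at most `burst`."""

    def __init__(self, rate=10.0, burst=20, clock=time.monotonic):
        self.rate = rate        # tokens added per second
        self.burst = burst      # bucket capacity
        self.clock = clock      # injectable for tests
        self.buckets = {}       # ip -> (tokens, last_seen)

    def allow(self, ip):
        """Return True to serve the request, False to answer 429."""
        now = self.clock()
        # A caller we have not seen before starts with a full bucket
        tokens, last = self.buckets.get(ip, (float(self.burst), now))
        # Refill for the time elapsed, capped at the bucket size
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

In a request handler you would call `limiter.allow(client_ip)` first and return a 429 with a `Retry-After` header when it comes back false, so the rejection costs almost nothing.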
Step 5 — Verify and set an error budget
Make it stick with an SLO and an alert.
# Prometheus alerting rule
groups:
  - name: latency_slo
    rules:
      - alert: HighLatencyBreach
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.25
        for: 15m
        labels:
          severity: warning

Plot p95 daily, target ≤ 250ms, and write the SLO down ("≤ 1% of requests breach 250ms p95 per week"). Teams that track SLO burn-down fix tails roughly 3× faster.
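To track burn-down you need a number for how much of the budget is gone. The sketch below takes one possible reading of the SLO, namely that at most 1% of requests in a week may be slower than 250ms, and reports how much of that allowance a set of request durations has used. The function name and the input format are assumptions for illustration.

```python
def budget_report(durations_ms, threshold_ms=250.0, budget_fraction=0.01):
    """Report how much of the error budget a window of requests has used.

    Hypothetical helper: durations_ms holds one latency per request,
    budget_fraction is the share of requests allowed over the threshold.
    """
    total = len(durations_ms)
    if total == 0:
        return {"slow_fraction": 0.0, "budget_used": 0.0, "breached": False}
    slow = sum(1 for d in durations_ms if d > threshold_ms)
    slow_fraction = slow / total
    return {
        "slow_fraction": slow_fraction,
        # 1.0 means the whole budget for the window is spent
        "budget_used": slow_fraction / budget_fraction,
        "breached": slow_fraction > budget_fraction,
    }


# 1,000 requests, 5 of them slower than 250ms: half the budget is used
week = [120.0] * 995 + [400.0] * 5
report = budget_report(week)
```

Plotting `budget_used` over the week makes it obvious when a deploy starts eating the budget faster than usual, well before the alert fires.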
What it added up to
On this checkout API: p95 down about 85% (1,500ms → 230ms) with no change in hardware spend. No new instances, just measurement, the right index, key-set pagination, a cache, bounded concurrency and graceful shedding.
If you want a guided trace and query-plan review on your own system, that's a good use of a short call.