OvertimeLabs.ai
Performance · 3 min read · 29 July 2025

Cutting p95 latency without new hardware

TL;DR

You can usually cut p95 tail latency by 40% or more without adding hardware. Measure with distributed tracing, relieve the database with the right indexes and key-set pagination, cap concurrency and add circuit breakers, then shed load gracefully. On one checkout API this took p95 from 1,500ms to 230ms with zero new instances.

Most "we need a bigger instance" conversations are premature. Tail latency is almost always a query-plan, index, pagination or unbounded-concurrency problem, and all four are cheaper to fix than to scale around. Here's the playbook I used to take a checkout API from a 1,500ms p95 to under 250ms at 500 RPS, on the same hardware.

What are we actually trying to hit?

The objective on this job was concrete: bring checkout API p95 from 1,500ms to 250ms under 500 RPS, without adding hardware.

The reason for targeting p95 rather than the average: a p95 above 300ms correlated with a 12% uplift in cart abandonment across a 4.2M-session sample. The tail is the revenue, so the tail is the target.

Why it matters in general:

  • User perception is driven by tail latency, not the average.
  • Cost: hardware scaling is 5–15× more expensive than query and index fixes.
  • Ops: bursting to mask a slow path just hides the bug and adds noisy-neighbour risk.

Step 1 — Measure (15 minutes)

You cannot fix what you cannot see. Add distributed tracing and look at the waterfall before touching anything.

# FastAPI example
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app, excluded_urls="health")

Export spans to Grafana Tempo, then plot p50, p95 and p99 for each span. Quick heuristic: if p99 > 2 × p95 or p95 > 3 × p50, you have jitter bombs — a few requests hitting a slow path rather than a uniform slowdown.
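The heuristic above is easy to apply to raw span durations. The following is a minimal sketch, assuming you can export per-span durations in milliseconds; the function names and the sample data are hypothetical, and the nearest-rank percentile is a simplification of what a tracing backend computes.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def tail_report(durations_ms):
    """Return p50/p95/p99 and whether the ratios point at a slow path."""
    values = sorted(durations_ms)
    p50, p95, p99 = (percentile(values, q) for q in (50, 95, 99))
    return {
        "p50": p50,
        "p95": p95,
        "p99": p99,
        # Same thresholds as the heuristic: p99 > 2 x p95 or p95 > 3 x p50.
        "slow_path_suspected": p99 > 2 * p95 or p95 > 3 * p50,
    }


# Hypothetical span: 90 fast requests plus 10 that hit a slow path.
sample = [40] * 90 + [900] * 10
report = tail_report(sample)
```

On this made-up sample the median stays at 40ms while p95 jumps to 900ms, so the flag is raised by the p95-to-p50 ratio. A uniformly slow span would not trip it.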

Step 2 — Relieve the database (1–2 hours)

This is where most of the win lives. The three highest-yield moves:

| Symptom | Action | p95 before | p95 after |
| --- | --- | --- | --- |
| Seq scan on (customer_id, created_at) | CREATE INDEX idx_orders_cust_created ON orders(customer_id, created_at DESC); | 890ms | 320ms |
| Large OFFSET usage | Switch to key-set pagination (WHERE id < :cursor) | 650ms | 220ms |
| Hot lookup (SELECT * FROM settings WHERE tenant_id = ?) | Redis cache (EX 10) | 2ms | 350µs |

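The goal is for hot queries to be served entirely from shared buffers: in EXPLAIN (ANALYZE, BUFFERS), the Buffers line should show shared hit only, with no shared read.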
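Key-set pagination is the least familiar of the three moves, so here is a minimal sketch of the pattern. It uses an in-memory SQLite table as a stand-in for the real orders table; the schema, the `fetch_page` helper and the page size are assumptions for illustration, not the production query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders (id, customer_id) VALUES (?, ?)",
    [(i, 7) for i in range(1, 11)],
)


def fetch_page(conn, customer_id, cursor=None, page_size=4):
    """Return (ids, next_cursor), newest first, without using OFFSET."""
    if cursor is None:
        rows = conn.execute(
            "SELECT id FROM orders WHERE customer_id = ? "
            "ORDER BY id DESC LIMIT ?",
            (customer_id, page_size),
        ).fetchall()
    else:
        # The cursor replaces OFFSET: seek straight to rows below the last id seen.
        rows = conn.execute(
            "SELECT id FROM orders WHERE customer_id = ? AND id < ? "
            "ORDER BY id DESC LIMIT ?",
            (customer_id, cursor, page_size),
        ).fetchall()
    ids = [row[0] for row in rows]
    # A full page hands back its last id as the next cursor; a short page ends the walk.
    next_cursor = ids[-1] if len(ids) == page_size else None
    return ids, next_cursor


page1, cur = fetch_page(conn, 7)
page2, cur = fetch_page(conn, 7, cur)
page3, cur = fetch_page(conn, 7, cur)
```

Each page costs the same index seek no matter how deep the client scrolls, whereas OFFSET has to walk and discard every skipped row.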

Step 3 — Control concurrency (30 minutes)

Unbounded concurrency turns a small slowdown into a queue. Cap it.

# PostgreSQL configuration
max_connections = 64
statement_timeout = 700ms
shared_preload_libraries = 'pg_stat_statements'
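The same cap can be mirrored in the application, so excess requests wait for a slot instead of piling onto the database. This is a minimal sketch using an `asyncio.Semaphore`; `MAX_IN_FLIGHT`, `run_bounded` and `fake_query` are hypothetical names, and the sleep stands in for a real query.

```python
import asyncio

MAX_IN_FLIGHT = 4  # hypothetical cap; in practice tie it to the pool size


async def run_bounded(jobs, limit=MAX_IN_FLIGHT):
    """Run coroutine functions with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once `limit` jobs are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_query():
    # Stand-in for a database call.
    await asyncio.sleep(0.01)
    return "ok"


results, peak = asyncio.run(run_bounded([fake_query] * 20))
```

All 20 jobs complete, but `peak` never exceeds the cap, which is the property that keeps a slow query from turning into an unbounded queue.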

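Pool size ≈ vCPU × 2 (usually 32–64). Set a statement_timeout so a single slow query can't hold a connection forever. Then guard downstream calls. The example below gives each call a tight timeout and bounded retries with exponential backoff; a full circuit breaker would additionally stop calling a dependency that keeps failing: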

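import httpx
import tenacity

# 0.5s applies to each phase (connect, read, write, pool), not the whole call.
client = httpx.Client(timeout=0.5)


class BillingTimeout(Exception):
    """The billing service did not answer within the timeout."""


@tenacity.retry(
    # Retry timeouts only; HTTP error statuses propagate immediately.
    retry=tenacity.retry_if_exception_type(BillingTimeout),
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=1, min=0.1, max=2),
    reraise=True,  # raise BillingTimeout itself, not tenacity.RetryError
)
def call_billing_service(invoice_id):
    try:
        response = client.get(f"https://billing/api/v1/invoice/{invoice_id}")
        response.raise_for_status()
        return response.json()
    except httpx.TimeoutException as exc:
        raise BillingTimeout("Billing service timeout") from exc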
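Retries alone keep hammering a dependency that is already down. A circuit breaker adds the missing behaviour: after a run of failures it rejects calls immediately for a cool-down period, then lets one trial call through. The class below is a minimal single-threaded sketch, not a library API; `CircuitBreaker`, `CircuitOpenError` and the thresholds are hypothetical, and a production version would need locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast without touching the dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like `breaker.call(call_billing_service, invoice_id)`, with `CircuitOpenError` mapped to a fast fallback or a 503 rather than another slow attempt.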

And push anything that doesn't need to happen in the request — image resize, PDFs, emails — onto a queue (RQ / Celery / SQS).

Step 4 — Back-pressure and load shedding (15 minutes)

When you're past capacity, fail fast. Respond with 429/503 early — never after the system has already melted down — and let clients back off exponentially. A token bucket per IP is enough to stop one caller taking down the tail for everyone.
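A per-IP token bucket can be sketched in a few lines. This is an in-process illustration under assumed numbers: the class names, the rate and the burst capacity are hypothetical, and a multi-instance deployment would need a shared store instead of a local dict.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    def __init__(self, rate=10, capacity=20, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}  # client IP -> TokenBucket (unbounded in this sketch)

    def status_for(self, client_ip):
        """Return the HTTP status to send: 200 to proceed, 429 to shed."""
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_ip] = bucket
        return 200 if bucket.allow() else 429
```

The check runs before any real work, so a rejected request costs almost nothing. One noisy caller drains only its own bucket, and the dict should be evicted periodically in anything long-running.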

Step 5 — Verify and set an error budget

Make it stick with an SLO and an alert.

# Prometheus alerting rule
groups:
  - name: latency_slo
    rules:
      - alert: HighLatencyBreach
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.25
        for: 15m
        labels:
          severity: warning

Plot p95 daily, target ≤ 250ms, and write the SLO down (for example, "p95 ≤ 250ms in at least 99% of 5-minute windows per week"). Teams that track SLO burn-down fix tails roughly 3× faster.
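The error budget itself is simple arithmetic. The sketch below assumes the budget is counted in 5-minute evaluation windows, matching the alert's rate interval, with 1% of them allowed to breach; that unit and the `error_budget` helper are assumptions, and the same calculation works if you count individual requests instead.

```python
def error_budget(total_units, bad_units, allowed_fraction=0.01):
    """Return how much of the error budget has been consumed."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 < allowed_fraction <= 1:
        raise ValueError("allowed_fraction must be in (0, 1]")
    allowed_bad = total_units * allowed_fraction
    return {
        "allowed_bad": allowed_bad,
        "consumed": bad_units / allowed_bad,  # 1.0 means the budget is spent
        "remaining_units": allowed_bad - bad_units,
    }


WINDOWS_PER_WEEK = 7 * 24 * 12  # number of 5-minute windows in a week
report = error_budget(WINDOWS_PER_WEEK, bad_units=5)
```

With 2,016 windows in a week, a 1% budget allows about 20 breaching windows, so five bad windows have used roughly a quarter of it.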

What it added up to

On this checkout API: p95 down roughly 85% (1,500ms → 230ms) with no change in hardware spend. No new instances, just measurement, the right index, key-set pagination, a cache, bounded concurrency and graceful shedding.

If you want a guided trace and query-plan review on your own system, that's a good use of a short call.

FAQ

Why optimise p95 instead of average latency?

Users feel the tail, not the mean. A good average hides the slow requests that drive abandonment — on a 4.2M-session sample, p95 above ~300ms correlated with a 12% uplift in cart abandonment. Track p50, p95 and p99 together; the gap between them tells you where the jitter is.

When is scaling hardware actually the right call?

After you've fixed the query plans, indexes, pagination and concurrency — not before. Hardware scaling is typically 5–15× more expensive than the query and index fixes that remove the bottleneck, and bursting often just hides the underlying bug.

How do I know I have a tail-latency problem and not a throughput one?

Plot p50, p95 and p99 per span. If p99 is more than 2× p95, or p95 is more than 3× p50, you have jitter bombs — a few requests hitting a slow path (a seq scan, a lock, a cold cache) rather than a uniform slowdown.
