Why optimise p95 instead of average latency?

Users feel the tail, not the mean. A good average hides the slow requests that drive abandonment — on a 4.2M-session sample, p95 above ~300ms correlated with a 12% uplift in cart abandonment. Track p50, p95 and p99 together; the gap between them tells you where the jitter is.

When is scaling hardware actually the right call?

After you've fixed the query plans, indexes, pagination and concurrency — not before. Hardware scaling is typically 5–15× more expensive than the query and index fixes that remove the bottleneck, and bursting often just hides the underlying bug.

Performance · 3 min read · 29 July 2025

Cutting p95 latency without new hardware

How do I know I have a tail-latency problem and not a throughput one?

Plot p50, p95 and p99 per span. If p99 is more than 2× p95, or p95 is more than 3× p50, you have jitter bombs — a few requests hitting a slow path (a seq scan, a lock, a cold cache) rather than a uniform slowdown.

TL;DR

You can usually cut p95 tail latency by 40% or more without adding hardware. Measure with distributed tracing, relieve the database with the right indexes and key-set pagination, cap concurrency and add circuit breakers, then shed load gracefully. On one checkout API this took p95 from 1,500ms to 230ms with zero new instances.

Most "we need a bigger instance" conversations are premature. Tail latency is almost always caused by a query plan, a missing index, a pagination pattern or unbounded concurrency, and all four are cheaper to fix than to scale around. Here's the playbook I used to take one checkout API from a 1,500ms p95 to under its 250ms target at 500 RPS, on the same hardware.

What are we actually trying to hit?

The objective on this job was concrete: bring checkout API p95 from 1,500ms to 250ms under 500 RPS, without adding hardware.

Why p95 and not the average: across a 4.2M-session sample, p95 above ~300ms correlated with a 12% uplift in cart abandonment. The tail is the revenue, so the tail is the target.

Why it matters in general:

User perception is driven by tail latency, not the average.
Cost: hardware scaling is 5–15× more expensive than query and index fixes.
Ops: bursting to mask a slow path just hides the bug and adds noisy-neighbour risk.

Step 1 — Measure (15 minutes)

You cannot fix what you cannot see. Add distributed tracing and look at the waterfall before touching anything.

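# FastAPI example
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
# Trace every route except URLs matching "health" (e.g. load-balancer probes)
FastAPIInstrumentor.instrument_app(app, excluded_urls="health")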

Export spans to Grafana Tempo, then plot p50, p95 and p99 for each span. Quick heuristic: if p99 > 2 × p95 or p95 > 3 × p50, you have jitter bombs — a few requests hitting a slow path rather than a uniform slowdown.
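The heuristic above is easy to script against exported span durations. This is a minimal sketch, assuming you already have a list of latencies in milliseconds for one span; the function names `percentiles` and `has_jitter` are made up for illustration.

```python
from statistics import quantiles

def percentiles(latencies_ms):
    """Return (p50, p95, p99) from a list of at least two latency samples."""
    # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94], cuts[98]

def has_jitter(latencies_ms):
    """Apply the rule of thumb: p99 > 2 x p95, or p95 > 3 x p50."""
    p50, p95, p99 = percentiles(latencies_ms)
    return p99 > 2 * p95 or p95 > 3 * p50

# A mostly fast span where one request in ten hits a slow path.
spiky = [100] * 90 + [1000] * 10
# A uniformly slow span: every request is equally bad, so no jitter.
uniform = [500] * 100
```

Here `has_jitter(spiky)` is true and `has_jitter(uniform)` is false: the second span is slow, but it is a throughput or capacity problem rather than a tail problem.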

Step 2 — Relieve the database (1–2 hours)

This is where most of the win lives. The three highest-yield moves:

Symptom	Action	p95 before	p95 after
Seq scan on `(customer_id, created_at)`	`CREATE INDEX idx_orders_cust_created ON orders(customer_id, created_at DESC);`	890ms	320ms
Large `OFFSET` usage	Switch to key-set pagination (`WHERE id < :cursor`)	650ms	220ms
Hot lookup (`SELECT * FROM settings WHERE tenant_id = ?`)	Redis cache (`EX 10`)	2ms	350µs
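To make the pagination row concrete, here is a small sketch of key-set pagination. It uses an in-memory SQLite table so it runs anywhere; the `orders` schema, the `total` column and the `fetch_page` helper are assumptions for illustration, not the production schema.

```python
import sqlite3

def fetch_page(conn, cursor_id=None, page_size=3):
    """Return (rows, next_cursor), newest first, without using OFFSET."""
    if cursor_id is None:
        rows = conn.execute(
            "SELECT id, total FROM orders ORDER BY id DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    else:
        # The index on id lets the database seek straight to the cursor
        # instead of scanning and discarding all the skipped rows.
        rows = conn.execute(
            "SELECT id, total FROM orders WHERE id < ? ORDER BY id DESC LIMIT ?",
            (cursor_id, page_size),
        ).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders (id, total) VALUES (?, ?)",
    [(i, i * 10.0) for i in range(1, 8)],
)

page1, cursor1 = fetch_page(conn)
page2, cursor2 = fetch_page(conn, cursor1)
page3, cursor3 = fetch_page(conn, cursor2)
```

The cost of each page stays roughly constant however deep the client scrolls, which is why this pattern tends to flatten the tail where `OFFSET` makes it worse with every page.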

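The goal is for hot queries to be served entirely from memory: in `EXPLAIN (ANALYZE, BUFFERS)` output, the `Buffers:` line should show shared hits and no reads.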
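For the hot-lookup row in the table, the caching pattern looks roughly like this. The sketch uses an in-process dictionary as a stand-in for Redis so that it is self-contained; `TTLCache`, `get_or_load` and the loader function are hypothetical names, and the 10-second default mirrors the `EX 10` expiry from the table.

```python
import time

class TTLCache:
    """Tiny read-through cache with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._store = {}             # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: no database round trip
        value = loader(key)          # miss or expired: go to the source
        self._store[key] = (now + self._ttl, value)
        return value

def load_settings(tenant_id):
    # Placeholder for: SELECT * FROM settings WHERE tenant_id = ?
    return {"tenant_id": tenant_id, "currency": "EUR"}

cache = TTLCache(ttl_seconds=10.0)
settings = cache.get_or_load(7, load_settings)
```

A short TTL keeps staleness bounded without needing explicit invalidation, which is usually an acceptable trade for settings-style data.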

Step 3 — Control concurrency (30 minutes)

Unbounded concurrency turns a small slowdown into a queue. Cap it.

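# PostgreSQL configuration (postgresql.conf)
max_connections = 64                             # hard cap on server connections
statement_timeout = 700ms                        # cancel any statement that runs longer
shared_preload_libraries = 'pg_stat_statements'  # per-query stats; needs a restart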
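The same cap can be enforced on the application side. This is a sketch only, assuming an asyncio service: `run_bounded` and `fake_query` are hypothetical names, and the limit of 4 is chosen to keep the demo small rather than as a recommendation.

```python
import asyncio

async def run_bounded(coros, limit):
    """Run coroutines with at most `limit` of them in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:              # waits here once `limit` are running
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

async def _demo():
    in_flight = 0
    peak = 0

    async def fake_query(i):
        # Stand-in for a database call; records how many run concurrently.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i

    results = await run_bounded([fake_query(i) for i in range(20)], limit=4)
    return results, peak

results, peak = asyncio.run(_demo())
```

Twenty calls are submitted, but the recorded peak never exceeds the limit, so excess work waits in the application instead of piling onto the database.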

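Pool size ≈ vCPU × 2 (usually 32–64). Set a statement_timeout so a single slow query can't hold a connection forever. Then put a hard timeout and bounded retries around downstream calls, ideally behind a circuit breaker that stops calling a dependency once it keeps failing: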

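import httpx
import tenacity

# 500 ms budget per attempt, so a slow dependency cannot stall the request
client = httpx.Client(timeout=0.5)

@tenacity.retry(
    # Retry only transient transport failures (timeouts, connection errors)
    retry=tenacity.retry_if_exception_type(httpx.TransportError),
    stop=tenacity.stop_after_attempt(3),
    # Exponential backoff between attempts, capped at 2 s
    wait=tenacity.wait_exponential(multiplier=1, min=0.1, max=2),
    reraise=True,  # surface the original httpx error, not tenacity.RetryError
)
def call_billing_service(invoice_id):
    response = client.get(f"https://billing/api/v1/invoice/{invoice_id}")
    # 4xx/5xx raise httpx.HTTPStatusError, which is deliberately not retried
    response.raise_for_status()
    return response.json()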
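Retries alone keep hammering a dependency that is already down. A circuit breaker adds the missing piece: after repeated failures it fails fast for a cool-down period. The following is a minimal sketch with no external libraries; `CircuitBreaker`, `CircuitOpenError` and the thresholds are illustrative assumptions, not a drop-in for any particular package.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In use, you would wrap the downstream call, for example `breaker.call(call_billing_service, invoice_id)`, and translate `CircuitOpenError` into a fast fallback or a 503.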

And push anything that doesn't need to happen in the request — image resize, PDFs, emails — onto a queue (RQ / Celery / SQS).
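The shape of that hand-off is the same whichever queue you pick. Below is a sketch using only the standard library as a stand-in for RQ, Celery or SQS; `handle_checkout`, the job payload and the queue size are hypothetical.

```python
import queue
import threading

jobs = queue.Queue(maxsize=100)  # bounded, so a backlog cannot grow without limit
done = []                        # demo only: results of completed jobs

def worker():
    """Background loop: run queued jobs outside the request path."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel value shuts the worker down
            jobs.task_done()
            break
        try:
            done.append(job())
        finally:
            jobs.task_done()

def handle_checkout(order_id):
    """Request path: enqueue the slow work and return immediately."""
    # put_nowait raises queue.Full when the backlog is at capacity,
    # which is the signal to shed load rather than block the request.
    jobs.put_nowait(lambda: f"receipt-email-sent:{order_id}")
    return {"order_id": order_id, "status": "accepted"}

threading.Thread(target=worker, daemon=True).start()
response = handle_checkout(42)
jobs.join()                      # demo only: wait for the background job
```

The request returns as soon as the job is enqueued, so the time spent rendering a PDF or sending an email no longer counts towards p95.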

Step 4 — Back-pressure and load shedding (15 minutes)

When you're past capacity, fail fast. Respond with 429/503 early — never after the system has already melted down — and let clients back off exponentially. A token bucket per IP is enough to stop one caller taking down the tail for everyone.
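A per-IP token bucket fits in a few lines. This is a sketch under stated assumptions: `TokenBucket`, `PerClientLimiter` and the default rate and burst values are illustrative, and a real deployment would normally keep the buckets in shared storage rather than in process memory.

```python
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        """Spend one token if available; refill based on elapsed time."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

class PerClientLimiter:
    """One bucket per client IP, created lazily on first request."""

    def __init__(self, rate=5, capacity=10, clock=time.monotonic):
        self._buckets = defaultdict(lambda: TokenBucket(rate, capacity, clock))

    def status_for(self, client_ip):
        # 200 means "proceed"; 429 tells the caller to back off and retry later.
        return 200 if self._buckets[client_ip].allow() else 429
```

Because the check is a couple of arithmetic operations, it can run before any expensive work starts, which is what makes the rejection cheap enough to count as failing fast.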

Step 5 — Verify and set an error budget

Make it stick with an SLO and an alert.

# Prometheus alerting rule
groups:
  - name: latency_slo
    rules:
      - alert: HighLatencyBreach
        # Aggregate buckets across instances so the alert tracks service-wide p95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.25
        for: 15m
        labels:
          severity: warning

Plot p95 daily, target ≤ 250ms, and write the SLO down ("p95 ≤ 250ms, with at most 1% of measurement windows per week in breach"). Teams that track SLO burn-down fix tails roughly 3× faster.
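Tracking burn-down needs only a little arithmetic. The sketch below assumes one particular reading of the budget, which the text does not spell out: p95 is sampled in fixed windows (5-minute windows here, so 2,016 per week) and at most 1% of them may exceed 250ms. The function name and report fields are hypothetical.

```python
def budget_report(window_p95_ms, threshold_ms=250.0, budget=0.01):
    """Summarise how much of the error budget a set of windows has used."""
    breaches = sum(1 for p95 in window_p95_ms if p95 > threshold_ms)
    allowed = budget * len(window_p95_ms)   # breaching windows the SLO tolerates
    return {
        "breaches": breaches,
        "allowed": allowed,
        # burn of 1.0 means the whole budget is spent
        "burn": breaches / allowed if allowed else float("inf"),
        "within_slo": breaches <= allowed,
    }

# One week of 5-minute windows: 7 * 24 * 12 = 2016 samples.
healthy_week = [230.0] * 2000 + [400.0] * 16
bad_week = [230.0] * 1991 + [400.0] * 25

healthy = budget_report(healthy_week)
bad = budget_report(bad_week)
```

With a budget of about 20 windows per week, 16 breaches leaves headroom and 25 does not, which gives the team an unambiguous point at which tail work takes priority over features.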

What it added up to

On this checkout API: −84% p95 (1,500ms → 230ms) with no change in hardware spend. No new instances, just measurement, the right index, key-set pagination, a cache, bounded concurrency and graceful shedding.

If you want a guided trace and query-plan review on your own system, that's a good use of a short call.