Percentile Latency and Why Averages Lie
What This Concept Is
Latency is a distribution, not a number. When you summarize a distribution with one number, you lose almost everything that matters in production. Percentiles are the standard way to describe what users actually experience.
- p50 (median): half of requests are faster than this, half slower. Roughly the "typical" user.
- p95: 95% of requests are faster than this. Tail starts here.
- p99: 99% of requests are faster than this. One in a hundred is slower.
- p999 (three nines): 99.9% are faster. One in a thousand.
- p9999 (four nines): 99.99% are faster. One in ten thousand.
The mean (average) is not on this list for a reason: a distribution with a heavy tail can have a mean that makes the service look healthy while a significant fraction of users wait many seconds. At scale, "one in a hundred is a bad experience" is not a small number. A service with 10M requests per day at p99 = 2 seconds subjects 100,000 requests per day to a wait of two seconds or longer. The Google SRE book states it in one line: "If you run a web service with an average latency of 100 ms at 1,000 requests per second, 1% of requests might easily take 5 seconds" - a factor-of-50 gap between the mean and the tail that a mean dashboard cannot show.
Why It Matters Here
Production latency distributions are almost always right-skewed (a long tail toward slow). Every source of tail latency is cumulative: a user's page load touches many services, and their experience is governed by the slowest one. The tail-at-scale effect, named by Jeff Dean and Luiz Barroso (CACM 2013), means that at high fan-out (roughly 70 parallel calls, since 0.99^69 ≈ 0.5), p99 of the slow service becomes p50 of the user's experience. The paper's warning is blunt: "temporary high latency episodes which are unimportant in moderate size systems may come to dominate overall service performance at large scale."
If you optimize for the mean, you are optimizing for the experience of nobody in particular. If you alert on the mean, you will first hear about outages from customers, support tickets, and social media - every signal except the one you control.
Percentile-aware thinking also enables hedged requests and tied requests, the tail-at-scale paper's core mitigations. A hedged request re-issues a second copy of a request if the first is running slowly and cancels whichever loses; a tied request sends both copies up front and cancels the loser as soon as one starts executing. These only work when you know what "slowly" means at the distribution level.
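The hedging idea can be sketched in a few lines of Python with asyncio. This is a minimal illustration, not a production client: `hedged_call`, `make_fake_backend`, and the latency values are hypothetical names and numbers invented for the example, and `hedge_after_s` stands in for whatever "slowly" means for your service (for instance its measured p95).

```python
import asyncio

async def hedged_call(make_request, hedge_after_s):
    """Start one request; if it is still running after hedge_after_s seconds,
    start a second copy and return whichever finishes first."""
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()  # fast path: no hedge was needed

    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the loser so it stops consuming capacity
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()

def make_fake_backend(latencies_s):
    """Hypothetical backend for the demo: the i-th call sleeps latencies_s[i]."""
    calls = iter(latencies_s)

    async def request():
        delay = next(calls)
        await asyncio.sleep(delay)
        return delay

    return request

if __name__ == "__main__":
    # First copy is stuck in the tail (0.5 s); the hedge returns in 10 ms.
    backend = make_fake_backend([0.5, 0.01])
    print(asyncio.run(hedged_call(backend, hedge_after_s=0.05)))
```

The hedge threshold is the design decision: set it too low and every request is sent twice; set it too high and the hedge arrives after the user has already waited. The sketch also assumes the request is safe to issue twice.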
Concrete Example
Two services, A and B, each handle 100 requests.
System A: 50 requests complete in 10ms, 49 complete in 20ms, and 1 request takes 1,980ms (a GC pause, a slow dependency, a retry - pick your story).
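- Sum: 50 * 10 + 49 * 20 + 1 * 1980 = 500 + 980 + 1980 = 3,460ms.
- Mean: 3,460 / 100 = 34.6ms.
- p50: the 50th fastest is 10ms.
- p95: the 95th fastest is 20ms.
- p99: with exactly 100 samples this sits on the boundary. The 99th fastest is 20ms and the one request above it is the 1,980ms one, so a strict nearest-rank p99 reads 20ms and the outlier surfaces at p99.9 and max. As soon as slightly more than 1% of requests are slow (two in a hundred), p99 ≈ 1,980ms.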
System B: all 100 requests complete in exactly 34.6ms.
- Mean: 34.6ms.
- p50, p95, p99, p999: all 34.6ms.
Same mean. Wildly different systems. System A has a user-visible outage hiding in its tail; System B does not. A latency dashboard showing only the mean would call these identical. An SLO written against the mean would let System A stay green forever.
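The comparison can be checked with a short Python sketch. The `percentile` helper below is an assumption of this example: it uses the nearest-rank definition, which is only one of several percentile conventions. Under that definition, with exactly 100 samples, the p99 of System A lands on the 99th value and the single slow request shows up at p99.9 and max, so the tail is visible only if you look past the mean and the median.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    # round() guards against float noise such as 99.00000000000001 before ceil().
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]

# Latencies in milliseconds, taken from the two systems in the example.
system_a = [10] * 50 + [20] * 49 + [1980]
system_b = [34.6] * 100

for name, samples in (("A", system_a), ("B", system_b)):
    print(
        name,
        "mean", round(mean(samples), 1),
        "p50", percentile(samples, 50),
        "p95", percentile(samples, 95),
        "p99", percentile(samples, 99),
        "p99.9", percentile(samples, 99.9),
        "max", max(samples),
    )
```

Both systems print the same mean; only the upper percentiles and the max separate them.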
Now scale it up. If System A runs 1,000 req/s at that distribution, then 10 req/s are experiencing the 1,980ms wait - 36,000 slow requests per hour, every hour, invisible to anyone who reads only the mean.
Fan-out illustration. A page that issues 10 parallel backend calls and waits for all of them is slow with probability p_page = 1 - (1 - p_single)^10, where p_single is the chance that any one call is slow and the calls are assumed independent. If each backend has p99 = 1s (1% slow), the page sees a slow response 1 - 0.99^10 ≈ 9.6% of the time. A p99 backend becomes a p90 user experience. Increase fan-out to 100 and about 63% of page loads hit at least one slow call: the slow case is now the common case.
Real-service example. A social-media feed composer at 3,000 RPS fans out to 40 microservices per timeline build, with a per-service p99 of 40ms. Naive expectation is "p99 of the page is 40ms." Actual math: probability that all 40 services return within 40ms is 0.99^40 ≈ 66.9%. So about 33% of page loads experience at least one slow hop. Measured page p99 is closer to 150ms - the tail of the slowest-of-forty distribution, which is modeled as the maximum of 40 draws. Hedged requests at the p95 threshold of each service cut this close to 10%; tied requests cut it further at roughly 2x cost. This is the same style of arithmetic Dean and Barroso work through in The Tail at Scale, and it is the reason large fan-out search systems spend money on hedging.
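The fan-out arithmetic is short enough to keep as a helper. The sketch below assumes the backend calls are independent and share the same per-call slow probability, which real services only approximate; `fraction_slow` is a name made up for this example.

```python
def fraction_slow(per_call_slow_prob, fan_out):
    """Probability that at least one of fan_out independent calls is slow."""
    return 1 - (1 - per_call_slow_prob) ** fan_out

# 1% of calls are slow (each backend measured at its p99).
for fan_out in (1, 10, 40, 100):
    print(fan_out, f"{fraction_slow(0.01, fan_out):.1%}")
```

If slow calls are correlated (a shared dependency stalls, a whole rack pauses), independence fails and the real fraction can be lower or higher than this estimate.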
Common Confusion / Misconception
"We will average our p99 across regions to get a single global p99." You cannot average percentiles. The average of the 99th percentile of two different distributions is not the 99th percentile of their union. To get a true global p99, you need a histogram (e.g., the HDR Histogram data structure) that you merge across regions and then query. Averaging percentiles routinely underestimates tail latency and hides outages.
"Our p50 is fine, so users are fine." p50 is the median user. The dissatisfied users are at p95, p99, and above. At scale, those users are your support-ticket generators, your retention risk, and your credibility problem.
"Our fallback saves the tail." Sometimes. More often, a fallback introduces a new tail. If the primary p99 is 200ms and the fallback is "call a slower service on timeout," you have converted some 200ms+ requests into timeout + fallback_duration requests that are now 5s+ each. AWS's Builders' Library has an entire piece warning against this pattern precisely because it looks helpful and is not.
"Coordinated omission is a theory problem, not ours." Gil Tene's "How NOT to Measure Latency" talk exists because most load-test harnesses systematically underreport tail latency. When a request is slow, the harness waits for it before sending the next one, "omitting" all the slow requests that would have queued during that stall. The true p99 under overload can be 10-100x what the naive harness reports. Modern tools (wrk2, Gatling, hdrhistogram) correct for this; older loop-based load tests do not.
"The p99 is just one request in a hundred - we can ignore it." For a service at 10,000 req/s, p99 is 100 requests per second, 6,000 per minute, 360,000 per hour. That is not an edge case; that is a continuous stream of dissatisfied users.
"p99 is noisy so we use a longer aggregation window." Longer windows hide when the tail fires. If you publish a 5-minute p99, a 30-second spike gets smoothed into background. The SRE practice is to report p99 per-minute and alert on burn rate, not on a single window. Hide noise in the viewer (zoom out), not in the source.
How To Use It
For every user-facing latency metric:
- Instrument as a histogram, not as a mean. Prometheus histogram_quantile, the HDR Histogram format, and most modern metric stores support this natively. The SRE book recommends buckets distributed approximately exponentially (e.g., factors of ~3: 0-10ms, 10-30ms, 30-100ms, 100-300ms, ...).
- Report at minimum p50, p95, p99. Report p999 if fan-out is high or the business is latency-sensitive (trading, ads, games, voice).
- Write SLOs against a specific percentile. "99% of HTTP requests return in under 500ms, measured over 30 days" is an SLO. "Average latency is under 500ms" is a press release.
- When you see a p99 spike, do not look at the mean; look at what changed in the tail. Usually: GC pause, dependency timeout, retry storm, cold cache, head-of-line blocking, or a specific slow customer (a "noisy neighbor").
- Measure successful-request latency separately from failed-request latency. A quickly returning 500 drags the latency percentiles down, so a mixed metric looks faster during a failure and hides it from the latency dashboard.
- For high fan-out, consider tail-tolerant techniques: hedged requests (issue a second copy after the original exceeds p95), tied requests (send both, cancel the loser), request priority (do not let a tail-heavy customer steal capacity from p50 users).
- Log the ten slowest requests per minute with full trace IDs. This is the cheapest tail-debugging tool that exists. The sample is tiny (10 * 60 = 600 per hour), trivially cheap to store, and yet any recurring tail pathology (a specific customer, a specific shard, a specific code path) becomes visible within one working day.
- Alert on percentile, not mean, and pair alerts with burn-rate. A single p99 breach for one minute is usually noise; a sustained burn of your error budget over 5-30 minutes is signal. Multi-window, multi-burn-rate alerts (see concept 07) replace the brittle single-threshold page.
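To make the histogram advice concrete, here is a minimal sketch of exponential buckets and a quantile estimate that interpolates linearly inside the matching bucket, similar in spirit to how Prometheus documents histogram_quantile. The bucket bounds, function names, and sample data are assumptions for illustration, not any library's API.

```python
import bisect

# Upper bounds in ms, growing by factors of ~3; one extra overflow slot follows.
BOUNDS_MS = [10, 30, 100, 300, 1000, 3000]

def new_histogram():
    return [0] * (len(BOUNDS_MS) + 1)

def observe(counts, latency_ms):
    """Increment the first bucket whose upper bound is >= latency_ms."""
    counts[bisect.bisect_left(BOUNDS_MS, latency_ms)] += 1

def estimate_quantile(counts, q):
    """Estimate the q-quantile (0 < q <= 1) by linear interpolation within a bucket."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q * total
    seen = 0
    for idx, count in enumerate(counts):
        if count and seen + count >= target:
            if idx == len(BOUNDS_MS):
                # Overflow bucket: all we know is "at least the last bound".
                return float(BOUNDS_MS[-1])
            lower = BOUNDS_MS[idx - 1] if idx else 0
            upper = BOUNDS_MS[idx]
            return lower + (upper - lower) * (target - seen) / count
        seen += count

hist = new_histogram()
for _ in range(990):
    observe(hist, 5)    # typical requests
for _ in range(10):
    observe(hist, 250)  # tail requests, all exactly 250 ms

print("p50   ~", round(estimate_quantile(hist, 0.5), 2), "ms")
print("p99.5 ~", round(estimate_quantile(hist, 0.995), 2), "ms")
```

Every tail request here took 250 ms, yet the estimate can land anywhere between 100 and 300 ms because the histogram only knows the bucket. That is the accuracy cost of wide buckets, and it is why bucket boundaries deserve attention near your SLO threshold.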
Check Yourself
- If your service has mean latency 50ms and p99 latency 2000ms, how many seconds per hour of slow experience does a 10,000 req/s service deliver?
- Why can you not average p99 across two shards?
- Give one concrete mechanism for each: a fast p50 and a slow p99.
- A page issues 20 parallel backend calls, each with p99 = 500ms. Roughly what fraction of page loads sees at least one backend > 500ms?
- What is coordinated omission and why does it bias load-test numbers downward?
- Why does a 5-minute aggregation window on a p99 dashboard hide short spikes, and what alerting pattern recovers the signal?
- Hedged requests reduce tail latency. Name one workload where they are a bad idea and explain why.
Mini Drill or Application
Take any ten latency samples you can gather (real or synthetic). Sort them. Compute mean, p50, p90, p99 by hand. Now mutate the dataset by replacing the largest value with a value 100x bigger. Recompute. Notice: mean jumps, p50 does not, p99 does. Now restore the original data and instead slow down the six fastest samples by a modest factor (say 3x, keeping them below the maximum); notice p50 jumps but p99 does not move. Do not multiply the median itself by 100: that sample simply becomes the new maximum, so p99 jumps and p50 only slides to its neighbor. This is why you need multiple percentiles - each one reports a different part of the distribution.
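If you want to check your hand calculation for the first mutation, a sketch along these lines works. The ten samples are synthetic, and the nearest-rank `percentile` helper is an assumption of this example; with only ten samples, p99 under that definition is simply the maximum.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]

def summarize(samples):
    return {
        "mean": mean(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
        "p99": percentile(samples, 99),
    }

# Ten synthetic latency samples in milliseconds.
baseline = [12, 15, 11, 14, 13, 18, 16, 17, 19, 40]

# Mutation: replace the largest value with one 100x bigger.
tail_mutated = sorted(baseline)[:-1] + [max(baseline) * 100]

print("baseline    ", summarize(baseline))
print("tail mutated", summarize(tail_mutated))
```

Passing any other mutated list to `summarize` lets you test a prediction before trusting it.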
Extension drill. Take a real service you operate. Pull p50, p95, p99, and p999 for one hour. Compute the ratios p99/p50 and p999/p99. A healthy distributed service typically has p99/p50 between 2x and 10x; p999/p99 between 2x and 5x. If either ratio is much larger, you have a tail pathology (GC, retries, noisy neighbors) worth investigating. If either is much smaller (close to 1), you are probably suffering from coordinated omission or from a histogram with buckets too wide to capture the tail. Write down which of those your ratios suggest.
Transfer / Where This Shows Up Later
Percentile thinking is the backbone of every "is this system fast enough" conversation you will have from here on. Once you install it, you cannot un-install it.
- This module, concept 07 (SLOs): every SLO is a percentile claim over a window. "99.9% of requests within 500ms over 30 days" is the canonical form. Without histograms, this SLO is unmeasurable.
- This module, concept 10 (queueing): Little's Law gives you means; the tail comes from variance in service time. Any queueing mitigation (bounded queues, back-pressure) is really a tail-latency mitigation.
- This module, concept 11 (load shedding): hedged requests and priority shedding are direct applications of tail-at-scale thinking -- you let the slow request die so the fast one wins.
- S8 M5 (leadership): when the business asks "how fast is the app," the answer that survives is a percentile, not a mean. The mean is how dashboards lie to executives.
- S9 M5 (observability): histogram buckets are the expensive metric type; choosing bucket boundaries (and whether to store p999 at all) is a real cost decision you will defend in capacity reviews.
- S10 M4 (operational readiness): the readiness review asks, for every tier, "what percentile are you alerting on, and what is your coordinated-omission story?" If you cannot answer, the tier is not ready.
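A quiet corollary: percentiles compose poorly across hops, so your mental model for a request path is add p50s, multiply tails. For a five-hop sequential call, the path median is roughly the sum of the per-hop medians. But if each hop has p99 = 50ms, the chance that a request avoids every hop's tail is only 0.99^5 ≈ 95.1%, so about 4.9% of requests hit at least one hop slower than 50ms. The per-hop p99 is already biting around the path's p95, because any single hop can be the slow one.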
A leadership-level corollary: when you present percentile numbers to non-engineers, always include one number with a unit the user would feel (e.g., "1% of checkouts take >2 seconds, which is 12,000 checkouts per day"). Raw p99 means nothing to a VP; user-feeling numbers mean everything.
Read This Only If Stuck
Local chunks (book anchors)
- System Design Primer: Powers of Two and Latency Numbers -- keep the "numbers every programmer should know" table close at hand; p99s you cannot justify against it are usually measurement artifacts.
- System Design Primer: Latency vs Throughput -- the clean two-quantity framing; percentile-latency is its distributional version.
- System Design Primer: Availability Patterns -- "fail open" / "graceful degradation" is how you keep fast p99 when a dependency is the tail source.
- FoSA: Measuring Architecture Characteristics -- the "what to instrument first" chapter; percentile latency is the first row.
- FoSA: Architecture Characteristics Defined -- performance as a distinct characteristic; the chapter insists you pick a measurable one.
- FoSA: Cross-Cutting Architecture Characteristics -- tail-latency concerns cross every layer; treat them as a cross-cutting concern, not a service concern.
External canonical references
- Google SRE Book, Monitoring Distributed Systems: Worrying About Your Tail -- the section that makes the "p99 of a backend becomes p50 of a page" argument precisely.
- Jeff Dean and Luiz Barroso, The Tail at Scale (CACM 2013) -- the canonical paper; read sections on hedged and tied requests in full.
- Gil Tene, How NOT to Measure Latency -- the 45-minute talk every senior engineer should have watched. Coordinated omission and HdrHistogram are covered here.
- Gil Tene, HdrHistogram -- the open-source histogram structure that actually captures wide-range latencies with fixed memory; the operational pattern you should copy.
- AWS Builders' Library, Avoiding fallback in distributed systems -- why fallbacks that look good in the mean destroy the tail; contains production war stories.
- Marc Brooker, Latency and The Great Ones -- a principal engineer's blog on why nines and percentiles compose poorly, and what to measure instead.