Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals

What This Concept Is

Three quantities appear in every performance discussion and get confused with each other constantly:

Latency: the time from request submitted to response received. A duration. Measure in ms or s, per request.
Throughput: the number of requests completed per unit time. A rate. Measure in req/s, ops/s, MB/s.
Utilization: the fraction of a resource's capacity currently in use. A ratio in [0, 1]. Resource-scoped (per CPU, per disk, per pool).

They are not the same number, they do not necessarily move together, and a healthy dashboard reports all three. The System Design Primer is explicit: "latency is the time to perform some action or to produce some result; throughput is the number of such actions or results per unit of time" - two separate quantities, not two names for the same thing.
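As a minimal sketch of how the three quantities come out of the same data, the snippet below derives each one from a list of request records. The Request record, the summarize helper, and the sample numbers are illustrative assumptions; the mean is used only to keep the arithmetic short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    submitted_ms: int  # when the request was submitted
    responded_ms: int  # when the response was received


def summarize(requests, window_ms, workers=1):
    """Return (mean latency in ms, throughput in req/s, utilization in [0, 1]).

    Assumes no queueing, so time in the system equals busy time on a worker.
    """
    durations = [r.responded_ms - r.submitted_ms for r in requests]
    latency_ms = sum(durations) / len(durations)           # a duration
    throughput_rps = len(requests) / (window_ms / 1000)    # a rate
    utilization = sum(durations) / (window_ms * workers)   # a ratio
    return latency_ms, throughput_rps, utilization


# Hypothetical sample: ten 50 ms requests, one every 100 ms, over a 1 s window.
sample = [Request(i * 100, i * 100 + 50) for i in range(10)]
latency_ms, throughput_rps, utilization = summarize(sample, window_ms=1000)
print(latency_ms, throughput_rps, utilization)
```

The same ten records yield a duration, a rate, and a ratio. None of the three can be derived from one of the others without extra information such as the window length or the worker count.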

On top of these three base quantities, the industry has three canonical framings that tell you which metrics to actually put on a dashboard:

USE Method (Brendan Gregg, for resources): for every resource, measure Utilization, Saturation, Errors. Gregg's one-line summary is "For every resource, check utilization, saturation, and errors." He defines saturation not as utilization above 100% but as "the degree to which the resource has extra work which it can't service, often queued": in practice a queue length, run-queue depth, or wait count.
RED Method (Tom Wilkie, for services): for every service, measure Rate (requests/sec), Errors, Duration (latency).
Four Golden Signals (Google SRE): Latency, Traffic, Errors, Saturation. The SRE book adds two sharp refinements: (1) "distinguish between the latency of successful requests and the latency of failed requests" because a fast 500 looks good in an average but is still a failure; (2) "latency increases are often a leading indicator of saturation."

They overlap. That is a feature, not a bug - each framing optimizes for a different vantage point, and a mature team uses all three.
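To make the RED signals and the SRE book's first refinement concrete, here is a small sketch that computes Rate, Errors, and Duration from request records and keeps successful and failed latency apart. The Call record, the red_summary helper, the "status >= 500 is a failure" rule, and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    duration_ms: float
    status: int  # HTTP-style status code; >= 500 counts as a failure here


def mean(values):
    return sum(values) / len(values)


def red_summary(calls, window_s):
    """Rate, Errors, Duration, with Duration split by outcome."""
    ok = [c.duration_ms for c in calls if c.status < 500]
    failed = [c.duration_ms for c in calls if c.status >= 500]
    return {
        "rate_rps": len(calls) / window_s,
        "error_ratio": len(failed) / len(calls),
        "blended_mean_ms": mean(ok + failed),
        "ok_mean_ms": mean(ok) if ok else None,
        "failed_mean_ms": mean(failed) if failed else None,
    }


# Hypothetical second of traffic: eight 200 ms successes, two 5 ms failures.
calls = [Call(200, 200)] * 8 + [Call(5, 500)] * 2
summary = red_summary(calls, window_s=1)
print(summary)
```

In this sample the blended mean is lower than the mean of the successful requests alone, because the two fast failures pull it down. That is the effect the "fast 500" warning describes.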

Why It Matters Here

If you cannot separate latency, throughput, and utilization in one sentence, every downstream topic in this module fails:

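Percentile latency is a distribution of a duration; confusing it with throughput produces nonsense.
The Universal Scalability Law bounds throughput, not latency directly.
Little's Law (L = lambda * W) connects them: the average number of requests in the system (L) equals the average arrival rate (lambda) times the average time each request spends in the system (W). Applied to the queue alone, queue length equals arrival rate times wait time.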
SLOs are written against a specific quantity; conflating them makes the SLO impossible to measure.
Capacity planning numbers are only trustworthy when they bind the right quantity to the right resource.
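Little's Law from the list above is simple enough to check by hand, and a two-function sketch makes the units explicit. The function names and the input numbers are illustrative assumptions; the law itself holds for long-run averages of a stable system.

```python
def requests_in_system(arrival_rate_rps, time_in_system_s):
    """L = lambda * W: mean number of requests in the system."""
    return arrival_rate_rps * time_in_system_s


def time_in_system_s(in_system_count, arrival_rate_rps):
    """W = L / lambda: mean time each request spends in the system."""
    return in_system_count / arrival_rate_rps


# 40 req/s arriving, each spending 50 ms in the system.
concurrency = requests_in_system(40, 0.050)

# 200 requests observed in the system while 40 req/s flow through it.
wait_s = time_in_system_s(200, 40)

print(concurrency, wait_s)
```

The first call says that about two requests are in flight on average. The second says that if 200 requests are sitting in the system at that flow rate, each is spending about five seconds there.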

The USE/RED/Golden-Signals framings matter because they prevent the single most common production dashboarding failure: instrumenting whatever was easiest and calling it monitoring. Gregg's claim is that USE solves roughly 80% of server performance issues with 5% of the effort - a return on attention that no other methodology in this module matches.

Concrete Example

Imagine a service with one request-handling thread.

It takes 50ms per request on the CPU: latency = 50ms.
It completes 20 req/s: throughput = 20 req/s.
The CPU is busy for 20 req/s * 0.05 s/req = 1.0 of every second: utilization = 100%.

Now add a second thread, running on a second core.

Each request still takes 50ms on CPU: latency = 50ms (unchanged).
Two threads now complete 40 req/s: throughput = 40 req/s (doubled).
Each core is busy 100% of the time, so per-CPU utilization = 100%; a machine-level tool may report this as 200% of one core, which is why you measure resources at the resource, not at the machine.

Now overload the two-thread service: arrivals jump to 60 req/s while service time is still 50ms.

Queue forms. Latency starts climbing (wait time + service time), often catastrophically.
Throughput stays capped at 40 req/s because that is the service rate.
Utilization is pegged at 100%; saturation (queue depth) is now the metric that matters.
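The overload scenario above can be sketched with a second-by-second fluid approximation: arrivals are added to a backlog, the service completes at most its service rate, and the leftover backlog determines how long a new arrival would wait. The simulate function and its row layout are assumptions for illustration; a real queue would have per-request variation that this ignores.

```python
def simulate(arrival_rps, service_rps, seconds):
    """Return one (second, completed, backlog, est_wait_s) row per second."""
    backlog = 0
    rows = []
    for second in range(1, seconds + 1):
        backlog += arrival_rps                 # new work arrives
        completed = min(backlog, service_rps)  # throughput is capped by service rate
        backlog -= completed                   # saturation: work left in the queue
        est_wait_s = backlog / service_rps     # time for a new arrival to reach the front
        rows.append((second, completed, backlog, est_wait_s))
    return rows


# Two threads at 50 ms per request serve 40 req/s; arrivals are 60 req/s.
rows = simulate(arrival_rps=60, service_rps=40, seconds=5)
for row in rows:
    print(row)
```

Throughput stays flat in every row while backlog and estimated wait grow each second, which is why queue depth, not utilization, is the number to watch once the resource is pegged.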

Three numbers. Three different stories. One service. Gregg's counter-intuitive warning applies here too: he describes a case where CPU utilization "was never higher than 80%" on 5-minute averages, yet the system saturated because CPU hit 100% for seconds at a time. Averaging hides bursts; saturation is visible in the queue, not in the mean.
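A few lines of arithmetic show how a 5-minute average can stay under 80% while the CPU is pinned for seconds at a time. The sample values are hypothetical, chosen only to reproduce the shape of the case Gregg describes.

```python
# Hypothetical per-second CPU utilization samples (percent) over 5 minutes:
# 270 quiet seconds at 70% and one 30-second burst pinned at 100%.
samples = [70] * 270 + [100] * 30

five_minute_mean = sum(samples) / len(samples)
peak = max(samples)
seconds_pegged = sum(1 for s in samples if s >= 100)

print(five_minute_mean, peak, seconds_pegged)
```

The mean comes out below 80% even though the resource spent 30 full seconds with no spare capacity, so any alert keyed to the 5-minute mean would stay silent.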

A second worked example (RED vs USE on a real service). A payment gateway serves 10,000 req/s with p99 450ms at steady state. The RED dashboard reports: Rate 10,000, Errors 0.02%, Duration p99 450ms. All green. The USE dashboard on the same service's thread pool reports: Utilization 72% average but 100% during flushes, Saturation (queue depth) rising by 200/s for the last 15 minutes, Errors 0. USE is telling a story RED cannot: the service is accumulating queued work. Give it another 60 seconds and RED will catch up with a catastrophic Duration spike, but by then the queue is already too long to drain: a service rate of mu = 10,000 req/s does not exceed the arrival rate, so there is no spare capacity to work off the backlog. USE gave you 15 minutes of warning; RED will give you 30 seconds.

Common Confusion / Misconception

"The server is only at 40% CPU, so latency is fine." No. CPU is one resource. The USE method says you must also ask about memory bandwidth, disk IO queue depth, network buffers, lock contention, file descriptors, thread pools, and garbage-collection pauses. A dashboard showing only CPU utilization routinely misses the resource that actually saturated first. Gregg's USE Method page makes this its core argument: "check every resource, at every level, for utilization, saturation, and errors - then you know."

"Our throughput went down, so latency must have gone up." Not necessarily. Throughput can drop because of lower arrival rate (the upstream got slow) while latency of the requests that do arrive is perfectly healthy. You must read them as independent signals that sometimes happen to correlate.

"Average latency is fine to alert on." The Google SRE book is explicit: mean latency hides tail behavior. If you run a service with mean = 100ms at 1,000 req/s, 1% of requests can easily take 5s each, and "the 99th percentile of one backend can easily become the median response of your frontend" in a multi-service page load. The next concept covers this in depth; do not alert on means.
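Both halves of that claim can be checked with a short sketch. The first part builds a hypothetical latency sample whose mean is about 100 ms while 1% of requests take 5 s. The second part computes how many backend calls a page needs before hitting at least one slow call becomes the typical case. The sample data, the helper name, and the assumption that backend calls are independent are all illustrative.

```python
# Hypothetical sample: 990 requests at 50 ms and 10 requests (1%) at 5,000 ms.
latencies_ms = [50] * 990 + [5000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
slow_fraction = sum(1 for ms in latencies_ms if ms >= 5000) / len(latencies_ms)


def p_any_backend_slow(backends, slow_probability=0.01):
    """Chance that at least one of `backends` independent calls lands in the slow tail."""
    return 1 - (1 - slow_probability) ** backends


# Smallest fan-out at which a page load with a slow backend call is the median case.
fan_out = 1
while p_any_backend_slow(fan_out) < 0.5:
    fan_out += 1

print(mean_ms, slow_fraction, fan_out)
```

Under the independence assumption, a page that fans out to a few dozen backend calls already sees one backend's 99th-percentile latency on roughly half of its loads, even though every backend's mean looks healthy.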

"Utilization above 80% is efficient." See the queuing-theory concept later in this module. Past ~80% utilization, wait time curves up steeply for any system with variable service times. "Efficient" on the billing report, "cliff" on the latency graph.
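The shape of that curve can be previewed with the simplest textbook queueing model. The sketch below assumes an M/M/1 queue (one server, Poisson arrivals, exponentially distributed service times), where mean time in system is service time divided by (1 - utilization). That model is an assumption chosen for illustration, not a claim about any particular service, and the function name is hypothetical.

```python
def mm1_time_in_system(service_time_s, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("M/M/1 is only stable for 0 <= utilization < 1")
    return service_time_s / (1 - utilization)


# 50 ms of service time at increasing utilization levels.
for rho in (0.5, 0.8, 0.9, 0.95):
    print(f"{rho:.0%}", round(mm1_time_in_system(0.050, rho) * 1000), "ms")
```

In this model, moving from 50% to 80% utilization multiplies time in system by 2.5, and moving from 80% to 95% multiplies it by another factor of four, with service time unchanged throughout.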

How To Use It

For every service you operate:

Pick which framing to use. RED for services with clear request-response semantics is the default. USE for the underlying resources (CPU, memory, disk, network, connection pools, thread pools). Golden Signals when you want one dashboard that covers both services and resources at a glance.
For each signal, write down what metric, what unit, what threshold, and what alert it drives. If you cannot fill in all four columns, the dashboard is not done.
Alert on symptoms (latency SLO burn, error rate, saturation), not on causes (CPU utilization alone). The SRE book's "symptoms vs causes" distinction is the single most important alerting principle: a page must represent a real user problem, not a metric that happens to be high.
Combine black-box and white-box monitoring. Black-box (synthetic probes from outside) tells you what users experience. White-box (internal counters) tells you why. The SRE book recommends "heavy use of white-box monitoring with modest but critical uses of black-box."
Keep alert rules "simple, predictable, and reliable" - Google's own guideline, born from experience that complex dependency-aware alerting is fragile.
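The "four columns" rule in the steps above lends itself to a mechanical check. The sketch below is one possible way to encode it: the SignalSpec record, the missing_columns helper, and the example rows (including the metric descriptions and alert names) are hypothetical and not taken from any real dashboard.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SignalSpec:
    signal: str                       # e.g. "RED: Duration"
    metric: Optional[str] = None      # what is measured
    unit: Optional[str] = None        # what unit it is in
    threshold: Optional[str] = None   # what value counts as bad
    alert: Optional[str] = None       # what alert it drives


def missing_columns(spec):
    """Names of the columns that are still empty for this signal."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) in (None, "")]


# Hypothetical rows: the first is complete, the second is not done yet.
complete = SignalSpec("RED: Duration", "p99 request latency", "ms",
                      "> 500 for 5 min", "page on-call")
incomplete = SignalSpec("USE: Saturation", "thread-pool queue depth", "requests")

for spec in (complete, incomplete):
    print(spec.signal, "missing:", missing_columns(spec))
```

Run against a whole dashboard definition, a check like this turns "the dashboard is not done" from a judgment call into a list of empty cells.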

Check Yourself

A service's CPU is at 40% and its latency p99 doubled overnight. USE says: what resources must you check before concluding anything?
Why does RED drop Saturation and use only Rate, Errors, Duration - and why is that often fine for a stateless HTTP service?
"Average latency" is not in USE, RED, or Golden Signals. Why?
Give a concrete example where utilization is low but saturation is high. What bug class produces this pattern?
Why should a page fire on user-visible symptoms rather than on a high CPU alert, even when high CPU is often the proximate cause?

Mini Drill or Application

Pick a service you have used (your own, or a well-documented open-source one). For each framing (USE, RED, Golden Signals), fill in the full table:

Signal	Metric	Unit	Typical healthy value	Alerting threshold

Now pressure-test it. For each row, ask: "If this metric went bad, would a user notice?" If the answer is "no" or "not directly," you have a cause metric, not a symptom metric - move it off the alert path and into the investigation dashboard. If there is any cell you cannot fill in, you do not yet understand what that framing expects of you.
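The pressure test is a partition of the table rows by one yes/no answer, which can be sketched as follows. The rows and the answers given for them are hypothetical placeholders; in the drill you supply your own.

```python
# Each row: (signal, metric, would a user notice directly if it went bad?)
rows = [
    ("RED: Duration", "p99 request latency", True),
    ("RED: Errors", "ratio of failed requests", True),
    ("USE: Utilization", "CPU busy fraction", False),
    ("USE: Utilization", "disk busy fraction", False),
]


def route(rows):
    """Split rows into the alert path (symptoms) and the investigation dashboard (causes)."""
    alert_path = [metric for _, metric, user_notices in rows if user_notices]
    investigation = [metric for _, metric, user_notices in rows if not user_notices]
    return alert_path, investigation


alert_path, investigation = route(rows)
print("alert on:", alert_path)
print("investigate with:", investigation)
```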

Transfer / Where This Shows Up Later

These three quantities (and the three framings that cover them) are the measurement substrate every later reliability concept assumes. You cannot reason about any of the following without them:

This module, concept 02 (percentile latency): latency-as-distribution is what "Duration" in RED and "Latency" in Golden Signals actually mean. Means are out; histograms are in.
This module, concepts 10-11 (Little's Law, load shedding): lambda and utilization show up directly in the queueing formulas; alerting on saturation is how you detect the cliff before it tips.
This module, concept 07 (SLOs): the SLI is a ratio of "good events" to "valid events" - which is just a carefully bounded slice of RED's Rate, Errors, and Duration.
S8 M5 (technical leadership): when you present reliability trade-offs to a product org, you present them in these quantities. "We can ship faster if we accept a 200ms increase in p99" is the only sentence a VP trusts.
S9 M5 (cloud security + observability): metrics/logs/traces pipelines emit exactly the USE/RED/Golden-Signals data; choosing your pillar budget is choosing which framings you can cover.
S10 M3-M4 (capstone deploy + operational review): an operational-readiness review asks one question in four ways: can you see USE, RED, and Golden Signals for every tier, and do pages fire on symptoms rather than causes?
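The SLI ratio mentioned for concept 07 in the list above can be sketched in a few lines. The Event record, the 300 ms threshold, the rule that health checks are not valid events, and the sample data are all assumptions chosen for illustration; a real SLO document defines its own "good" and "valid".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    status: int
    duration_ms: float
    is_health_check: bool = False


def sli(events, latency_threshold_ms):
    """Ratio of good events to valid events."""
    # Assumption: health checks are excluded from the valid set.
    valid = [e for e in events if not e.is_health_check]
    # Assumption: good means not a server error and within the latency threshold.
    good = [e for e in valid
            if e.status < 500 and e.duration_ms <= latency_threshold_ms]
    return len(good) / len(valid)


# Hypothetical window: 980 fast successes, 10 slow successes, 10 fast failures,
# plus 50 health checks that do not count either way.
events = (
    [Event(200, 120)] * 980
    + [Event(200, 900)] * 10
    + [Event(500, 30)] * 10
    + [Event(200, 1, is_health_check=True)] * 50
)
measured_sli = sli(events, latency_threshold_ms=300)
print(measured_sli)
```

The denominator is a slice of Rate, and the numerator filters that slice by Errors and Duration. The result does not depend on how many health checks were recorded.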

If any later concept feels abstract, trace it back to "which of latency, throughput, utilization, saturation is this about?" The answer is almost always one of them.

One final calibration: at S8 scale, you are often the person who picks which quantity gets alerted, which gets graphed, and which gets ignored. The USE/RED/Golden-Signals framings give you three vocabularies to justify those picks to people who are not in the room with the incident. Treat them as negotiation tools, not just engineering tools.

Read This Only If Stuck

Local chunks (book anchors)

System Design Primer: Performance vs Scalability -- the two-paragraph distinction between "faster" and "bigger" that every dashboard must preserve.
System Design Primer: Latency vs Throughput -- the canonical short-form definition; the "latency-optimized vs throughput-optimized" trade-off appears in every queue and pipeline design.
System Design Primer: Powers of Two and Latency Numbers -- Jeff Dean's "numbers every programmer should know" table. Keep a copy of this in your head before quoting any ms/us number.
FoSA: Architecture Characteristics Defined -- performance, scalability, and elasticity as distinct "-ilities" each measured differently.
FoSA: Measuring Architecture Characteristics -- why you instrument before the architecture review, not after.
FoSA: Fitness Functions -- the continuous version of "did we hold the number"; USE/RED/Golden-Signals alerts are fitness functions for runtime properties.

External canonical references

Brendan Gregg, The USE Method and USE Method: Linux Performance Checklist -- the one-page method and the full resource-by-resource checklist that covers 80% of server issues.
Google SRE Book, Monitoring Distributed Systems -- the chapter that introduced the Four Golden Signals; read the "symptoms vs causes" and "worrying about your tail" subsections in full.
Tom Wilkie / Grafana, The RED Method -- the original rationale for Rate/Errors/Duration as the minimum viable service dashboard.
Marc Brooker (AWS Principal Engineer), Metastability and Distributed Systems -- why utilization-based alerting misses the moment a healthy system becomes permanently unhealthy.
AWS Builders' Library, Instrumenting distributed systems for operational visibility -- Amazon's internal instrumentation doctrine; pairs every RED metric with a per-request log line so you can drill down.
