Metrics: Cardinality, Exemplars, and USE/RED in Cloud-Native

What This Concept Is

A metric is a numeric time series: a name, a set of labels, and a stream of (timestamp, value) samples. In cloud-native stacks (Prometheus, OpenTelemetry metrics, managed services) metrics are the cheapest, fastest-to-query signal. They are also the easiest to abuse.

Three ideas do most of the work:

  • Cardinality -- the number of unique label-value combinations for a metric. A metric http_requests_total{method, status} with 4 methods and 5 statuses has cardinality 20. A metric http_requests_total{method, status, user_id} with 1M users has cardinality ~20M. Each unique combination is its own time series in the database, with its own storage, memory, and compute cost.
  • Exemplars -- a sample attached to a metric that points to one concrete trace ID or event for that bucket. Exemplars are the bridge from "aggregated metric" to "specific request": the latency histogram tells you p99 is bad, and the exemplar hands you the trace of one actual slow request.
  • USE and RED -- two small vocabularies for which metrics to emit:
    • USE (Brendan Gregg): for resources -- Utilization, Saturation, Errors
    • RED (Tom Wilkie): for request-driven services -- Rate, Errors, Duration
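
The cardinality arithmetic in the first bullet can be checked in a few lines. This is a minimal Python sketch; the function name series_count is illustrative, and the label counts are the example figures from the text, not measurements from a real system.

```python
from math import prod

def series_count(label_value_counts: dict[str, int]) -> int:
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_value_counts.values())

# Figures from the text: 4 methods, 5 statuses, then 1M users on top.
low = series_count({"method": 4, "status": 5})
high = series_count({"method": 4, "status": 5, "user_id": 1_000_000})
print(low, high)  # 20 20000000
```

The product is an upper bound: a real database only stores the combinations that actually occur, but unbounded labels such as user_id tend to push the observed count toward that bound.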

Prometheus's own instrumentation and naming docs lean on these ideas: label carefully, avoid high cardinality, name metrics by what they measure, use base units, and choose the right metric type (counter, gauge, histogram).

Why It Matters Here

Metrics are where observability budgets get burned or saved. A single careless label turns a $0.01 metric into a $10,000/month metric and takes down your time-series database in the process. Grafana's own guidance on cardinality spikes calls this out: high cardinality can lead to "increased resource consumption, potential memory errors, and higher operational costs".

Cardinality discipline is also what makes metrics useful. A dashboard that sums 20M time series at query time is slow and meaningless; a dashboard that sums 20 well-chosen series answers a real question in under a second.

USE and RED give you a default set of metrics per component shape so you do not have to re-invent what to emit each time.

Concrete Example

An HTTP service serving an API.

Bad metrics:

http_requests_total{method, path, status, user_id, customer_id, session_id, request_id}

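This metric explodes cardinality on five axes: path (instantiated URLs such as /users/1234), user_id, customer_id, session_id, and request_id. request_id alone makes cardinality unbounded, because every request creates a new series. The metric database either falls over or, where series limits are enforced, starts dropping data.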

Good metrics (RED for the service):

http_requests_total{method, route, status_class}
# route is the pattern "/users/:id", not the instantiated "/users/1234"
# status_class is "2xx" / "4xx" / "5xx", not every individual status code

http_request_duration_seconds_bucket{method, route, le}
# histogram with a fixed set of buckets, same low-cardinality labels

http_inflight_requests{route}
# gauge, one per route
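
One way to produce the route and status_class label values above is to normalize them before they reach the metric. The sketch below is an assumption-laden illustration: route_label and status_class are hypothetical helper names, and the regex-based rewrite is only a fallback for when the web framework does not expose the matched route template (which is the more reliable source when available).

```python
import re

# Assumption: path segments that are all digits or UUID-shaped are identifiers.
_ID_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
)

def route_label(path: str) -> str:
    """Collapse instantiated paths like /users/1234 into the pattern /users/:id."""
    segments = path.split("/")
    return "/".join(":id" if _ID_SEGMENT.fullmatch(s) else s for s in segments)

def status_class(status_code: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"

print(route_label("/users/1234"), status_class(503))
```

Both helpers map an unbounded input space (every URL, every status code) onto a small fixed set of label values, which is what keeps the series count predictable.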

Good metrics (USE for the process/host):

process_cpu_seconds_total  (utilization source)
process_open_fds / process_max_fds (saturation)
process_resident_memory_bytes (utilization)

Exemplar use: on the http_request_duration_seconds_bucket metric, attach the trace ID of a request to the bucket its latency falls into. The exemplars that matter most are the ones in the slow buckets, where the p99 lives. On a dashboard, clicking that outlier jumps to the distributed trace from Concept 12.

PromQL queries derived from RED:

# Request rate per route (R)
sum by (route) (rate(http_requests_total[5m]))

# Error rate as a fraction of requests (E)
sum by (route) (rate(http_requests_total{status_class="5xx"}[5m]))
/
sum by (route) (rate(http_requests_total[5m]))

# p99 latency per route (D), computed from the histogram
histogram_quantile(0.99,
sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))

# Error budget burn rate over 1h, relative to a 99.9% SLO
# (errors are 5xx responses, matching the error-rate query above)
(
sum(rate(http_requests_total{status_class="5xx"}[1h]))
/
sum(rate(http_requests_total[1h]))
)
/ (1 - 0.999)

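That last query answers one of the most useful questions in SRE practice: "how fast are we burning our error budget right now?" A value of 1 means the budget is being spent exactly as fast as the SLO allows; anything above 1 exhausts it before the SLO window ends. It goes straight into the alerting rules in Concept 14.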
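
The burn-rate arithmetic can be reproduced outside PromQL to sanity-check an alert threshold. This is a minimal sketch; the function name burn_rate and the request counts are made up for illustration.

```python
def burn_rate(error_count: float, total_count: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total_count == 0:
        return 0.0                      # no traffic, no budget spent
    error_ratio = error_count / total_count
    allowed_ratio = 1.0 - slo           # e.g. 0.001 for a 99.9% SLO
    return error_ratio / allowed_ratio

# Hypothetical hour: 50 failed requests out of 10,000 against a 99.9% SLO.
rate_1h = burn_rate(50, 10_000, 0.999)
print(round(rate_1h, 3))
```

In this made-up hour the observed error ratio is 0.5% against an allowance of 0.1%, so the budget is being consumed about five times faster than the SLO permits.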

Common Confusion / Misconception

"More labels = more information." The opposite is usually true. Past a few well-chosen labels, each new one multiplies the number of series, and with it the storage cost, and reduces the number of queries that can actually run in time. A user ID label turns the metric into a (bad) log. Grafana's cardinality post is explicit: high cardinality leads to "increased resource consumption, potential memory errors, and higher operational costs".

"Counter vs gauge is cosmetic." Counters only go up (except when they reset to zero on restart); gauges go up and down. Modeling "current number of active connections" as a counter is wrong: every decrease looks like a counter reset, so rate() reports increases that never happened. The reverse mistake, applying rate() to a gauge, is equally meaningless. The right metric type is decided by semantics, not preference.
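
The reset handling that makes counters work, and gauges misbehave under rate(), can be sketched in a few lines. This is a simplified model, not Prometheus's implementation (the real rate() also extrapolates to the window edges and divides by the window length); counter_increase and the sample values are illustrative.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop is read as "process restarted, counter began again at 0",
            # so the entire current value counts as new increase.
            total += cur
    return total

requests_total = [100, 130, 5, 25]        # real counter with one restart
active_connections = [10, 12, 9, 11]      # gauge wrongly treated as a counter

print(counter_increase(requests_total))      # 30 + 5 + 20
print(counter_increase(active_connections))  # 2 + 9 + 2, although the net change is +1
```

For the real counter the result is the true number of requests served across the restart. For the gauge, the dip from 12 to 9 is misread as a reset, and the function reports 13 new "connections" when the value only moved from 10 to 11.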

"Histograms capture every value." They do not. They are pre-binned. You choose buckets once, for what you actually want to observe (e.g. 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s for HTTP). Changing buckets later is painful. Make your SLO threshold one of the bucket boundaries so that "how many requests were under the threshold?" can be read directly from a bucket instead of estimated.

"histogram_quantile gives the real p99." It gives the interpolated p99 from the buckets you chose. If your SLO is 300 ms and the nearest bucket boundaries are 250 ms and 500 ms, any reported p99 that falls between them is a linear guess; the true value could be anywhere in that range. Choose buckets that bracket the SLO closely, or use native histograms where supported.
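
The interpolation can be made visible with a simplified re-implementation. This sketch follows the documented behavior of classic-histogram quantile estimation (linear interpolation inside the bucket, lower bound of the first bucket taken as 0, last finite bound returned for the +Inf bucket) but it is not Prometheus's code; the bucket counts are invented.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count), sorted by bound, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound          # cannot interpolate into +Inf
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly across the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical counts: 950 requests <= 250 ms, 995 <= 500 ms, 1000 in total.
buckets = [(0.25, 950), (0.5, 995), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)
print(round(p99, 3))
```

With these counts the estimate lands at roughly 0.47 s, but that number comes entirely from the even-spread assumption. If all 45 requests in the 250 to 500 ms bucket actually took 260 ms, a 300 ms SLO would be met while the reported p99 suggests otherwise.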

"USE and RED are a cap." They are defaults, not a cap. They tell you what you should always have; you still add domain-specific metrics (queue depth, batch size, cache hit ratio, inventory reservation lag, payment provider success). USE covers resources; RED covers request-driven services; Google SRE's "Four Golden Signals" (latency, traffic, errors, saturation) overlap with both and are roughly RED plus saturation.

"Prometheus and OTel are interchangeable." They overlap but are not identical. OpenTelemetry metrics add a choice of aggregation temporality (delta or cumulative), a different exemplar model, and richer attribute semantics. When bridging, be explicit about conversion rules -- especially for counters that reset between exports.
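
The temporality difference comes down to whether each exported point carries the change since the last export (delta) or the running total (cumulative). The sketch below shows the two conversions in their simplest form. It is an illustration only: real exporters and collectors also track start timestamps and per-series state, and the function names here are hypothetical.

```python
from itertools import accumulate

def delta_to_cumulative(deltas: list[float]) -> list[float]:
    """Running total of delta points, as a Prometheus-style counter expects."""
    return list(accumulate(deltas))

def cumulative_to_delta(points: list[float]) -> list[float]:
    """Differences between cumulative points; a drop is treated as a reset."""
    deltas = []
    for prev, cur in zip(points, points[1:]):
        # After a reset the counter restarted from 0, so the delta is the new value.
        deltas.append(cur - prev if cur >= prev else cur)
    return deltas

print(delta_to_cumulative([5, 7, 2]))
print(cumulative_to_delta([10, 25, 3, 8]))
```

The reset branch is where bridges tend to go wrong: without it, the drop from 25 to 3 would be exported as a delta of -22.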

How To Use It

For any service you instrument:

  1. Decide whether it is request-driven (RED) or resource-driven (USE) -- or both.
  2. Pick 3-8 metrics that answer "is this healthy?" Not more.
  3. For each metric, list the labels. For each label, ask: "is this bounded in cardinality, and does someone actually query by it?" If either answer is no, remove the label.
  4. For histograms, pick buckets aligned with your SLO thresholds (see Concept 14).
  5. Wire exemplars so that latency outliers on the histogram link to a trace.
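
Steps 3 and 4 are mechanical enough to write down as checks. The following is a hypothetical review helper, not part of any tool; the label metadata (bounded, queried) is something a team would fill in by hand, and the bucket list reuses the example HTTP buckets from the histogram discussion.

```python
def audit_labels(labels: dict[str, dict[str, bool]]) -> list[str]:
    """Keep a label only if it is bounded AND someone actually queries by it."""
    return [
        name for name, info in labels.items()
        if info["bounded"] and info["queried"]
    ]

def slo_is_bucket_edge(slo_seconds: float, buckets: list[float]) -> bool:
    """True if the SLO threshold coincides with one of the bucket boundaries."""
    return any(abs(b - slo_seconds) < 1e-9 for b in buckets)

candidate_labels = {
    "method":   {"bounded": True,  "queried": True},
    "route":    {"bounded": True,  "queried": True},
    "user_id":  {"bounded": False, "queried": True},   # unbounded: drop
    "build_id": {"bounded": True,  "queried": False},  # never queried: drop
}
http_buckets = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

print(audit_labels(candidate_labels))
print(slo_is_bucket_edge(0.3, http_buckets))
print(slo_is_bucket_edge(0.3, sorted(http_buckets + [0.3])))
```

With the example buckets, a 300 ms SLO does not sit on a boundary until a 0.3 bucket is added, which is the adjustment step 4 asks for.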

Check Yourself

  1. Why does adding a user_id label make a metric less useful, not more?
  2. What question does RED answer that USE does not, and vice versa?
  3. What does an exemplar give you that a raw histogram bucket count does not?

Mini Drill or Application

For a service you have touched, write down:

  • the 5 metrics you would keep if forced to choose (name, type, labels)
  • the one metric you currently have that should have a label removed
  • one SLO (e.g. "99% of requests under 300ms over 28 days") and the histogram buckets that make measuring it cheap

Source Backbone

Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.