Observability Three Pillars: Metrics, Logs, Traces
What This Concept Is
Observability is the property of a system that lets you ask questions about its behavior from its outputs, without deploying new code. Monitoring answers the questions you anticipated; observability answers the questions you did not.
The three pillars are three different data shapes, each cheap at some tasks and expensive at others.
- Metrics: numerical time series (counters, gauges, histograms). Cheap to store, cheap to query, great at "is the system healthy right now" and "how has this number trended?" Weak at explaining why a number moved.
- Logs: timestamped event records, usually structured (JSON) with per-event fields. Great at "what specifically happened to this request." Expensive at scale (storage, cardinality) and slow to query across large volumes.
- Traces: request-scoped records that connect spans across services into a single view of one request's journey. Great at "where did the 2-second latency go" in a distributed system. Expensive to sample fully; most systems sample 1-10%.
Together they answer different questions. Metrics tell you there is a problem. Traces tell you where. Logs tell you what.
Why It Matters Here
Debugging at scale without all three is a slow, expensive guess. Debugging with only one is worse:
- Only metrics: you know the p99 doubled. You do not know which endpoint, which customer, which downstream.
- Only logs: you can read one user's full story. You cannot tell if the pattern is systemic without expensive aggregation.
- Only traces: you can see where one request spent time. You cannot tell if the issue is constant or transient without rates.
Cindy Sridharan's writing popularized the "three pillars" framing. More recent practice (and the OpenTelemetry project) emphasizes that the three overlap: metrics can be derived from a trace's spans; a log can carry a trace ID; a metric can be derived from a log stream. The shape of the data is less important than the questions you can afford to ask.
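As one illustration of that overlap, a request counter can be folded out of a structured log stream. This is a minimal sketch, not a description of any particular tool: the log lines, field names (endpoint, status, trace_id), and the helper count_by_endpoint_and_status are all hypothetical.

```python
import json
from collections import Counter

# Hypothetical structured log lines; the field names are assumptions for illustration.
LOG_LINES = [
    '{"ts": "2024-01-01T14:07:01Z", "endpoint": "/checkout", "status": 200, "trace_id": "a1"}',
    '{"ts": "2024-01-01T14:07:02Z", "endpoint": "/checkout", "status": 503, "trace_id": "a2"}',
    '{"ts": "2024-01-01T14:07:02Z", "endpoint": "/cart", "status": 200, "trace_id": "a3"}',
    '{"ts": "2024-01-01T14:07:03Z", "endpoint": "/checkout", "status": 503, "trace_id": "a4"}',
]


def count_by_endpoint_and_status(lines):
    """Fold a log stream into a counter keyed by (endpoint, status)."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[(event["endpoint"], event["status"])] += 1
    return counts


request_counts = count_by_endpoint_and_status(LOG_LINES)
```

The derived counter keeps only the bounded fields (endpoint, status) and drops the per-request trace_id, which is what makes it cheap to store as a metric.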
Concrete Example
A team discovers p99 latency has doubled on the checkout service. The investigation path uses all three pillars:
- Metric (dashboard): "checkout p99 went from 300ms to 600ms at 14:07 UTC." This tells them when and what is different.
- Trace (trace system, filtered to slow checkouts at 14:07+): "most slow traces show the fraud-check span consuming 500ms out of 600ms." This tells them where the time went.
- Log (structured log, filtered by trace ID on a slow checkout): "fraud-check returned 503; the service retried twice with a 200ms backoff each." This tells them why - the fraud-check was returning errors and the retry logic added latency to every request.
Without metrics, they would not have known to look. Without traces, they would have inspected the checkout code and missed the real problem. Without logs, they would have known where but not what the fraud-check was actually returning. All three pillars, 10 minutes to root cause.
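The trace step of that investigation can be sketched as a small aggregation: filter to slow traces, then ask which span accounts for the largest share of their time. The trace records below are made up, the helper dominant_span is hypothetical, and the sketch assumes spans within a trace do not overlap (real traces are nested, so a real query would work on self-time).

```python
from collections import defaultdict

# Hypothetical trace summaries: span name -> duration in milliseconds.
TRACES = [
    {"trace_id": "t1", "total_ms": 610, "spans": {"fraud-check": 500, "db-write": 60, "render": 50}},
    {"trace_id": "t2", "total_ms": 590, "spans": {"fraud-check": 490, "db-write": 55, "render": 45}},
    {"trace_id": "t3", "total_ms": 280, "spans": {"fraud-check": 150, "db-write": 80, "render": 50}},
]


def dominant_span(traces, slow_threshold_ms):
    """Return (span name, share of span time) for the span that dominates slow traces.

    Returns None when no trace meets the threshold.
    """
    totals = defaultdict(int)
    for trace in traces:
        if trace["total_ms"] < slow_threshold_ms:
            continue  # only slow traces are relevant to the latency question
        for name, ms in trace["spans"].items():
            totals[name] += ms
    if not totals:
        return None
    name = max(totals, key=totals.get)
    return name, totals[name] / sum(totals.values())


result = dominant_span(TRACES, slow_threshold_ms=500)
```

With this sample data the fraud-check span dominates the slow traces, which is the signal that would send the team to the logs for that span.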
Common Confusion / Misconception
"Observability is just logging." Logging is one pillar. An observability platform that only aggregates logs cannot answer the questions the other two pillars answer cheaply. You will readily find out what happened to one user, and never find out what is happening across all users.
"We should log everything." Logs are expensive at scale. High-cardinality fields (user IDs, request IDs, URLs) explode index size and cost. The right answer: structured logs for errors and unexpected events, sampled traces for typical requests, and metrics for everything you need continuous visibility on.
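The cardinality cost is easiest to see on the metrics side, where each distinct combination of label values becomes its own time series. A rough upper bound is the product of the distinct values per label. The label counts below are invented for illustration, and series_count is a hypothetical helper.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Assumed sizes: 20 endpoints, 5 status classes, one million users.
bounded = series_count({"endpoint": 20, "status": 5})
unbounded = series_count({"endpoint": 20, "status": 5, "user_id": 1_000_000})
```

Adding a single per-user field multiplies the bound by the number of users, which is why such fields belong on individual events rather than on anything that is indexed or aggregated per value.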
"Traces are only useful after an outage." Traces also inform capacity planning (where does time go?), schema migrations (which consumers touch this field?), and deploy reviews (did the new version change span latencies?). A sampled trace pipeline is an ongoing asset, not an emergency tool.
How To Use It
For each service:
- Metrics: emit the RED signals (Rate, Errors, Duration): counters for rate and errors, a histogram for duration. Aggregate by endpoint and status. Keep cardinality bounded (do not label by user ID - that is a log field).
- Traces: propagate a trace context header (W3C Trace Context, OpenTelemetry) across every service hop. Sample at a rate you can afford (~1% is common; sample 100% of errors). Record span attributes for DB calls, external RPCs, and queue hops.
- Logs: structured JSON, one field set per event shape. Include trace ID on every log line - that is the thread that stitches logs to traces to metrics.
- Know what each pillar is for. When a metric alerts, jump to traces in the same time window. From the slow trace, jump to logs on that trace ID. From logs, jump back to metrics to ask "how often?"
- Measure from outside. Observability inside the service misses gray failures at the edge. Synthetic probes from real user geographies catch what internal metrics cannot.
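The trace and log items above can be sketched with the standard library alone. This is an illustration under stated assumptions, not how any specific SDK works: the function names and the keep_per_10k parameter are hypothetical, and only the traceparent layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) comes from the W3C Trace Context format. Keeping every error trace assumes the decision is made after the request outcome is known.

```python
import json
import secrets


def new_traceparent(sampled: bool) -> str:
    """Build a W3C Trace Context traceparent value: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def trace_id_from(traceparent: str) -> str:
    """Extract the trace ID from an incoming traceparent header value."""
    _version, trace_id, _parent_id, _flags = traceparent.split("-")
    return trace_id


def keep_trace(trace_id: str, is_error: bool, keep_per_10k: int = 100) -> bool:
    """Keep every error trace; keep roughly keep_per_10k out of 10,000 of the rest.

    Deciding from the trace ID itself lets every service reach the same verdict.
    The default of 100 corresponds to about 1%.
    """
    if is_error:
        return True
    return int(trace_id, 16) % 10_000 < keep_per_10k


def log_event(message: str, trace_id: str, **fields) -> str:
    """Render one structured JSON log line that carries the trace ID."""
    return json.dumps({"message": message, "trace_id": trace_id, **fields}, sort_keys=True)


header = new_traceparent(sampled=True)
trace_id = trace_id_from(header)
line = log_event("fraud-check returned 503", trace_id, status=503, retries=2)
```

Because the same trace_id appears in the header, the sampling decision, and the log line, a slow trace can be joined to its logs with a single filter.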
Check Yourself
- Which pillar is cheapest to store? Which is most flexible in ad-hoc queries? Which answers "why did this specific request fail"?
- Why is putting every user ID into a metric label a cardinality disaster?
- What does a trace ID do when written into log lines?
Mini Drill or Application
For a service you know, write one instrumentation plan with three sections (metrics / traces / logs). In each, name: one primary signal, its expected cardinality, its storage cost concern, and the question it is meant to answer. Leave no section blank.
Read This Only If Stuck
- Fundamentals of Software Architecture: Cross-Cutting Architecture Characteristics
- Fundamentals of Software Architecture: Measuring Architecture Characteristics
- Google SRE Book: Monitoring Distributed Systems
- Cindy Sridharan: "Monitoring and Observability" (blog essay)
- OpenTelemetry: What is Observability?