Distributed Tracing and Correlation IDs

What This Concept Is

In a microservices system, a single user request can touch 5-30 services before the response returns. Distributed tracing is the observability discipline that lets you reconstruct one request's path through all those services as a single coherent trace.

Two building blocks:

Correlation ID (or trace ID). A unique ID generated at the edge (gateway or BFF), propagated in headers to every downstream call, and stamped into every log line. All events carrying the same correlation ID belong to the same request.
Spans. Each service operation that handles or emits a request records a span: trace_id, span_id, parent_span_id, start and end timestamps, service name, operation name, and key attributes (status code, error, etc.). Spans form a parent-child tree that reconstructs the request's topology and timing.
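
As a rough illustration of how the span fields listed above combine into a tree, here is a minimal Python sketch. The Span class, build_tree, and render are hypothetical names invented for this example, not part of any tracing SDK; a real backend stores many more attributes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_tree(spans):
    """Group spans by parent_span_id; the key None holds the root span(s)."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Walk the tree depth-first and return one indented line per span."""
    lines = []
    for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
        lines.append(f"{'  ' * depth}{span.service}.{span.operation} {span.duration_ms}ms")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Three illustrative spans sharing one trace_id (IDs shortened for readability).
spans = [
    Span("t1", "c", "b", "orders", "createOrder", 125, 1790),
    Span("t1", "a", None, "gateway", "POST /checkout", 0, 1820),
    Span("t1", "b", "a", "bff", "checkout", 20, 1800),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

The spans arrive in arbitrary order, as they would from independent services; parent_span_id alone is enough to recover the nesting.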

Today's standard is OpenTelemetry (OTel): a vendor-neutral set of APIs, SDKs, and conventions for emitting traces, metrics, and logs. Most mature tracing backends (Jaeger, Tempo, Honeycomb, Datadog, New Relic, AWS X-Ray) can ingest OTel data.

Why It Matters Here

Without tracing, production troubleshooting degenerates to grepping logs across multiple services and trying to correlate by timestamp. That does not work when p99 latency lives in a specific hop three services deep. With tracing, you open a slow request and see which span owns the latency.

Tracing is also the most direct way to see the resilience primitives (concept 12) acting on an individual request: retries showing up, timeouts firing, circuit breakers opening.

Concrete Example: A Trace Waterfall

A POST /checkout that took 1.8s, reconstructed from the trace backend:

trace_id=f3a1...  (POST /checkout, total 1820ms)
├── gateway                          [  0ms .. 1820ms ]  1820ms
│   ├── authn                        [  5ms ..  18ms ]    13ms
│   └── bff.mobile                   [ 20ms .. 1800ms]  1780ms
│       ├── accounts.getProfile      [ 25ms ..  75ms ]    50ms   (span 1)
│       ├── cart.getCart             [ 25ms .. 120ms ]    95ms   (span 2, parallel)
│       ├── orders.createOrder       [125ms .. 1790ms]  1665ms   (span 3)
│       │   ├── payments.authorize   [130ms .. 280ms ]   150ms   (span 3.1)
│       │   ├── inventory.reserve    [280ms .. 1780ms]  1500ms   (span 3.2) *** slow ***
│       │   │   └── inventory.db     [285ms .. 1770ms]  1485ms   (span 3.2.1) *** slow ***
│       │   └── outbox.publish       [1780ms.. 1790ms]    10ms   (span 3.3)

Just by reading the waterfall, you see:

total latency is 1820ms
bff.mobile is doing two parallel calls (accounts and cart) correctly
the slow span is inventory.reserve, and inside it inventory.db accounts for almost all the cost
the culprit is a slow DB query on the Inventory service, not a network issue and not the payments service

No amount of log grepping gives you this. One trace does.
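
The reading of the waterfall above can also be done mechanically. The sketch below computes each span's "self time" (its duration minus the time covered by its direct children) from the numbers in the waterfall. The tuple layout and the function names are assumptions made for this example; trace backends expose this differently.

```python
# (name, parent, start_ms, end_ms), copied from the waterfall above
SPANS = [
    ("gateway", None, 0, 1820),
    ("authn", "gateway", 5, 18),
    ("bff.mobile", "gateway", 20, 1800),
    ("accounts.getProfile", "bff.mobile", 25, 75),
    ("cart.getCart", "bff.mobile", 25, 120),
    ("orders.createOrder", "bff.mobile", 125, 1790),
    ("payments.authorize", "orders.createOrder", 130, 280),
    ("inventory.reserve", "orders.createOrder", 280, 1780),
    ("inventory.db", "inventory.reserve", 285, 1770),
    ("outbox.publish", "orders.createOrder", 1780, 1790),
]


def covered_ms(intervals):
    """Total length of the union of (start, end) intervals, so that
    parallel children are not counted twice."""
    total, cursor = 0, None
    for start, end in sorted(intervals):
        if cursor is None or start > cursor:
            total += end - start
            cursor = end
        elif end > cursor:
            total += end - cursor
            cursor = end
    return total


def self_times(spans):
    """Self time = span duration minus time covered by its direct children."""
    result = {}
    for name, _parent, start, end in spans:
        kids = [(s, e) for _n, p, s, e in spans if p == name]
        result[name] = (end - start) - covered_ms(kids)
    return result


times = self_times(SPANS)
culprit = max(times, key=times.get)
print(culprit, times[culprit])
```

On this data the largest self time belongs to inventory.db (1485ms), while inventory.reserve itself contributes only 15ms, which matches the conclusion drawn by eye.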

Header Propagation: The W3C Trace Context Standard

The standard headers that carry trace context across services:

traceparent: 00-<trace_id>-<parent_id>-<flags> -- e.g., 00-f3a1b2c4d5e6f708091a2b3c4d5e6f70-1234567890abcdef-01
tracestate: <vendor-specific extras>
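
To make the traceparent layout concrete, here is a small parsing sketch for version 00 of the header. TraceParent and parse_traceparent are hypothetical helper names; in practice the OpenTelemetry SDK's propagator does this for you, and this sketch deliberately ignores the forward-compatibility rules for future versions.

```python
import re
from typing import NamedTuple

# version-trace_id-parent_id-flags, all lowercase hex (version 00 layout)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str):
    """Return a TraceParent, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version ff and all-zero trace or parent IDs are invalid.
    if (
        parsed.version == "ff"
        or set(parsed.trace_id) == {"0"}
        or set(parsed.parent_id) == {"0"}
    ):
        return None
    return parsed


tp = parse_traceparent("00-f3a1b2c4d5e6f708091a2b3c4d5e6f70-1234567890abcdef-01")
print(tp.trace_id, tp.parent_id, tp.sampled)
```

Returning None for a malformed header lets the caller fall back to starting a fresh trace instead of failing the request.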

Additional conventions: a readable correlation ID like X-Correlation-Id (often = trace_id) that is easy to surface in support UIs and customer-facing error messages.

Every service must:

Accept these headers on incoming requests.
Propagate them on outgoing requests, keeping the trace_id and setting the parent ID to the current span's ID.
Stamp the correlation ID into every log line.

Skipping either of the first two breaks the trace at that hop; skipping the third leaves that service's logs impossible to tie back to the trace.
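
The three obligations above might look like this inside one service. This is a framework-free sketch: incoming_context, outgoing_headers, and log_line are hypothetical helpers, header names are assumed to arrive lowercased, and validation is kept to a bare minimum. A real service would normally rely on OpenTelemetry auto-instrumentation rather than hand-written code.

```python
import secrets


def incoming_context(headers: dict) -> dict:
    """Step 1: accept traceparent, or start a new trace if it is absent."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, parent_id, flags = parts[1], parts[2], parts[3]
    else:
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,       # the caller's span
        "span_id": secrets.token_hex(8),   # this service's own span
        "flags": flags,
        "tracestate": headers.get("tracestate"),
    }


def outgoing_headers(ctx: dict) -> dict:
    """Step 2: same trace_id, but this service's span becomes the parent."""
    headers = {
        "traceparent": f"00-{ctx['trace_id']}-{ctx['span_id']}-{ctx['flags']}",
        "X-Correlation-Id": ctx["trace_id"],
    }
    if ctx["tracestate"]:
        headers["tracestate"] = ctx["tracestate"]
    return headers


def log_line(ctx: dict, message: str) -> str:
    """Step 3: every log line carries the correlation ID."""
    return f"trace_id={ctx['trace_id']} span_id={ctx['span_id']} {message}"


ctx = incoming_context(
    {"traceparent": "00-f3a1b2c4d5e6f708091a2b3c4d5e6f70-1234567890abcdef-01"}
)
out = outgoing_headers(ctx)
print(log_line(ctx, "calling inventory.reserve"))
```

Note that the outgoing traceparent keeps the incoming trace_id but carries a new parent ID, which is what lets the backend nest the downstream span under this one.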

Common Confusion / Misconception

"We have logs, that is enough." Logs are per-service and not linked across services. Without the correlation ID, "the same request" is guesswork.

"Tracing is expensive so we sample heavily." True but nuanced. Tail-based sampling (sample the full trace only if it is slow or errored) gets you the interesting traces at low cost. Head-based sampling is cheaper but misses most interesting traces.

"OpenTelemetry is optional, we will add it later." Instrumentation is much easier on day one than on day 900. Insisting on correlation ID + trace propagation from the first service is much cheaper than retrofitting.

How To Use It

Adopt OpenTelemetry (or the equivalent SDK for your language). Instrument every HTTP client, HTTP server, DB driver, and message bus library.
Generate trace_id at the edge (API gateway or BFF, or the mobile client if it is instrumented).
Propagate traceparent on every outbound call, including async events (put it in the event envelope).
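Stamp correlation ID into every log line (for example %X{trace_id} in Logback or Log4j patterns, which read it from the MDC; other logging libraries have an equivalent context mechanism).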
Ship spans and logs to a backend that supports correlation (Grafana Tempo + Loki, Honeycomb, Datadog, etc.).
In incident response, start every investigation from a trace, not a log.
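
For the log-stamping step, one possible approach in Python's standard logging module is a filter that copies the current trace ID onto every record. The current_trace_id variable, the TraceIdFilter class, and the "checkout" logger name are assumptions made for this sketch; the in-memory buffer stands in for a real log sink.

```python
import contextvars
import io
import logging

# Holds the trace ID of whichever request the current task or thread is handling.
current_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


buffer = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Request-handling code sets the ID once; application code just logs normally.
current_trace_id.set("f3a1b2c4d5e6f708091a2b3c4d5e6f70")
logger.info("reserving inventory")
print(buffer.getvalue(), end="")
```

Because the ID lives in a context variable, application code never has to pass it to each logging call by hand.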

Check Yourself

Why is a correlation ID on its own not enough to visualize a slow request?
What does a span's parent_span_id buy you that a flat list of events does not?
What gets dropped if one service in the middle fails to propagate traceparent?

Mini Drill or Application

Take the e-commerce POST /checkout flow. In 15 minutes:

Draw the trace waterfall you would expect.
Annotate which span would carry the p99 latency if inventory is the slow service.
List the three headers that must propagate, and the one place each service must stamp them.

How This Sits In The Module

Tracing makes the rest of this cluster observable: you can tell if circuit breakers are tripping, how often retries fire, whether the BFF fan-out is parallel, and which consumer of an event is slow. Concept 14 (deployment independence) needs tracing to verify that new versions have not regressed.

The Three Observability Signals: Traces, Metrics, Logs

Traces are one of three signals, and they are most powerful when the other two are linked to them:

Signal	What it answers	Cost	Source of truth for
Metrics	"How many requests per second? What's the error rate?"	Cheap, aggregate	Rates, distributions, SLOs
Logs	"What exactly happened in request X?"	Medium, verbose	Detailed per-event history
Traces	"Where did the time go in request X?"	Medium; tail-sample for cost	Cross-service causality
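
The "tail-sample for cost" note in the table refers to the head-based versus tail-based sampling trade-off. A minimal sketch of the two decisions follows; head_sample, tail_sample, the 1000ms threshold, and the span dictionaries are all assumptions chosen for illustration, not the behavior of any particular collector.

```python
import zlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide up front from the trace ID alone, before any span
    has finished, so the decision cannot depend on latency or errors."""
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < rate * 10_000


def tail_sample(spans: list, slow_ms: int = 1000) -> bool:
    """Tail-based: decide after the trace completes, keeping it only if it
    was slow or contained an error."""
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    has_error = any(s.get("error") for s in spans)
    return has_error or duration >= slow_ms


fast_ok = [{"start_ms": 0, "end_ms": 80}]
slow = [{"start_ms": 0, "end_ms": 1820}, {"start_ms": 280, "end_ms": 1780}]
errored = [{"start_ms": 0, "end_ms": 40, "error": True}]

print(tail_sample(fast_ok), tail_sample(slow), tail_sample(errored))
```

The tail-based decision needs every span of the trace buffered until it completes, which is where its extra operational cost comes from.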

Linking them: every log line should carry the trace_id of the request that produced it, and metrics should point to representative traces (for example through exemplars) rather than carry trace_id as a label, which would explode cardinality. This is the core insight of "observability-driven development" (Charity Majors): correlate across signals to ask questions you did not pre-plan. OpenTelemetry's Logs specification formalizes log-to-trace linking.

Async Tracing: Propagating Through Events

Trace propagation through async events is the most frequently broken piece of instrumentation. The pattern:

Producer, before publishing, extracts the current span context and places traceparent into the event envelope (not the domain payload -- the envelope).
Consumer, on receipt, extracts traceparent from the envelope and creates a new span with the extracted context as the parent (or as a span link, the OpenTelemetry counterpart of OpenTracing's "follows-from").
The trace backend shows the producer's span and the consumer's span linked, even across hours of latency.

Without this, tracing stops at the event boundary. With it, you can see an event fired at 09:00 and consumed at 09:02 as one causal chain. CloudEvents has a distributed-tracing extension that standardizes this for its envelope; Kafka headers and SQS message attributes are where it lives in practice.
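
The producer and consumer steps above can be sketched with a plain dictionary as the event envelope. The publish and consume functions and the "headers"/"payload" envelope shape are hypothetical; with Kafka or SQS the same traceparent string would travel in message headers or message attributes instead.

```python
import secrets


def publish(domain_payload: dict, trace_id: str, producer_span_id: str) -> dict:
    """Producer side: trace context goes in the envelope, not the payload."""
    return {
        "headers": {"traceparent": f"00-{trace_id}-{producer_span_id}-01"},
        "payload": domain_payload,
    }


def consume(envelope: dict) -> dict:
    """Consumer side: start a new span whose parent is the producer's span."""
    _version, trace_id, parent_id, _flags = envelope["headers"]["traceparent"].split("-")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "span_id": secrets.token_hex(8),
        "operation": "consume " + envelope["payload"]["type"],
    }


envelope = publish(
    {"type": "OrderCreated", "order_id": "o-1"},
    trace_id="f3a1b2c4d5e6f708091a2b3c4d5e6f70",
    producer_span_id="1234567890abcdef",
)
consumer_span = consume(envelope)
print(consumer_span["trace_id"], consumer_span["parent_span_id"])
```

Keeping traceparent out of the domain payload means consumers that ignore tracing are unaffected, and the payload schema does not change when instrumentation does.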

Read This Only If Stuck

Local chunks

FoSA: Operations / DevOps -- the operational frame for observability practice.
FoSA: Measuring Architecture Characteristics -- observability is how you measure runtime characteristics.
FoSA: Fitness Functions -- tracing data feeds continuous fitness checks (p99 latency, error rate budgets).
FoSA: Engineering Practices -- tracing instrumentation is a day-one engineering practice for microservices.
Primer: Availability Patterns -- tracing reveals whether availability primitives (cluster 4) are tripping correctly.
Primer: Latency vs Throughput -- the vocabulary you will use when reading traces.

External canonical references

OpenTelemetry, Concepts: Signals, Traces -- canonical definitions.
W3C, Trace Context -- the traceparent header format.
Chris Richardson, Distributed Tracing pattern -- the pattern-catalog summary.
Benjamin Sigelman et al., Dapper: A Large-Scale Distributed Systems Tracing Infrastructure -- the 2010 Google paper that started the field.
Charity Majors, Observability ≠ Monitoring -- the vocabulary distinction this concept assumes.
Honeycomb, Distributed tracing documentation -- practical introduction with modern UI.
Grafana, Tempo documentation -- open-source tracing backend aligned with OTel.
Google SRE Book, Monitoring Distributed Systems -- the foundational chapter on what to instrument.
CloudEvents, Distributed Tracing extension -- async propagation standard.

Depth Path

Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) -- modern take on traces + high-cardinality events. Return in S8 M4 where SLI/SLO meets observability.
Cindy Sridharan, Distributed Systems Observability (free O'Reilly report) -- the short-form alternative.
