Tracing the Critical Path End-to-End
What This Concept Is
A distributed trace is a tree of timed operations (spans), all sharing one trace ID, that together describe a single user request as it crosses processes, services, queues, and dependencies. The trace is the only observability signal that answers "where did the time go?"
Tracing in practice means three decisions:
- what to instrument -- which entry points, which hops, which internal boundaries
- what to name -- span names should be low-cardinality verbs (GET /webhook, db.insert events, kafka.publish), not one-off strings with IDs baked in; OpenTelemetry calls this the "span name" convention and requires that it be a low-cardinality representation of the operation.
- what to attribute -- attributes on spans carry context (user, tenant, route, size); high-cardinality items go here, following OpenTelemetry semantic conventions (http.route, messaging.destination.name, db.system).
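The name/attribute split can be sketched in a few lines. This is an illustration only: the ID pattern and the name_and_attributes helper are assumptions made for the example, and a real framework integration would normally hand you the route template directly.

```python
import re
from typing import Dict, Tuple

# Hypothetical ID pattern for this sketch: "u-123" style user IDs and plain integers.
ID_SEGMENT = re.compile(r"^(u-\d+|\d+)$")


def name_and_attributes(method: str, path: str) -> Tuple[str, Dict[str, str]]:
    """Return a low-cardinality span name plus attributes that carry the IDs."""
    attributes: Dict[str, str] = {}
    template = []
    for segment in path.strip("/").split("/"):
        if ID_SEGMENT.match(segment):
            if segment.startswith("u-"):
                attributes["user.id"] = segment  # high-cardinality value lives here
            template.append("{id}")
        else:
            template.append(segment)
    route = "/" + "/".join(template)
    attributes["http.route"] = route
    return f"{method} {route}", attributes


name, attrs = name_and_attributes("GET", "/users/u-123/events")
print(name)   # GET /users/{id}/events
print(attrs)
```

Every user produces the same span name, so the backend can aggregate by name, while user.id remains available for filtering.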
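A subtle point often missed: OpenTelemetry supports both probabilistic head-based sampling and tail-based sampling, and a well-run pipeline combines them. Head sampling at the SDK is cheap but blind: it decides before the outcome of the request is known. Tail sampling in the collector decides after the trace completes, so it can keep every trace that errored or exceeded a latency threshold, provided those spans were exported to the collector in the first place. That is how you retain the traces that matter without keeping them all.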
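To make "cheap but blind" concrete, here is a minimal sketch of a ratio decision derived from the trace ID alone. It follows the general idea behind TraceIdRatioBased (compare part of the trace ID against a threshold) but is not the SDK's code; the function names and the sample IDs are assumptions for illustration.

```python
# Sketch of a trace-ID ratio decision. Not the OpenTelemetry SDK itself.
TRACE_ID_LOW_64_MASK = (1 << 64) - 1


def ratio_bound(ratio: float) -> int:
    """Map a sampling ratio in [0, 1] to a threshold on the low 64 bits."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return round(ratio * (1 << 64))


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall under the bound."""
    trace_id = int(trace_id_hex, 16)
    return (trace_id & TRACE_ID_LOW_64_MASK) < ratio_bound(ratio)


# Hypothetical trace IDs (32 hex characters each).
low_id = "4bf92f3577b34da6" + "0000000000000001"
high_id = "4bf92f3577b34da6" + "ffffffffffffffff"

print(head_sample(low_id, 0.02))   # True
print(head_sample(high_id, 0.02))  # False
```

Because the decision depends only on the trace ID, every service that applies the same function to the same ID reaches the same answer. It also shows the blindness: nothing about status codes or latency enters the decision.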
For a capstone, the critical path is the one user journey your SLO is built on. Trace that journey, fully. Do not try to trace everything. A partial trace of the SLO path is far more useful than a complete trace of the health-check endpoint.
Traces also unify your signals: logs and metrics correlated by trace_id become cross-service breadcrumbs, and OpenTelemetry exemplars attach a trace ID to a specific metric point so that "the 99th percentile went up" becomes "the 99th percentile went up and here is the trace that caused it."
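Log correlation can be sketched with the standard library alone. The example below assumes the active trace ID is held in a context variable named current_trace_id (a hypothetical stand-in; a real tracer keeps this in its own context) and stamps it onto every log record.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a real tracer manages this itself.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Return a logger whose lines carry a trace_id field."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("capstone.webhook")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("signature verified")
print(buffer.getvalue(), end="")
```

With the field present on every line, a log search for one trace_id returns the breadcrumbs from every service that handled the request.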
Why It Matters Here (In the Capstone)
Traces are the only observability signal that answers "where did the time go?" The dashboard tells you p95 is 800 ms; only the trace tells you 600 ms of it was spent in an external API call, or in a serial loop that should have been parallel.
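A serial loop typically shows up in a trace as many sibling spans with the same name under one parent. The sketch below groups siblings by name to surface that pattern; the span names, the (name, duration) tuple shape, and the min_count threshold are assumptions for illustration.

```python
from typing import Dict, List, Tuple

SpanRecord = Tuple[str, float]  # (span name, duration in ms)


def repeated_span_totals(siblings: List[SpanRecord]) -> Dict[str, Tuple[int, float]]:
    """Group sibling spans by name and return (count, total_ms) per name."""
    totals: Dict[str, Tuple[int, float]] = {}
    for name, duration_ms in siblings:
        count, total_ms = totals.get(name, (0, 0.0))
        totals[name] = (count + 1, total_ms + duration_ms)
    return totals


def serial_loop_suspects(siblings: List[SpanRecord], min_count: int = 10) -> List[str]:
    """Names that repeat at least min_count times under the same parent."""
    totals = repeated_span_totals(siblings)
    return [name for name, (count, _) in totals.items() if count >= min_count]


# Hypothetical children of one parent span: a 1 ms lookup repeated 50 times.
children = [("cache.get", 1.0)] * 50 + [("db.insert events", 14.0)]
print(repeated_span_totals(children))
print(serial_loop_suspects(children))  # ['cache.get']
```

Each individual span looks harmless at 1 ms; the grouped total of 50 ms is what the dashboard's p95 cannot show.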
Runbook entries that say "open the trace for trace_id = <foo>" are worth ten entries that say "check the logs in service X, then service Y, then..." Concept 13 runbooks assume you have this capability. The "top 10 slow traces" panel in concept 5 assumes these traces are retained -- which depends on the sampling policy below.
Concrete Example(s) -- from a real capstone
Critical path for the webhook-handler capstone: POST /webhook -> verify signature -> rate-limit check -> publish to queue -> queue consumer -> write to DB -> notify downstream.
A well-formed trace looks like this (indentation = parent/child):
trace_id = t-a8f3...
span POST /webhook 350 ms service=api attrs: http.route=/webhook, provider=stripe
  span verify_signature 3 ms service=api attrs: algo=hmac-sha256
  span rate_limit.check 12 ms service=api attrs: bucket=stripe, allowed=true
  span kafka.publish 8 ms service=api attrs: messaging.destination.name=events.incoming, partition=3
  (async, same trace via traceparent propagation)
  span consumer.handle events 290 ms service=worker attrs: messaging.destination.name=events.incoming
    span db.insert events 14 ms service=worker attrs: db.system=postgresql, db.sql.table=events
    span http.post downstream api 260 ms service=worker attrs: server.address=notify.svc, http.response.status_code=200
Four lessons from this trace:
- The API endpoint is not slow. Everything synchronous in /webhook completed in ~25 ms.
- The total experience is slow because the downstream notification takes 260 ms.
- consumer.handle events is the place to optimize, not verify_signature.
- The attributes let you filter: in the dashboard, traces where provider=stripe could be slow while other providers are fine.
Without the trace, you would see a 350-ms p95 on one graph, a high consumer latency on another, and have to guess which caused which. With the trace, it is one screen.
Sampling policy (what to actually ship):
head-based : TraceIdRatioBased(0.02) # 2% of all traces at the SDK
tail-based : keep if
  - root span returned 4xx or 5xx, OR
  - any span's duration exceeded latency SLI threshold (300ms here), OR
  - an exception was recorded on any span
  else drop
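This combination stores roughly 2-5% of all traces while keeping every failed or slow trace that reaches the collector. One caveat on ordering: a tail sampler only sees what the SDK exports. If the 2% cut is applied at the SDK, the tail rules can only rescue failures inside that 2%. To retain 100% of failed or slow traces, export every critical-path trace to the collector and apply the 2% there as a probabilistic baseline alongside the tail rules. At capstone traffic that is a few hundred kept traces per day -- affordable in any OTel backend, and exactly the traces you actually want during incidents.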
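The three tail rules can be read as a single decision function applied once all spans of a trace have arrived. The sketch below is an illustration in plain Python, not collector configuration (a collector's tail sampler is configured declaratively); the Span dataclass and keep_trace are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

LATENCY_THRESHOLD_MS = 300  # latency SLI threshold from the policy above


@dataclass
class Span:
    name: str
    duration_ms: float
    status_code: int = 0         # HTTP status on the root span, 0 if not HTTP
    has_exception: bool = False
    is_root: bool = False


def keep_trace(spans: List[Span]) -> bool:
    """Tail decision: keep when the finished trace errored or ran slow."""
    for span in spans:
        if span.is_root and span.status_code >= 400:
            return True  # root span returned 4xx or 5xx
        if span.duration_ms > LATENCY_THRESHOLD_MS:
            return True  # some span exceeded the latency threshold
        if span.has_exception:
            return True  # an exception was recorded
    return False         # else drop


slow = [Span("POST /webhook", 350, status_code=200, is_root=True),
        Span("http.post downstream api", 260)]
fast = [Span("POST /webhook", 40, status_code=200, is_root=True)]
failed = [Span("POST /webhook", 20, status_code=500, is_root=True)]

print(keep_trace(slow), keep_trace(fast), keep_trace(failed))  # True False True
```

Note that the function needs the whole trace as input, which is why this decision cannot be made at the SDK when the request starts.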
Common Confusion / Misconceptions
"We already have logs -- traces are redundant." Logs answer "what happened and why?" Traces answer "where did the time go and in what order?" They are complementary. In a trace, parent-child relationships and durations are the point; logs cannot express those cleanly.
"Sample every trace." Do not. At any real rate, 100% sampling is expensive, and most of what you would store is fast, successful, near-identical requests. Probabilistic sampling is the answer, and the OpenTelemetry probability-sampling spec exists precisely so that SDKs and collectors across services make consistent keep/drop decisions for the same trace.
"Span names with the user ID in them are easier to search." They are also a cardinality explosion. Span names should be low-cardinality (db.insert events). Put the user ID on an attribute (user.id = "u-123") and use the attribute for filtering.
"Tracing means OpenTelemetry everywhere right now." It means instrument the critical path first. You do not need global coverage on day one. Entry point, every outgoing call on the critical path, and every async boundary -- that is the minimum.
"We'll sample 100% in dev and change nothing in prod." Dev and prod have different distributions; the tail in prod is where the interesting traces live, and a 100% dev sample rate teaches you nothing about what tail sampling will look like. Match your prod sampling policy in a staging environment of reasonable load at least once before go-live.
"Spans under 1ms are not worth keeping." Often true for averages; dangerous for tail analysis. Aggregate trivial spans into a single internal span if you must, but never drop them outright -- the one-millisecond span that happens 50 times in a serial loop is exactly the bug a trace is built to find.
How To Use It (In Your Capstone)
- Name the one critical path for your capstone (the SLO path).
- Instrument both ends: the entry point (HTTP, webhook, cron) and every outbound call (DB, HTTP, queue) on that path. Use OpenTelemetry auto-instrumentation for your framework where available -- hand-written spans only where the framework cannot see the boundary.
- Propagate traceparent across every hop (W3C TraceContext headers for HTTP, message attributes for queues).
- Use OpenTelemetry semantic conventions for span names and attributes (http.*, db.*, messaging.*) instead of inventing your own.
- Set a sampling policy: head-sample at 1-5% (TraceIdRatioBased) and add a tail sampler in the collector that keeps all errored or slow traces it receives. A tail sampler cannot recover traces the SDK already dropped, so decide deliberately where the percentage cut is applied.
- Pick one recent real trace per week. Walk it in your tracing UI. If any significant hop is missing, add instrumentation. Repeat until the trace explains a full request end to end.
- Link trace URLs into runbooks and log lines (the same trace_id field from concept 4) so "the trace" is a click, not a search.
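The propagation step in the list above can be sketched with the standard library. The example builds and parses a version-00 traceparent value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) carried in a message's headers. The inject and extract helpers are hypothetical names; in practice an OpenTelemetry propagator does this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# Version 00 only: 00-<trace-id>-<parent-id>-<flags>
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool) -> None:
    """Write the current span's identity into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def new_span_id() -> str:
    """Random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


# Producer side (api): attach context to the queue message.
message_headers: Dict[str, str] = {}
inject(message_headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)

# Consumer side (worker): continue the same trace with a new child span.
trace_id, parent_id, sampled = extract(message_headers)
child_span_id = new_span_id()
print(trace_id, parent_id, sampled)
```

The consumer reuses the trace ID and records the producer's span ID as its parent, which is what lets the async hop appear in the same trace.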
A 30-minute "trace walk" you should run weekly:
- Pick a random trace from the last hour that is at or near your p95 latency line.
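- Read the spans top-to-bottom. For each parent, sum the durations of its direct children and verify the sum roughly equals the parent's duration (a shortfall means uninstrumented work; a sum larger than the parent means children ran in parallel).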
- Identify the single slowest child span. Ask: is that hop's slowness explained by its own attributes (e.g., a large payload, a cache miss) or external (a slow dependency)?
- File one follow-up ticket -- either an instrumentation gap or an optimisation opportunity. This cadence is how tracing coverage improves without a "tracing project."
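The arithmetic of a trace walk can be sketched as follows, assuming the children of a span run one after another with no overlap. The walk function and the tuple shape are hypothetical; the numbers are the worker spans from the example trace earlier in this note.

```python
from typing import List, Tuple

ChildSpan = Tuple[str, float]  # (name, duration in ms)


def walk(parent_ms: float, children: List[ChildSpan]) -> Tuple[float, str, float]:
    """Return (uninstrumented_gap_ms, slowest_child_name, slowest_share_of_parent)."""
    covered = sum(duration for _, duration in children)
    gap = parent_ms - covered  # time inside the parent that no child explains
    name, duration = max(children, key=lambda child: child[1])
    return gap, name, duration / parent_ms


# consumer.handle events (290 ms) and its two children from the example trace.
gap_ms, slowest, share = walk(290.0, [
    ("db.insert events", 14.0),
    ("http.post downstream api", 260.0),
])
print(gap_ms, slowest, round(share, 2))  # 16.0 http.post downstream api 0.9
```

A small gap (16 ms here) suggests coverage is adequate; a large one is the instrumentation ticket, and a child with a dominant share is the optimisation ticket.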
See also (integrative)
- S6 M05 -- causality, partial ordering, and cross-process correlation: the theoretical backbone of what traces make concrete: ../../../../semester-06-databases-distributed/module-05-distributed-systems-fundamentals/concepts/cluster-02-time-clocks-and-ordering/05-lamport-clocks-and-happens-before-primary.md.
- S6 M05 -- the "slow vs dead" asynchrony problem: why traces of timed-out RPCs look ambiguous: ../../../../semester-06-databases-distributed/module-05-distributed-systems-fundamentals/concepts/cluster-01-the-inescapable-reality/03-asynchrony-and-the-impossibility-of-distinguishing-slow-from-dead-primary.md.
- S9 M05 Cluster 4 -- distributed tracing / OpenTelemetry / sampling in a cloud pipeline: where collectors and backends were set up: ../../../../semester-09-cloud-devops/module-05-cloud-security-observability/concepts/cluster-04-observability-pillars-in-cloud/12-distributed-tracing-opentelemetry-sampling-primary.md.
- S5 M05 -- TCP handshake / TLS: the low-level boundaries that show up as network spans in a trace: ../../../../semester-05-os-networking/module-05-network-protocols-sockets/concepts/cluster-03-tcp-and-udp/09-the-tcp-handshake-and-state-machine-primary.md.
- OpenTelemetry -- Concepts -- spans, context propagation, and the W3C TraceContext header format.
- OpenTelemetry -- Semantic conventions (traces) -- stable http.*, db.*, messaging.* attribute names that span storage backends understand.
- OpenTelemetry -- TraceState probability sampling -- how coherent sampling decisions are propagated across services so partial traces do not appear.
Check Yourself
- Why should a span name be low-cardinality but span attributes can be high-cardinality?
- What is the argument for always keeping traces of 5xx responses, regardless of your sample rate?
- How does a trace help you decide whether a p95 regression is your code or a downstream dependency?
- What does W3C TraceContext (the traceparent header) guarantee that a random correlation ID would not?
- Why does the sampling policy combine head-based AND tail-based sampling instead of using one or the other?
- If you could only instrument one span for your capstone today, which would it be and why?
Mini Drill or Application (Capstone-scoped)
- In 45 minutes: instrument one endpoint in your capstone with OpenTelemetry (or the native tracer for your stack).
- Ensure at least three downstream calls show up as child spans with OpenTelemetry-compliant attribute names.
- Force an error (e.g., point one dependency at a bad host) and confirm the error trace is retained by the tail sampler.
- Reproduce one real slow request and walk the trace until you can identify the hop that consumed > 50% of the latency.
- Write one paragraph in library/raw/tracing.md describing: which path is instrumented, the sampling policy, and how a runbook should link to a trace (<tracing-ui>/trace/<trace_id>).
Source Backbone
Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.
- Building Secure and Reliable Systems - primary security and reliability backbone.
- Software Engineering at Google - operational process and engineering discipline.
- Designing Distributed Systems - service and reliability pattern support.