Tracing the Critical Path End-to-End

What This Concept Is

A distributed trace is a tree of timed operations (spans), all sharing one trace ID, that together describe a single user request as it crosses processes, services, queues, and dependencies. The trace is the only observability signal that answers "where did the time go?"

Tracing in practice means three decisions:

what to instrument -- which entry points, which hops, which internal boundaries
what to name -- span names should be low-cardinality operation names (POST /webhook, db.insert events, kafka.publish), not one-off strings with IDs baked in; OpenTelemetry requires that the span name be a low-cardinality representation of the operation.
what to attribute -- attributes on spans carry context (user, tenant, route, size); high-cardinality items go here, following OpenTelemetry semantic conventions (http.route, messaging.destination.name, db.system).
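The name/attribute split can be sketched in a few lines. This is an illustration only: route_template and describe_request are hypothetical helpers, not part of any OpenTelemetry SDK, and the ID-matching pattern is an assumption about what IDs look like in your URLs.

```python
import re
from typing import Dict, Tuple

# Assumption: IDs are either all digits or look like "u-123".
_ID_SEGMENT = re.compile(r"^(\d+|[a-z]{1,3}-\d+)$")


def route_template(path: str) -> str:
    """Replace ID-looking path segments with a placeholder."""
    parts = ["{id}" if _ID_SEGMENT.match(p) else p for p in path.split("/")]
    return "/".join(parts)


def describe_request(method: str, path: str, user_id: str) -> Tuple[str, Dict[str, str]]:
    """Return a low-cardinality span name plus high-cardinality attributes."""
    route = route_template(path)
    name = f"{method} {route}"          # few distinct values across all requests
    attributes = {
        "http.request.method": method,
        "http.route": route,
        "url.path": path,               # the raw path, IDs included
        "user.id": user_id,             # filter on this, never name by it
    }
    return name, attributes


name, attrs = describe_request("GET", "/users/123/orders", "u-123")
print(name)
print(attrs["url.path"], attrs["user.id"])
```

The span name stays the same for every user, so the tracing backend can group by it; the raw path and user ID remain searchable as attributes.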

A subtle point often missed: OpenTelemetry has probabilistic sampling (head-based) and tail-based sampling, and a well-run pipeline combines them. Head sampling at the SDK is cheap but blind; tail sampling in the collector keeps every trace that errored or exceeded a latency threshold, so you retain the traces that matter without keeping them all.

For a capstone, the critical path is the one user journey your SLO is built on. Trace that journey, fully. Do not try to trace everything. A partial trace of the SLO path is far more useful than a complete trace of the health-check endpoint.

Traces also unify your signals: logs and metrics correlated by trace_id become cross-service breadcrumbs, and OpenTelemetry exemplars attach a trace ID to a specific metric point so that "the 99th percentile went up" becomes "the 99th percentile went up and here is the trace that caused it."
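Log correlation by trace_id can be sketched with the standard library alone. In a real service the trace ID would come from the active span context supplied by your tracing SDK; here a context variable named current_trace_id stands in for it, and the logger name and log format are assumptions.

```python
import contextvars
import io
import logging

# Stand-in for "the trace ID of the active span" (assumption for this sketch).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("capstone.tracing-demo")  # hypothetical name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("signature verified")
print(buffer.getvalue(), end="")
```

Every line logged while handling the request now carries the same trace_id, so a log search and a trace lookup land on the same request.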

Why It Matters Here (In the Capstone)

The dashboard tells you p95 is 800 ms; only the trace tells you that 600 ms of it was spent in an external API call, or in a serial loop that should have been parallel.

Runbook entries that say "open the trace for trace_id = <foo>" are worth ten entries that say "check the logs in service X, then service Y, then..." Concept 13 runbooks assume you have this capability. The "top 10 slow traces" panel in concept 5 assumes these traces are retained -- which depends on the sampling policy below.

Concrete Example(s) -- from a real capstone

Critical path for the webhook-handler capstone: POST /webhook -> verify signature -> rate-limit check -> publish to queue -> queue consumer -> write to DB -> notify downstream.

A well-formed trace looks like this (indentation = parent/child):

trace_id = t-a8f3...
span  POST /webhook                 350 ms   service=api     attrs: http.route=/webhook, provider=stripe
  span  verify_signature              3 ms   service=api     attrs: algo=hmac-sha256
  span  rate_limit.check             12 ms   service=api     attrs: bucket=stripe, allowed=true
  span  kafka.publish                 8 ms   service=api     attrs: messaging.destination.name=events.incoming, partition=3
  (async, same trace via traceparent propagation)
  span  consumer.handle events      290 ms   service=worker  attrs: messaging.destination.name=events.incoming
    span  db.insert events           14 ms   service=worker  attrs: db.system=postgresql, db.sql.table=events
    span  http.post downstream api  260 ms   service=worker  attrs: server.address=notify.svc, http.response.status_code=200

Four lessons from this trace:

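The API endpoint is not slow. The synchronous work in /webhook (verify_signature, rate_limit.check, kafka.publish) adds up to ~23 ms; the 350 ms on the root line reads as end-to-end time, including queue wait and the async consumer.
The total experience is slow because the downstream notification takes 260 ms.
The place to optimize is the downstream call inside consumer.handle events, not verify_signature.
The attributes let you filter: grouping traces by provider in the dashboard can show that provider=stripe is slow while other providers are fine.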

Without the trace, you would see a 350-ms p95 on one graph, a high consumer latency on another, and have to guess which caused which. With the trace, it is one screen.
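The same reading can be done mechanically. The sketch below rebuilds the example trace as a small tree (the Span class is a hypothetical stand-in, not a tracing-library type) and asks two questions: how much synchronous work did the API do, and which leaf span took the most time?

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    name: str
    duration_ms: float
    service: str
    children: List["Span"] = field(default_factory=list)


def leaf_spans(span: Span) -> Iterator[Span]:
    """Yield spans with no children, i.e. where time is actually spent."""
    if not span.children:
        yield span
        return
    for child in span.children:
        yield from leaf_spans(child)


# Durations copied from the example trace above.
trace = Span("POST /webhook", 350, "api", [
    Span("verify_signature", 3, "api"),
    Span("rate_limit.check", 12, "api"),
    Span("kafka.publish", 8, "api"),
    Span("consumer.handle events", 290, "worker", [
        Span("db.insert events", 14, "worker"),
        Span("http.post downstream api", 260, "worker"),
    ]),
])

sync_api_ms = sum(c.duration_ms for c in trace.children if c.service == "api")
slowest = max(leaf_spans(trace), key=lambda s: s.duration_ms)
share_pct = round(100 * slowest.duration_ms / trace.duration_ms)

print(f"synchronous API work: {sync_api_ms} ms")
print(f"slowest leaf: {slowest.name} ({share_pct}% of the trace)")
```

A tracing UI does this for you visually; the point of the sketch is that the answer falls out of the parent/child structure and the durations, with no guessing across graphs.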

Sampling policy (what to actually ship):

head-based  : TraceIdRatioBased(0.02)   # 2% of all traces at the SDK
tail-based  : keep if
                 - root span returned 4xx or 5xx, OR
                 - any span's duration exceeded latency SLI threshold (300ms here), OR
                 - an exception was recorded on any span
                 else drop

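As written, the tail sampler only sees the traces the SDK exported, so this combination stores about 2% of all traces and keeps every failed or slow trace within that 2%. To retain 100% of the failed or slow ones, export every trace from the SDK and move the 2% ratio into the collector as the tail sampler's fallback rule; that stores ~2-5% of all traces. At capstone traffic that is a few hundred kept traces per day -- affordable in any OTel backend, and exactly the traces you actually want during incidents.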
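The tail-based rule can be sketched as a pure function over the finished spans of one trace. This is plain Python illustrating the decision, not collector configuration; FinishedSpan and tail_decision are hypothetical names, and the function only ever sees traces that were exported to it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LATENCY_THRESHOLD_MS = 300  # the latency SLI threshold from the policy above


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    is_root: bool = False
    http_status: Optional[int] = None
    has_exception: bool = False


def tail_decision(spans: List[FinishedSpan]) -> Tuple[bool, str]:
    """Return (keep, reason) for one complete trace."""
    for span in spans:
        if span.is_root and span.http_status is not None and span.http_status >= 400:
            return True, "root returned 4xx/5xx"
    if any(s.duration_ms > LATENCY_THRESHOLD_MS for s in spans):
        return True, "span exceeded latency threshold"
    if any(s.has_exception for s in spans):
        return True, "exception recorded"
    return False, "no rule matched"


healthy = [FinishedSpan("POST /webhook", 25, is_root=True, http_status=200)]
failed = [FinishedSpan("POST /webhook", 25, is_root=True, http_status=500)]
print(tail_decision(healthy))
print(tail_decision(failed))
```

Returning the reason alongside the decision makes it easy to count, per rule, why traces were kept.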

Common Confusion / Misconceptions

"We already have logs -- traces are redundant." Logs answer "what happened and why?" Traces answer "where did the time go and in what order?" They are complementary. In a trace, parent-child relationships and durations are the point; logs cannot express those cleanly.

"Sample every trace." Do not. At any real rate, 100% sampling is expensive and the long tail is mostly noise. The OpenTelemetry probability-sampling spec exists precisely so collectors across services make consistent keep/drop decisions for the same trace.
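Consistency comes from deriving the decision from the trace ID itself rather than from a random draw per service. The sketch below shows the idea with a threshold on the low 64 bits of the ID; it is similar in spirit to ratio-based samplers, but which bits and which threshold OpenTelemetry uses are defined by its specification, so treat this as an illustration only.

```python
TRACE_ID_LIMIT = 1 << 64  # we compare against the low 64 bits of the trace ID


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic keep/drop: same trace ID and ratio, same answer everywhere."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    trace_id = int(trace_id_hex, 16)
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# Two services evaluating the same trace reach the same decision.
api_decision = should_sample(trace_id, 0.02)
worker_decision = should_sample(trace_id, 0.02)
print(api_decision == worker_decision)
```

Because the decision is a function of the ID, a trace is never half-kept: either every participant samples it or none does, provided they use the same ratio.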

"Span names with the user ID in them are easier to search." They are also a cardinality explosion. Span names should be low-cardinality (db.insert events). Put the user ID on an attribute (user.id = "u-123") and use the attribute for filtering.

"Tracing means OpenTelemetry everywhere right now." It means instrument the critical path first. You do not need global coverage on day one. Entry point, every outgoing call on the critical path, and every async boundary -- that is the minimum.

"We'll sample 100% in dev and change nothing in prod." Dev and prod have different distributions; the tail in prod is where the interesting traces live, and a 100% dev sample rate teaches you nothing about what tail sampling will look like. Run your prod sampling policy in a staging environment under realistic load at least once before go-live.

"Spans under 1ms are not worth keeping." Often true for averages; dangerous for tail analysis. Aggregate trivial spans into a single internal span if you must, but never drop them outright -- the one-millisecond span that happens 50 times in a serial loop is exactly the bug a trace is built to find.
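Folding trivial spans rather than dropping them can be sketched as follows. fold_short_spans is a hypothetical helper, the 1 ms threshold is the one from the paragraph above, and cache.get is an invented span name used only for the demonstration.

```python
from typing import Dict, List, Tuple


def fold_short_spans(
    siblings: List[Tuple[str, float]], threshold_ms: float = 1.0
) -> List[Dict]:
    """Merge sibling spans at or under the threshold into one entry per name.

    The count and total duration are preserved, so a serial loop of tiny
    calls still shows up as one visibly expensive entry.
    """
    folded: Dict[str, Dict] = {}
    result: List[Dict] = []
    for name, duration_ms in siblings:
        if duration_ms > threshold_ms:
            result.append({"name": name, "count": 1, "total_ms": duration_ms})
            continue
        if name not in folded:
            folded[name] = {"name": name, "count": 0, "total_ms": 0.0}
            result.append(folded[name])
        folded[name]["count"] += 1
        folded[name]["total_ms"] += duration_ms
    return result


# Fifty 1 ms calls in a serial loop, followed by one ordinary span.
siblings = [("cache.get", 1.0)] * 50 + [("db.insert events", 14.0)]
for entry in fold_short_spans(siblings):
    print(entry)
```

The folded entry costs one span's worth of storage, yet it still reveals that the loop outweighs the database write.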

How To Use It (In Your Capstone)

Name the one critical path for your capstone (the SLO path).
Instrument both ends: the entry point (HTTP, webhook, cron) and every outbound call (DB, HTTP, queue) on that path. Use OpenTelemetry auto-instrumentation for your framework where available -- hand-written spans only where the framework cannot see the boundary.
Propagate traceparent across every hop (W3C TraceContext headers for HTTP, message attributes for queues).
Use OpenTelemetry semantic conventions for span names and attributes (http.*, db.*, messaging.*) instead of inventing your own.
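Set a sampling policy: head-sample at 1-5% (TraceIdRatioBased) and add a tail sampler in the collector that keeps all errored or slow traces. The tail sampler can only keep what the SDK exports, so if every errored trace must survive, apply the percentage in the collector rather than in the SDK.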
Pick one recent real trace per week. Walk it in your tracing UI. If any significant hop is missing, add instrumentation. Repeat until the trace explains a full request end to end.
Link trace URLs into runbooks and log lines (the same trace_id field from concept 4) so "the trace" is a click, not a search.
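Propagation across a hop comes down to reading the incoming traceparent value and writing a new one with the same trace ID and a fresh span ID. Auto-instrumentation normally does this for you; the sketch below only shows the shape of the header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). It accepts version 00 only, and parse_traceparent and child_traceparent are hypothetical helper names.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent value; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid; this sketch ignores versions other than 00.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the value to send on the next hop: same trace, new span ID."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
# For a queue, the same value travels as a message header or attribute.
message_headers = {"traceparent": child_traceparent(ctx)}
print(message_headers)
```

The consumer parses the header from the message exactly as the API parsed it from the HTTP request, which is what keeps the async spans in the same trace.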

A 30-minute "trace walk" you should run weekly:

Pick a random trace from the last hour that is at or near your p95 latency line.
Read the spans top-to-bottom. At each level, sum the direct children's durations. Verify they roughly equal the parent span's duration (gaps mean uninstrumented work).
Identify the single slowest child span. Ask: is that hop's slowness explained by its own attributes (e.g., a large payload, a cache miss) or external (a slow dependency)?
File one follow-up ticket -- either an instrumentation gap or an optimisation opportunity. This cadence is how tracing coverage improves without a "tracing project."
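The gap check in the walk is simple arithmetic. A minimal sketch, assuming the child spans run one after another (overlapping children would make the sum exceed the parent); coverage_gap is a hypothetical helper and the 10% tolerance is an arbitrary choice for illustration.

```python
from typing import Iterable, Tuple


def coverage_gap(parent_ms: float, child_ms: Iterable[float]) -> Tuple[float, float]:
    """Return (gap in ms, gap as % of the parent) for sequential children."""
    if parent_ms <= 0:
        raise ValueError("parent duration must be positive")
    gap_ms = max(0.0, parent_ms - sum(child_ms))
    return gap_ms, round(100 * gap_ms / parent_ms, 1)


# consumer.handle events (290 ms) and its two children from the example trace.
gap_ms, gap_pct = coverage_gap(290, [14, 260])
needs_ticket = gap_pct > 10  # hypothetical tolerance for uninstrumented time
print(f"uninstrumented: {gap_ms} ms ({gap_pct}%), ticket needed: {needs_ticket}")
```

A small gap is normal overhead; a large one is the instrumentation ticket for that week.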

Check Yourself

Why should a span name be low-cardinality but span attributes can be high-cardinality?
What is the argument for always keeping traces of 5xx responses, regardless of your sample rate?
How does a trace help you decide whether a p95 regression is your code or a downstream dependency?
What does W3C TraceContext (the traceparent header) guarantee that a random correlation ID would not?
Why does the sampling policy combine head-based AND tail-based sampling instead of using one or the other?
If you could only instrument one span for your capstone today, which would it be and why?

Mini Drill or Application (Capstone-scoped)

In 45 minutes: instrument one endpoint in your capstone with OpenTelemetry (or the native tracer for your stack).
Ensure at least three downstream calls show up as child spans with OpenTelemetry-compliant attribute names.
Force an error (e.g., point one dependency at a bad host) and confirm the error trace is retained by the tail sampler.
Reproduce one real slow request and walk the trace until you can identify the hop that consumed > 50% of the latency.
Write one paragraph in library/raw/tracing.md describing: which path is instrumented, the sampling policy, and how a runbook should link to a trace (<tracing-ui>/trace/<trace_id>).

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.

Building Secure and Reliable Systems - primary security and reliability backbone.
Software Engineering at Google - operational process and engineering discipline.
Designing Distributed Systems - service and reliability pattern support.
