Distributed Tracing: OpenTelemetry and Sampling Strategies

What This Concept Is

A trace is the record of a single request as it moves across the services that handle it. Each unit of work within the trace is a span. A span has a name, start and end timestamps, attributes (key-value metadata), a status, and a parent span ID (empty for the root span). The trace_id ties the whole chain together; each span_id identifies one step.

trace_id = 8c1b...
+-- span: POST /checkout (frontend) [120 ms]
    +-- span: GET /inventory (inventory-svc) [15 ms]
    +-- span: POST /charge (payments-svc) [70 ms]
    |   +-- span: HTTP POST external PSP [60 ms]
    +-- span: POST /orders (orders-svc) [25 ms]
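
The example trace can be modeled with a handful of fields. The following is a minimal sketch in plain Python, not the OTel SDK's data model: the Span class, the short IDs ("t1", "a", "b", ...), and the helper functions are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in the trace
    span_id: str              # unique per span
    parent_id: Optional[str]  # None for the root span
    name: str
    duration_ms: int


# Hypothetical IDs; real trace and span IDs are random hex strings.
spans = [
    Span("t1", "a", None, "POST /checkout", 120),
    Span("t1", "b", "a", "GET /inventory", 15),
    Span("t1", "c", "a", "POST /charge", 70),
    Span("t1", "d", "c", "HTTP POST external PSP", 60),
    Span("t1", "e", "a", "POST /orders", 25),
]


def children_of(parent_id: Optional[str], all_spans: List[Span]) -> List[Span]:
    """Return the spans whose parent is parent_id (None selects the root)."""
    return [s for s in all_spans if s.parent_id == parent_id]


def self_time_ms(span: Span, all_spans: List[Span]) -> int:
    """Time not covered by child spans, assuming the children run sequentially."""
    child_total = sum(c.duration_ms for c in children_of(span.span_id, all_spans))
    return span.duration_ms - child_total


root = children_of(None, spans)[0]
```

With these numbers, self_time_ms(root, spans) is 120 - (15 + 70 + 25) = 10 ms, which is the kind of "where did the time go" answer a trace viewer derives from parent/child links.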

OpenTelemetry (OTel) is the CNCF-incubating open standard for emitting traces (and metrics and logs) in a vendor-neutral way. Its key ideas:

  • Signals: traces, metrics, logs. Same model, same SDKs.
  • Tracer providers and tracers: set up once per process, used to create spans.
  • Context propagation: the trace_id and span_id are propagated across process boundaries (usually via the W3C traceparent HTTP header).
  • Semantic conventions: standardized attribute names (http.request.method, http.response.status_code, db.system, messaging.destination.name) so that dashboards and tools work across services.
  • Exporters and the Collector: spans are shipped out of the process to a Collector that filters, samples, and forwards.
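
To make context propagation concrete, here is a hedged sketch of reading and writing a version-00 W3C traceparent value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). In practice the OTel SDK's propagators do this for you; the names parse_traceparent, format_traceparent, and SpanContext below are illustrative, not the SDK's API, and the sketch only handles the version-00 layout.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the propagated context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Serialize a context as a version-00 traceparent value."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The receiving service keeps the trace_id, treats the incoming span ID as the parent of the span it starts, and sends its own span ID downstream.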

Sampling is how you keep the cost sane. The OTel sampling doc is explicit: sampling reduces data volume while keeping visibility, and there are two major strategies.

  • Head sampling: decide at the root of the trace (e.g. "keep 5% of requests"). Cheap, deterministic, but blind to what happens later.
  • Tail sampling: collect the whole trace, then decide. Keeps all error traces, all slow traces, all traces with a particular attribute; drops boring fast 200s. More expensive but vastly more useful.
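
A ratio-based head sampler can be sketched in a few lines. This is an illustration in the spirit of trace-ID ratio samplers, not the SDK's implementation; the function name head_sample and the choice of the low 64 bits are assumptions made for the example.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below ratio * 2**64.

    The decision depends only on the trace ID, so every service that sees
    the same trace_id reaches the same verdict without coordination.
    """
    bound = int(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < bound


# Two example IDs at a 1% ratio: one near the bottom of the ID space, one at the top.
kept = head_sample("0" * 31 + "1", 0.01)
dropped = head_sample("f" * 32, 0.01)
```

Nothing about the request's outcome enters the decision, which is exactly why head sampling is cheap and also why it cannot favor errors or slow requests.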

Why It Matters Here

In a service mesh of 10+ microservices, metrics tell you something is slow, and logs tell you what happened in one place, but traces tell you where the time went across the whole request path. That is the question on-call actually has at 2 a.m., and without traces it is hours of guesswork.

OTel in particular matters because it lets you instrument once and export to any backend. Context propagation joins the spans from service A and service B into one trace; the semantic conventions are the "common language" that lets you query those spans with the same attribute names even if the services are written in different languages.

Concrete Example: Minimal Instrumented Code Snippet

Python, using the OpenTelemetry API with semantic conventions:

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")

# reserve_inventory, charge_payment, create_order, and PaymentDeclined are
# application code assumed to be defined elsewhere.

def handle_checkout(request):
    with tracer.start_as_current_span("POST /checkout") as span:
        # Standard HTTP attributes from the semantic conventions.
        span.set_attribute(SpanAttributes.HTTP_REQUEST_METHOD, "POST")
        span.set_attribute(SpanAttributes.URL_PATH, "/checkout")
        # Domain-specific attribute under a stable namespace.
        span.set_attribute("checkout.cart_size", len(request.items))
        try:
            reserve_inventory(request.items)
            charge_id = charge_payment(request.user_id, request.total_cents)
            order_id = create_order(request.user_id, charge_id)
            span.set_attribute("checkout.order_id", order_id)
            span.set_attribute(SpanAttributes.HTTP_RESPONSE_STATUS_CODE, 200)
            return {"order_id": order_id}
        except PaymentDeclined:
            span.set_attribute(SpanAttributes.HTTP_RESPONSE_STATUS_CODE, 402)
            span.set_status(Status(StatusCode.ERROR, "payment declined"))
            # No explicit record_exception() call: start_as_current_span
            # records an exception that escapes the block by default
            # (record_exception=True), so calling it here would duplicate the event.
            raise

Notes:

  • SpanAttributes.HTTP_REQUEST_METHOD, URL_PATH, HTTP_RESPONSE_STATUS_CODE are from the OTel HTTP semantic conventions. Any OTel-aware tool already knows how to query them.
  • Custom attributes (checkout.cart_size, checkout.order_id) use a service-scoped prefix.
  • set_status(Status(StatusCode.ERROR, ...)) makes the span show up as an error in traces and in exemplar-linked metrics.
  • Child calls to reserve_inventory, charge_payment, create_order propagate context automatically when using OTel-instrumented HTTP / gRPC / DB clients -- each becomes a child span.

Common Confusion / Misconception

"Tracing means every request gets traced." At scale this is neither necessary nor affordable. The OTel docs are explicit that for high-volume systems a 1% sample rate or lower often accurately represents the other 99%. Sampling is part of the design; the question is which sampler, not whether to sample.

"Custom attribute names are fine." An ad hoc attribute name works for you today and is useless across tools tomorrow. The OTel semantic conventions exist specifically to solve this -- http.request.method, http.response.status_code, db.system, messaging.destination.name. Prefer them. Prefix anything custom with a stable domain namespace (checkout.*, payments.*).
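
One way to enforce this rule is a small check in tests or code review tooling. The sketch below is an assumption-laden illustration: SEMCONV_KEYS lists only the handful of convention names mentioned in this section (the real conventions define many more), and ALLOWED_NAMESPACES and nonconforming_keys are hypothetical names.

```python
from typing import Dict, List

# Only the convention keys named in this section; not the full OTel list.
SEMCONV_KEYS = {
    "http.request.method",
    "http.response.status_code",
    "url.path",
    "db.system",
    "messaging.destination.name",
}

# Hypothetical domain namespaces owned by this team.
ALLOWED_NAMESPACES = ("checkout.", "payments.")


def nonconforming_keys(attributes: Dict[str, object]) -> List[str]:
    """Return attribute keys that are neither convention names nor namespaced."""
    return sorted(
        key
        for key in attributes
        if key not in SEMCONV_KEYS and not key.startswith(ALLOWED_NAMESPACES)
    )


bad = nonconforming_keys(
    {
        "http.request.method": "POST",
        "checkout.cart_size": 3,
        "cartSize": 3,      # ad hoc name, flagged
        "method": "POST",   # duplicates a convention under a private name, flagged
    }
)
```

Here bad contains "cartSize" and "method"; the two conforming keys pass.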

"Tracing replaces metrics." A trace can tell you exactly what happened in one request, but you can only aggregate traces at scale with sampling, and even then histograms and counters are cheaper and faster. Use metrics for "is anything wrong"; use traces for "where is the time going in this one". Exemplars (Concept 10) are the bridge.

"Head sampling alone is fine." If you keep 1% at the root and errors are 0.1%, you will see almost none of them. Tail sampling -- "keep 100% of errors, plus all traces > p95 latency, plus a small sample of the rest" -- is usually the right production default. The trade-off is memory at the Collector: tail samplers must buffer spans until the trace completes.
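
The arithmetic behind that claim, worked through with an assumed volume of one million requests (the volume is illustrative; the 1% and 0.1% rates are the ones in the paragraph above):

```python
from fractions import Fraction

requests = 1_000_000              # assumed request volume, for illustration
error_rate = Fraction(1, 1000)    # 0.1% of requests fail
head_ratio = Fraction(1, 100)     # head sampler keeps 1% of traces

error_traces = requests * error_rate             # failures that occurred
kept_by_head = error_traces * head_ratio         # expected failures that survive head sampling
kept_by_tail = error_traces                      # "keep 100% of errors" policy

print(int(error_traces), int(kept_by_head), int(kept_by_tail))
```

Of 1,000 failing requests, head sampling at 1% is expected to retain about 10, while an error-keeping tail policy retains all 1,000.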

"Context propagation just works." It works when you use OTel-instrumented clients and the default W3C traceparent header. It breaks at: message queues (propagate via headers or payload), background jobs (inject at enqueue, extract at dequeue), lambda/serverless boundaries (cold starts can drop context), cross-organization partner calls (partner may not honor your header). Check each boundary.
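
The "inject at enqueue, extract at dequeue" pattern can be sketched without any tracing library. Everything here is illustrative: the in-memory deque stands in for a real message queue, and enqueue and dequeue_and_start_span are hypothetical helpers. With the OTel SDK you would normally use its propagation API on the message headers instead of building the header by hand.

```python
import secrets
from collections import deque
from typing import Any, Deque, Dict

queue: Deque[Dict[str, Any]] = deque()  # stand-in for a real message broker


def enqueue(payload: Dict[str, Any], trace_id: str, span_id: str) -> None:
    """Producer side: carry the current trace context in message headers."""
    headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    queue.append({"headers": headers, "payload": payload})


def dequeue_and_start_span() -> Dict[str, Any]:
    """Consumer side: continue the producer's trace instead of starting a new one."""
    message = queue.popleft()
    _version, trace_id, parent_span_id, _flags = message["headers"]["traceparent"].split("-")
    return {
        "trace_id": trace_id,               # same trace as the producer
        "parent_id": parent_span_id,        # producer span becomes the parent
        "span_id": secrets.token_hex(8),    # new 16-hex-char ID for the consumer span
        "payload": message["payload"],
    }


enqueue({"order_id": 42}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
job_span = dequeue_and_start_span()
```

If the headers are dropped anywhere between the two functions, the consumer has no choice but to start a fresh trace, which is how a boundary silently breaks the chain.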

"Span = function call." Not quite. A span should represent a unit of work worth measuring: an HTTP request, a DB query, a queue publish, a significant computation. A span per function call produces thousands of spans per request and obscures the shape of the trace. Instrumentation libraries already cover the right boundaries; you add spans for the domain-level operations that cross them.

"Trace sampling decisions are reversible." Head-sampling drops are permanent; you cannot recover a dropped trace. Tail-sampling drops are just as permanent: once the Collector discards a trace, the backend never sees it. If you need "keep-forever just in case", the only option is 100% capture with cheap cold storage and fast-path indexing for the sampled subset.

How To Use It

For a service you instrument:

  1. Adopt an OTel SDK in each language you use; configure exporters to your Collector.
  2. Propagate context on every outbound call (HTTP, gRPC, message bus). Most clients have OTel instrumentation built in; prefer that over hand-rolling.
  3. Use semantic conventions for HTTP, DB, messaging attributes. Prefix custom attributes with your service or domain name.
  4. Decide a sampling strategy. For anything non-trivial, tail-sample at the Collector: keep all errors, all slow (p95+) requests, plus a small percentage of the rest.
  5. Link traces to metrics via exemplars (Concept 10) and to logs via trace_id fields (Concept 11).
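
The tail-sampling policy in step 4 can be expressed as a decision over the finished spans of one trace. This is a sketch of the logic only, under stated assumptions: in production the decision runs at the Collector rather than in application code, FinishedSpan and keep_trace are hypothetical names, and slow_ms is a fixed threshold you would set near your observed p95.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    trace_id: str          # hex string
    duration_ms: float
    is_error: bool = False
    is_root: bool = False


def keep_trace(spans: List[FinishedSpan], slow_ms: float, baseline_ratio: float) -> bool:
    """Decide whether to keep a completed trace.

    Keep every trace containing an error, every trace whose root span is at
    least slow_ms long, and a deterministic baseline_ratio share of the rest.
    """
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True
    if any(s.is_root and s.duration_ms >= slow_ms for s in spans):
        return True
    # Baseline sample keyed on the trace ID, so the choice is repeatable.
    low_64_bits = int(spans[0].trace_id, 16) & ((1 << 64) - 1)
    return low_64_bits < int(baseline_ratio * (1 << 64))


trace_id = "f" * 32  # illustrative ID
fast_ok = [FinishedSpan(trace_id, 40, is_root=True)]
with_error = fast_ok + [FinishedSpan(trace_id, 5, is_error=True)]
slow_ok = [FinishedSpan(trace_id, 900, is_root=True)]
```

The function needs the whole trace as input, which is the buffering cost of tail sampling: spans must be held until the trace is complete before any of them can be kept or dropped.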

Check Yourself

  1. What does trace_id give you that a per-service request ID does not?
  2. Why do semantic conventions matter across languages and vendors?
  3. When does head sampling fail you, and what does tail sampling fix?

Mini Drill or Application

Take one endpoint you have written. Sketch the spans you would emit, their attributes (using semantic conventions where possible), and the status on each error path. Then choose a sampling strategy in one sentence.

Source Backbone

Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.