Observability Instrumentation Workshop

Three outputs required: a small-but-structured log schema, a three-question dashboard, and at least one full critical-path trace.

Retrieval Prompts

  1. What makes a log "structured" as opposed to a formatted string?
  2. Name the three questions every capstone dashboard must answer.
  3. Why should span names be low-cardinality while span attributes can be high-cardinality?
  4. State one sensible sampling policy for a capstone (head rate + tail rules).
  5. How do logs, metrics, and traces get correlated across services in practice?

Compare and Distinguish

  • structured log vs sentence log
  • dashboard for leadership vs dashboard for on-call
  • head-based sampling vs tail-based sampling
  • span name vs span attribute
  • metric-based SLI vs log-based SLI

Common Mistake Check

For each statement, identify the error:

  1. "We log everything, just in case."
  2. "Our dashboard has 34 panels; all of them matter."
  3. "We sample 100% of traces; disk is cheap."
  4. "Span name is db.query(SELECT * FROM users WHERE id=42) so we know exactly which query."
  5. "We don't need trace_id in logs; the timestamps are enough to correlate."

Mini Application

Part A -- Log Schema (20 min)

  1. List every decision boundary in one service of your capstone. Aim for 5-10.
  2. Give each a dotted event name of the form <area>.<object>.<verb>, with the verb in past tense (for example, order.payment.declined).
  3. Define the minimum field set: request_id, trace_id, one business identifier (user_id / tenant_id / provider_id), duration_ms, reason, attempt.
  4. Commit the schema to library/raw/logging.md.
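
The schema from steps 2 and 3 can be enforced in code rather than by convention. The sketch below is one possible shape, not a prescribed implementation: build_event, the regular expression, the "ts" field, and the sample values are all hypothetical. Only the field names come from step 3.

```python
import json
import re
import time

# Assumed convention: three lowercase dotted segments, <area>.<object>.<verb>.
EVENT_NAME = re.compile(r"^[a-z_]+\.[a-z_]+\.[a-z_]+$")

# Minimum field set from step 3; the business identifier is checked separately.
REQUIRED = ("request_id", "trace_id", "duration_ms", "reason", "attempt")
BUSINESS_IDS = ("user_id", "tenant_id", "provider_id")


def build_event(event: str, **fields) -> str:
    """Return one JSON log line, or raise ValueError if the schema is violated."""
    if not EVENT_NAME.match(event):
        raise ValueError(f"event name must be <area>.<object>.<verb>: {event!r}")
    missing = [key for key in REQUIRED if key not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if not any(key in fields for key in BUSINESS_IDS):
        raise ValueError("need one business identifier (user_id / tenant_id / provider_id)")
    # "ts" is an added convenience field, not part of the worksheet's minimum set.
    record = {"ts": time.time(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)


# Hypothetical event from a payment service.
line = build_event(
    "payment.charge.declined",
    request_id="req-1",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    tenant_id="tenant-7",
    duration_ms=182,
    reason="card_expired",
    attempt=1,
)
print(line)
```

Because every field is a key rather than part of a sentence, the log backend can filter on reason or tenant_id without parsing free text.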

Part B -- Three-Question Dashboard (30 min)

  1. Create a dashboard called capstone-live with three rows: Healthy?, Slow?, Failing whom?
  2. Add panels only if they answer the row header. Target: ≤ 4 panels per row.
  3. Include the SLO / error-budget big-number tile on row 1.
  4. Screenshot and link the dashboard from library/raw/slo.md.
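
The error-budget tile in step 3 reduces to a small piece of arithmetic. The sketch below assumes a request-based SLO, where the budget is the number of failures the target permits over the window. The function name and the sample numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


# Hypothetical window: 99.9% target, 1,000,000 requests, 400 failures.
# The target permits about 1,000 failures, so roughly 60% of the budget is left.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"error budget remaining: {remaining:.0%}")  # prints "error budget remaining: 60%"
```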

Part C -- Critical-Path Trace (45 min)

  1. Identify the one user journey that defines your SLO.
  2. Instrument the entry point (OTel HTTP middleware or equivalent).
  3. Instrument every outgoing call on that path as a child span: DB, queue, external HTTP.
  4. Propagate traceparent (W3C) across every process boundary, including into queue messages.
  5. Send one real request and open the resulting trace in your tracing UI. If any expected hop is missing, instrument it.
  6. Set sampling: head-sample 1-5% of traces, and always keep traces with a 5xx response or latency above the SLI threshold.
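
Step 4 is where traces most often break, because queue messages do not carry HTTP headers unless you put them there. The sketch below illustrates the idea with the standard library only. In a real service an OTel propagator would normally do this for you. The in-memory list standing in for a queue, the message shape, and the function names are all hypothetical, and only version 00 of the W3C traceparent header is handled.

```python
import re
import secrets

# W3C traceparent, version 00: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields, rejecting invalid values."""
    match = TRACEPARENT.match(header)
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats all-zero ids as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header: str) -> str:
    """Keep the trace-id and flags, but mint a new span id for the outgoing hop."""
    ctx = parse_traceparent(header)
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{ctx['trace_id']}-{span_id}-{ctx['flags']}"


def publish(queue: list, body: dict, incoming: str) -> None:
    """Attach the trace context to the message itself so the consumer can continue it."""
    queue.append({"headers": {"traceparent": child_traceparent(incoming)}, "body": body})


# Hypothetical usage: an HTTP handler enqueues work for a background consumer.
queue: list = []
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
publish(queue, {"job": "send_receipt"}, incoming)
consumer_ctx = parse_traceparent(queue[0]["headers"]["traceparent"])
print(consumer_ctx["trace_id"])
```

The consumer sees the same trace_id as the producer but a different parent span id, which is what lets the tracing UI draw the queue hop as a child span.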

Evidence Check

  • library/raw/logging.md lists events with stable names and fields
  • one service's top five logs are structured, not strings
  • capstone-live dashboard exists and answers all three questions within 10 seconds
  • at least one full trace of the critical path is stored and linkable
  • you can filter traces by a business attribute (tenant, provider, endpoint)
  • library/raw/tracing.md names the sampling policy and the linking convention for runbooks
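
When writing the sampling policy into library/raw/tracing.md, it can help to state it as a decision function. The sketch below is one way to express the Part C policy (head rate plus always-keep rules), not a collector configuration. The function names, the 5% rate, and the 500 ms threshold are assumptions. The always-keep rules need the finished trace, so in practice they are evaluated after the request completes, typically in a collector that buffers spans.

```python
def head_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic head decision: every service that sees this trace_id agrees."""
    # Treat the low 64 bits of the 128-bit trace id as a uniform random number.
    return int(trace_id[-16:], 16) < rate * 2**64


def keep_trace(
    trace_id: str,
    status_code: int,
    duration_ms: float,
    *,
    head_rate: float = 0.05,              # assumed: upper end of the 1-5% range
    latency_threshold_ms: float = 500.0,  # assumed SLI threshold
) -> bool:
    """Tail rules first (errors and slow requests), then fall back to the head rate."""
    if status_code >= 500:
        return True
    if duration_ms > latency_threshold_ms:
        return True
    return head_sampled(trace_id, head_rate)


# Hypothetical trace ids chosen so the head decision is obvious.
low = "4bf92f3577b34da6" + "0000000000000001"   # low bits near zero: head-sampled
high = "4bf92f3577b34da6" + "ffffffffffffffff"  # low bits at maximum: not head-sampled

print(keep_trace(low, 200, 40))    # kept by the head rate
print(keep_trace(high, 200, 40))   # dropped: healthy, fast, not head-sampled
print(keep_trace(high, 503, 40))   # kept: 5xx
print(keep_trace(high, 200, 900))  # kept: slower than the threshold
```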