Observability Instrumentation Workshop
Three outputs required: a small-but-structured log schema, a three-question dashboard, and at least one full critical-path trace.
Retrieval Prompts
- What makes a log "structured" as opposed to a formatted string?
- Name the three questions every capstone dashboard must answer.
- Why should span names be low-cardinality while span attributes can be high-cardinality?
- State one sensible sampling policy for a capstone (head rate + tail rules).
- How do logs, metrics, and traces get correlated across services in practice?
Compare and Distinguish
- structured log vs sentence log
- dashboard for leadership vs dashboard for on-call
- head-based sampling vs tail-based sampling
- span name vs span attribute
- metric-based SLI vs log-based SLI
Common Mistake Check
For each statement, identify the error:
- "We log everything, just in case."
- "Our dashboard has 34 panels; all of them matter."
- "We sample 100% of traces; disk is cheap."
- "Span name is `db.query(SELECT * FROM users WHERE id=42)` so we know exactly which query."
- "We don't need `trace_id` in logs; the timestamps are enough to correlate."
Mini Application
Part A -- Log Schema (20 min)
- List every decision boundary in one service of your capstone. Aim for 5-10.
- Give each a dotted event name: `<area>.<object>.<verb>` in past tense.
- Define the minimum field set: `request_id`, `trace_id`, one business identifier (`user_id` / `tenant_id` / `provider_id`), `duration_ms`, `reason`, `attempt`.
- Commit to `library/raw/logging.md`.
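As a minimal sketch of the Part A schema, the snippet below emits one JSON object per log line and rejects events that omit the minimum field set. It uses only the Python standard library. The event name `payment.charge.declined`, the logger name, the extra `ts` field, and all field values are hypothetical placeholders, not part of the workshop material.

```python
import json
import logging
import sys
import time

# Minimum field set from Part A; every event must carry all of these.
REQUIRED_FIELDS = ("request_id", "trace_id", "duration_ms", "reason", "attempt")
# Exactly which business identifier applies depends on the service.
BUSINESS_IDS = ("user_id", "tenant_id", "provider_id")


def build_event(name: str, **fields) -> dict:
    """Return a structured event dict, rejecting events that skip the schema."""
    missing = [f for f in REQUIRED_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"{name}: missing fields {missing}")
    if not any(b in fields for b in BUSINESS_IDS):
        raise ValueError(f"{name}: needs one of {BUSINESS_IDS}")
    # "ts" is an added convenience field (assumption), not part of the minimum set.
    return {"event": name, "ts": time.time(), **fields}


def log_event(logger: logging.Logger, name: str, **fields) -> str:
    """Serialize the event as a single JSON line, log it, and return the line."""
    line = json.dumps(build_event(name, **fields), sort_keys=True)
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("capstone")  # hypothetical logger name
    log_event(
        log,
        "payment.charge.declined",  # hypothetical <area>.<object>.<verb> name
        request_id="req-123",
        trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
        tenant_id="acme",
        duration_ms=48,
        reason="card_expired",
        attempt=1,
    )
```

Because every field is a key rather than part of a sentence, the log backend can filter on `tenant_id` or join on `trace_id` without parsing free text.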
Part B -- Three-Question Dashboard (30 min)
- Create a dashboard called `capstone-live` with three rows: `Healthy?`, `Slow?`, `Failing whom?`
- Add panels only if they answer the row header. Target: ≤ 4 panels per row.
- Include the SLO / error-budget big-number tile on row 1.
- Screenshot and link the dashboard from `library/raw/slo.md`.
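The error-budget big-number tile on row 1 comes down to one piece of arithmetic. The sketch below shows one common way to compute it, assuming a request-based SLI (good requests over total requests). The function name, the 99% target, and the request counts are illustrative assumptions, not values from the workshop.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left in the window.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so no budget spent
    allowed_bad = (1.0 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good                 # failures that actually happened
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 99% SLO, 10,000 requests, 50 of them bad.
    remaining = error_budget_remaining(0.99, good=9_950, total=10_000)
    print(f"error budget remaining: {remaining:.0%}")
```

A single number like this answers `Healthy?` at a glance, which is why it earns a place on row 1 rather than a time-series panel.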
Part C -- Critical-Path Trace (45 min)
- Identify the one user journey that defines your SLO.
- Instrument the entry point (OTel HTTP middleware or equivalent).
- Instrument every outgoing call on that path as a child span: DB, queue, external HTTP.
- Propagate `traceparent` (W3C) across every process boundary, including into queue messages.
- Force one real request, open the trace in your tracing UI. If any expected hop is missing, add it.
- Set sampling: head 1-5%, always-keep on 5xx and on latency > SLI threshold.
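The propagation step above can be sketched in plain Python to show what actually travels across a queue boundary. The sketch assumes the version-00 `traceparent` layout from W3C Trace Context (`version-traceid-parentid-flags`) and models the queue as a Python list. The `publish` and `consume` helpers and the message shape are hypothetical. In a real service the tracing SDK's propagator would normally do this for you.

```python
import re
import secrets
from typing import Optional, Tuple

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def publish(queue: list, body: dict, incoming_traceparent: str) -> None:
    """Enqueue a message that carries the trace context in its headers."""
    ctx = parse_traceparent(incoming_traceparent)
    if ctx is None:
        raise ValueError("invalid traceparent; refusing to break the trace")
    span_id = secrets.token_hex(8)  # id of the producer span (16 hex chars)
    outgoing = f"00-{ctx['trace_id']}-{span_id}-{ctx['flags']}"
    queue.append({"headers": {"traceparent": outgoing}, "body": body})


def consume(queue: list) -> Tuple[Optional[dict], dict]:
    """Dequeue a message and recover the trace context for the consumer span."""
    message = queue.pop(0)
    return parse_traceparent(message["headers"]["traceparent"]), message["body"]


if __name__ == "__main__":
    q: list = []
    # Example header value taken from the W3C Trace Context specification.
    publish(q, {"order_id": 7},
            "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    ctx, body = consume(q)
    print(ctx["trace_id"], ctx["parent_id"], ctx["sampled"], body)
```

The trace id stays constant across the hop while the parent id changes to the producer's span, which is what lets the tracing UI draw the consumer as a child instead of a separate trace.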
Evidence Check
- `library/raw/logging.md` lists events with stable names and fields
- one service's top five logs are structured, not strings
- `capstone-live` dashboard exists, answers all three questions within 10 seconds
- at least one full trace of the critical path is stored and linkable
- you can filter traces by a business attribute (tenant, provider, endpoint)
- `library/raw/tracing.md` names the sampling policy and the linking convention for runbooks
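The sampling policy that `library/raw/tracing.md` records (head 1-5%, always-keep on 5xx and on latency above the SLI threshold) can be written down as a small decision function. This is a sketch under stated assumptions: the 5% rate is one point inside the suggested range, the 500 ms threshold is a hypothetical stand-in for your SLI threshold, and the function names are illustrative. The tail rules need the finished request's status and duration, so in practice they run wherever completed traces are buffered, not at the entry point.

```python
HEAD_RATE = 0.05              # within the suggested 1-5% range
LATENCY_THRESHOLD_MS = 500.0  # hypothetical SLI threshold; use your own


def head_sampled(trace_id: str, rate: float = HEAD_RATE) -> bool:
    """Deterministic head decision: the same trace_id always gets the same answer.

    Maps the low 64 bits of the hex trace id onto [0, 1) and compares with rate,
    so every service that sees the trace agrees without coordination.
    """
    bucket = int(trace_id[-16:], 16) / float(1 << 64)
    return bucket < rate


def keep_trace(
    trace_id: str,
    status_code: int,
    duration_ms: float,
    rate: float = HEAD_RATE,
    latency_threshold_ms: float = LATENCY_THRESHOLD_MS,
) -> bool:
    """Apply the tail rules first, then fall back to the head rate."""
    if status_code >= 500:
        return True  # always keep server errors
    if duration_ms > latency_threshold_ms:
        return True  # always keep requests slower than the SLI threshold
    return head_sampled(trace_id, rate)


if __name__ == "__main__":
    tid = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(keep_trace(tid, status_code=503, duration_ms=20.0))   # kept: 5xx
    print(keep_trace(tid, status_code=200, duration_ms=900.0))  # kept: slow
    print(keep_trace(tid, status_code=200, duration_ms=20.0))   # head rate decides
```

Hashing the trace id instead of calling a random number generator keeps the head decision consistent across services, so a trace is either kept whole or dropped whole.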