Observability Design Clinic

Retrieval Prompts

State what RED and USE stand for and when to use each.
Explain cardinality and give one label that is almost always a bad idea.
State the purpose of OpenTelemetry semantic conventions in two sentences.
State the difference between head sampling and tail sampling.
State the difference between a symptom alert and a cause alert.

Compare and Distinguish

Distinguish the items in each group clearly:

counter vs gauge vs histogram
structured log vs unstructured log
trace vs log vs metric
head sampling vs tail sampling
exemplar vs label vs attribute

Common Mistake Check

For each statement, identify the error:

"Adding user_id as a label helps us see per-user behavior in metrics."
"We have logs, so we have observability."
"We sample 1% of traces, so errors are captured."
"The dashboard shows green, so the pipeline is running."
"This alert fires on CPU > 80%; it is a symptom alert."

Mini Application

Use local tooling by default: OpenTelemetry Collector, console exporters, Prometheus, Grafana, Jaeger/Tempo, Loki, or structured stdout logs. Paid observability services are optional; if you use one, set retention limits and include teardown/cleanup notes.

Design a complete observability spec for a single service. You may use a real service or invent one (e.g. "orders-api" with POST /orders, GET /orders/:id, and a background job export-orders).

Part A -- Metrics (30 minutes)

Produce a metrics list of 5-8 entries. For each:

name (Prometheus-style or OTel-style; pick one and use it consistently)
type (counter / gauge / histogram)
labels (bounded in cardinality, justified)
which SLO it supports

Include at least one histogram with SLO-aligned buckets.
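
One way to defend a cardinality budget is to write each metric down as data and compute its worst-case series count. The sketch below is illustrative only. The metric name, label values, bucket bounds, and the 0.3 s latency SLO are all assumptions for the invented orders-api, and the series arithmetic assumes a classic Prometheus histogram (one series per finite bucket, plus the +Inf bucket, _sum, and _count).

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class MetricSpec:
    """One row of the metrics list. All names here are illustrative."""
    name: str
    kind: str                            # "counter", "gauge", or "histogram"
    labels: dict[str, tuple[str, ...]]   # label -> closed set of allowed values
    slo: str                             # the SLO this metric supports
    buckets: tuple[float, ...] = ()      # finite upper bounds, histograms only

    def series_count(self) -> int:
        """Worst-case number of time series this metric can create."""
        combos = prod(len(values) for values in self.labels.values())
        if self.kind == "histogram":
            # Per label combination: one series per finite bucket,
            # plus the +Inf bucket, _sum, and _count.
            return combos * (len(self.buckets) + 3)
        return combos


# Assumed SLO: 99% of requests complete within 0.3 s, so 0.3 is a bucket bound.
request_duration = MetricSpec(
    name="http_server_request_duration_seconds",
    kind="histogram",
    labels={
        "route": ("POST /orders", "GET /orders/:id"),
        "status_class": ("2xx", "4xx", "5xx"),
    },
    slo="latency: 99% of requests under 0.3 s",
    buckets=(0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5),
)

print(request_duration.series_count())  # 2 routes * 3 classes * (7 + 3) = 60
```

Placing a bucket bound exactly at the SLO threshold lets the "fraction of requests under 0.3 s" be read directly from bucket counts instead of being interpolated.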

Part B -- Logs (20 minutes)

Produce:

the schema of a log line (stable keys with types)
an event vocabulary (5+ named events)
a denylist of fields that must never appear in logs
one example log line in JSON for a failure case
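
The schema, vocabulary, and denylist can be enforced at the point where a log line is built. The following is a minimal sketch under stated assumptions: the key names, the event names, and the denylist entries are hypothetical choices for the invented orders-api, not a prescribed schema.

```python
import json

# Hypothetical denylist; extend it to match your own data-handling rules.
DENYLIST = frozenset({"password", "authorization", "card_number", "email"})

# Hypothetical event vocabulary for orders-api.
EVENTS = frozenset({
    "order_created", "order_create_failed", "order_fetched",
    "export_started", "export_failed",
})


def build_log_line(timestamp: str, level: str, event: str,
                   trace_id: str, **fields: object) -> str:
    """Return one JSON log line, rejecting unknown events and denied fields."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    denied = DENYLIST.intersection(fields)
    if denied:
        raise ValueError(f"denylisted fields: {sorted(denied)}")
    record = {
        "ts": timestamp, "level": level, "service": "orders-api",
        "event": event, "trace_id": trace_id, **fields,
    }
    # sort_keys keeps the key order stable across lines.
    return json.dumps(record, sort_keys=True)


# Failure-case example; every value below is made up for illustration.
line = build_log_line(
    "2024-01-01T12:00:00Z", "error", "order_create_failed",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    route="POST /orders", http_status=503,
    error_kind="upstream_timeout", duration_ms=3012,
)
print(line)
```

Carrying trace_id in every line is what makes a log-to-trace pivot possible later on the dashboard.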

Part C -- Traces (30 minutes)

For one endpoint:

list the spans you would emit
list the attributes per span using OpenTelemetry semantic conventions where possible
name the sampling strategy (head, tail, or combined) and explain why
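
To make the sampling choice concrete, the sketch below contrasts the two decisions in plain Python. It is a simplified illustration, not the OpenTelemetry SDK's or Collector's actual algorithm: the Span type, the 300 ms slow threshold, and the 1% baseline ratio are assumptions, and the combined policy assumes all spans of a trace reach one decision point before anything is dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at trace start, from the trace ID alone.

    Simplified ratio sampler for illustration only.
    """
    low_64_bits = int(trace_id_hex, 16) & (2**64 - 1)
    return low_64_bits < int(ratio * 2**64)


def tail_sample(spans: list[Span], slow_ms: float = 300.0) -> bool:
    """Decide after the trace is complete: keep errors and slow traces."""
    return any(s.is_error or s.duration_ms > slow_ms for s in spans)


def keep_trace(trace_id_hex: str, spans: list[Span], ratio: float = 0.01) -> bool:
    """Combined policy: always keep interesting traces, plus a baseline sample."""
    return tail_sample(spans) or head_sample(trace_id_hex, ratio)


# Hypothetical trace for POST /orders in which the database insert failed.
failed = [
    Span("POST /orders", 120.0),
    Span("INSERT orders", 95.0, is_error=True),
]
print(keep_trace("ffffffffffffffffffffffffffffffff", failed))  # kept by the tail rule
```

The head decision never sees the outcome of the request, which is why a pure head-sampling policy cannot promise that error traces are retained.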

Part D -- Dashboard (20 minutes)

Design the primary dashboard:

name 3-6 questions it answers
one panel per question
pivot links on each panel (to traces / logs / upstream)
annotations (deploys, incidents)

Part E -- Alerts (20 minutes)

Produce 4 alerts:

at least 2 symptom-based with SLO reference and time window
at least 1 silent-runner / freshness alert
for each, name the runbook filename
for each, label as symptom or cause
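
Two of the conditions above reduce to small calculations: a symptom alert tied to an SLO is often expressed as an error-budget burn rate over a time window, and a silent-runner alert as the age of the last successful run. The sketch below shows both under stated assumptions: the 15-minute schedule for export-orders, the two-interval staleness limit, and the sample numbers are placeholders, not recommended thresholds.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    error_ratio is the observed failure fraction over the alert's time window.
    """
    return error_ratio / (1.0 - slo_target)


def is_stale(last_success_ts: float, now_ts: float, max_age_s: float) -> bool:
    """Freshness check: true when no successful run finished recently enough."""
    return (now_ts - last_success_ts) > max_age_s


# Symptom example: 1% of requests failing against an assumed 99% SLO
# spends the budget at roughly 1x, the break-even rate.
print(burn_rate(error_ratio=0.01, slo_target=0.99))

# Silent-runner example: export-orders is assumed to run every 15 minutes,
# so two missed intervals (1800 s) without a success counts as stale.
print(is_stale(last_success_ts=0.0, now_ts=1801.0, max_age_s=1800.0))  # True
```

The freshness check fires on the absence of a success timestamp advancing, so it still works when the job emits no errors because it is not running at all.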

Evidence Check

This page is complete when:

every metric has a defended cardinality budget
the trace spec uses semantic conventions where they exist
the dashboard has a named audience and specific questions
every paging alert is a symptom, not a cause
the silent-runner alert exists and has a clear absence-of-progress signal
