Observability Design Clinic
Retrieval Prompts
- State RED and USE and when to use each.
- Explain cardinality and give one label that is almost always a bad idea.
- State the purpose of OpenTelemetry semantic conventions in two sentences.
- State the difference between head sampling and tail sampling.
- State the difference between a symptom alert and a cause alert.
Compare and Distinguish
Separate these pairs clearly:
- counter vs gauge vs histogram
- structured log vs unstructured log
- trace vs log vs metric
- head sampling vs tail sampling
- exemplar vs label vs attribute
Common Mistake Check
For each statement, identify the error:
- "Adding user_id as a label helps us see per-user behavior in metrics."
- "We have logs, so we have observability."
- "We sample 1% of traces, so errors are captured."
- "The dashboard shows green, so the pipeline is running."
- "This alert fires on CPU > 80%; it is a symptom alert."
Mini Application
Use local tooling by default: OpenTelemetry Collector, console exporters, Prometheus, Grafana, Jaeger/Tempo, Loki, or structured stdout logs. Paid observability services are optional; if you use one, set retention limits and include teardown/cleanup notes.
Design a complete observability spec for a single service. You may use a real service or invent one (e.g. "orders-api" with POST /orders, GET /orders/:id, and a background job export-orders).
Part A -- Metrics (30 minutes)
Produce a metrics list of 5-8 entries. For each:
- name (Prometheus-style or OTel-style, consistent)
- type (counter / gauge / histogram)
- labels (bounded in cardinality, justified)
- which SLO it supports
Include at least one histogram with SLO-aligned buckets.
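One way to make the label justification concrete is to write each entry as data and compute its worst-case series count. The sketch below is a minimal illustration, not a required format: the MetricSpec class, the metric name, the per-label budgets, and the 300 ms SLO threshold are all hypothetical. The series count assumes Prometheus-style histograms (one series per finite bucket, plus the +Inf bucket, _sum, and _count).

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class MetricSpec:
    """One row of the metrics list (hypothetical format)."""
    name: str
    kind: str            # "counter", "gauge", or "histogram"
    labels: dict         # label name -> maximum distinct values allowed
    slo: str             # which SLO this metric supports
    buckets: tuple = ()  # upper bounds in seconds, histograms only

    def max_series(self) -> int:
        """Worst-case series count if every label combination occurs."""
        combos = prod(self.labels.values())
        if self.kind == "histogram":
            # Finite buckets + the +Inf bucket + _sum + _count.
            return combos * (len(self.buckets) + 3)
        return combos


# Assumed SLO for the invented orders-api service.
SLO_THRESHOLD_S = 0.3

latency = MetricSpec(
    name="orders_request_duration_seconds",
    kind="histogram",
    labels={"route": 2, "method": 2, "status_class": 4},
    slo="99% of requests complete within 300 ms",
    buckets=(0.05, 0.1, 0.3, 0.5, 1.0, 2.5),
)

# SLO-aligned: the threshold is an exact bucket boundary, so the fraction
# of requests under the threshold can be read directly from one bucket.
assert SLO_THRESHOLD_S in latency.buckets

print(latency.name, latency.max_series())
```

Replacing any label budget with an unbounded value (for example, one value per user) multiplies the worst case accordingly, which is the calculation a cardinality budget is meant to expose.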
Part B -- Logs (20 minutes)
Produce:
- the schema of a log line (stable keys with types)
- an event vocabulary (5+ named events)
- a denylist of fields that must never appear in logs
- one example log line in JSON for a failure case
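The four deliverables above can be kept together as one small, checkable artifact. The following is a sketch under stated assumptions: the key names, event names, denylisted fields, and sample values are invented for the "orders-api" example and are not a standard schema.

```python
import json

# Hypothetical schema: stable keys and their expected types.
SCHEMA = {
    "ts": str,           # RFC 3339 timestamp in UTC
    "level": str,
    "service": str,
    "event": str,        # must come from the event vocabulary below
    "trace_id": str,
    "duration_ms": int,
    "error_kind": str,
}

# Hypothetical event vocabulary (5+ named events).
EVENTS = {
    "order.create.started",
    "order.create.succeeded",
    "order.create.failed",
    "order.fetch.succeeded",
    "export.run.finished",
}

# Hypothetical denylist: keys that must never appear in a log line.
DENYLIST = {"password", "authorization", "card_number", "email"}


def render(record: dict) -> str:
    """Validate a record against the spec and return one JSON log line."""
    leaked = DENYLIST.intersection(record)
    if leaked:
        raise ValueError(f"denylisted fields present: {sorted(leaked)}")
    for key, expected in SCHEMA.items():
        if not isinstance(record.get(key), expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    if record["event"] not in EVENTS:
        raise ValueError(f"unknown event: {record['event']}")
    return json.dumps(record, sort_keys=True)


# Example failure case (all values invented).
FAILURE = {
    "ts": "2024-01-01T12:00:00Z",
    "level": "error",
    "service": "orders-api",
    "event": "order.create.failed",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "duration_ms": 412,
    "error_kind": "payment_timeout",
}

failure_line = render(FAILURE)
print(failure_line)
```

Rejecting denylisted keys at render time is only one possible enforcement point; the same lists could equally drive a code-review checklist or a pipeline filter.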
Part C -- Traces (30 minutes)
For one endpoint:
- list the spans you would emit
- list the attributes per span using OpenTelemetry semantic conventions where possible
- name the sampling strategy (head, tail, or combined) and explain why
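When explaining the sampling choice, it can help to write down what each decision is allowed to see. The sketch below is illustrative only and does not use any OpenTelemetry API: the function names, the span dictionary shape (trace_id, duration_ms, error), and the thresholds are assumptions. The head decision sees only the trace ID at the start of the request; the tail decision sees the finished trace.

```python
import hashlib


def head_keep(trace_id: str, rate: float) -> bool:
    """Head sampling: decide at the root span, from the trace ID alone.

    The decision is deterministic, so every service that hashes the same
    trace ID with the same rate reaches the same answer.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    value = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return value < int(rate * 2**64)


def tail_keep(spans: list, slow_ms: float, baseline_rate: float) -> bool:
    """Tail sampling: decide after the trace is complete.

    Keeps every trace with an error or a slow span, and a baseline
    fraction of the remaining (healthy) traces.
    """
    if any(span["error"] for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    return head_keep(spans[0]["trace_id"], baseline_rate)


# Invented example: a failed POST /orders trace with two spans.
failed_trace = [
    {"trace_id": "t-001", "duration_ms": 40.0, "error": False},
    {"trace_id": "t-001", "duration_ms": 12.0, "error": True},
]
print(tail_keep(failed_trace, slow_ms=1000.0, baseline_rate=0.01))
```

A combined strategy would typically apply a generous head rate first and a tail rule like this afterwards; the cost is that the tail stage must buffer spans until each trace finishes.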
Part D -- Dashboard (20 minutes)
Design the primary dashboard:
- name 3-6 questions it answers
- one panel per question
- pivot links on each panel (to traces / logs / upstream)
- annotations (deploys, incidents)
Part E -- Alerts (20 minutes)
Produce 4 alerts:
- at least 2 symptom-based with SLO reference and time window
- at least 1 silent-runner / freshness alert
- for each, name the runbook filename
- for each, label as symptom or cause
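The silent-runner / freshness alert differs from the others in that it fires on the absence of a signal. A minimal sketch of that condition, assuming the background job records the Unix time of its last successful run and that "export-orders" is expected hourly (both assumptions, as are the function name and the two-hour threshold):

```python
from typing import Optional


def is_stale(last_success_unix: Optional[float],
             now_unix: float,
             max_age_s: float) -> bool:
    """Return True when the job has not succeeded recently enough.

    A job that has never reported success is treated as stale, so a
    runner that never started cannot stay silent.
    """
    if last_success_unix is None:
        return True
    return (now_unix - last_success_unix) > max_age_s


# Assumed schedule: hourly runs, alert after two missed intervals.
MAX_AGE_S = 2 * 60 * 60

now = 1_700_010_000.0
print(is_stale(now - 1_800, now, MAX_AGE_S))   # succeeded 30 minutes ago
print(is_stale(now - 10_800, now, MAX_AGE_S))  # succeeded 3 hours ago
print(is_stale(None, now, MAX_AGE_S))          # never succeeded
```

Note that the check keys on the last success, not the last start: a job that starts on schedule but never finishes should still trip the alert.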
Evidence Check
This page is complete when:
- every metric has a defended cardinality budget
- the trace spec uses semantic conventions where they exist
- the dashboard has a named audience and specific questions
- every paging alert is a symptom, not a cause
- the silent-runner alert exists and has a clear absence-of-progress signal
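The alert-related criteria above are mechanical enough to check automatically. The sketch below is one hypothetical way to do so: the AlertSpec fields, alert names, and runbook filenames are invented, and the remaining criteria (cardinality budgets, semantic conventions, dashboard audience) still need human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertSpec:
    """One alert from Part E (hypothetical format)."""
    name: str
    kind: str                   # "symptom" or "cause"
    pages: bool                 # True if it pages a human
    runbook: str                # runbook filename
    is_freshness: bool = False  # silent-runner / freshness alert


def review(alerts: list) -> list:
    """Return a list of problems; an empty list means the checks pass."""
    problems = []
    for alert in alerts:
        if alert.pages and alert.kind != "symptom":
            problems.append(f"{alert.name}: paging alert must be a symptom")
        if not alert.runbook:
            problems.append(f"{alert.name}: missing runbook filename")
    if not any(alert.is_freshness for alert in alerts):
        problems.append("no silent-runner / freshness alert")
    return problems


# Invented alert set for the orders-api example.
ALERTS = [
    AlertSpec("orders_availability_burn", "symptom", True,
              "orders-availability.md"),
    AlertSpec("orders_latency_burn", "symptom", True,
              "orders-latency.md"),
    AlertSpec("export_orders_stale", "symptom", True,
              "export-orders-stale.md", is_freshness=True),
    AlertSpec("orders_db_cpu_high", "cause", False,
              "orders-db-cpu.md"),
]

print(review(ALERTS))
```

In this sketch a cause alert is allowed as long as it does not page, which matches the criterion as written: it constrains paging alerts, not every alert.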