Observability Design Clinic

Retrieval Prompts

  1. State what the RED and USE methods each measure and when to use each.
  2. Explain cardinality and give one label that is almost always a bad idea.
  3. State the purpose of OpenTelemetry semantic conventions in two sentences.
  4. State the difference between head sampling and tail sampling.
  5. State the difference between a symptom alert and a cause alert.

Compare and Distinguish

Separate these pairs clearly:

  • counter vs gauge vs histogram
  • structured log vs unstructured log
  • trace vs log vs metric
  • head sampling vs tail sampling
  • exemplar vs label vs attribute

Common Mistake Check

For each statement, identify the error:

  1. "Adding user_id as a label helps us see per-user behavior in metrics."
  2. "We have logs, so we have observability."
  3. "We sample 1% of traces, so errors are captured."
  4. "The dashboard shows green, so the pipeline is running."
  5. "This alert fires on CPU > 80%; it is a symptom alert."

Mini Application

Use local tooling by default: OpenTelemetry Collector, console exporters, Prometheus, Grafana, Jaeger/Tempo, Loki, or structured stdout logs. Paid observability services are optional; if you use one, set retention limits and include teardown/cleanup notes.

Design a complete observability spec for a single service. You may use a real service or invent one (e.g. "orders-api" with POST /orders, GET /orders/:id, and a background job export-orders).

Part A -- Metrics (30 minutes)

Produce a metrics list of 5-8 entries. For each:

  • name (Prometheus-style or OTel-style, consistent)
  • type (counter / gauge / histogram)
  • labels (bounded in cardinality, justified)
  • which SLO it supports

Include at least one histogram with SLO-aligned buckets.
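
One way to read "SLO-aligned buckets": if the SLO threshold is itself a bucket boundary, the share of requests within the threshold can be read directly from the cumulative counts, with no interpolation. The sketch below uses only the Python standard library; the 0.3 s threshold, the bucket list, and the sample latencies are assumptions made up for illustration, not values from any real service.

```python
from bisect import bisect_left

# Hypothetical SLO: requests to POST /orders should complete within 0.3 s.
SLO_THRESHOLD_S = 0.3

# Bucket upper bounds in seconds; the SLO threshold is one of the bounds.
BUCKETS_S = [0.05, 0.1, 0.2, 0.3, 0.5, 1.0, float("inf")]


def cumulative_counts(latencies, bounds):
    """Return cumulative counts: how many values are <= each upper bound."""
    per_bucket = [0] * len(bounds)
    for value in latencies:
        # bisect_left finds the first bound >= value, i.e. the smallest bucket.
        per_bucket[bisect_left(bounds, value)] += 1
    running, out = 0, []
    for count in per_bucket:
        running += count
        out.append(running)
    return out


def fraction_within(latencies, bounds, threshold):
    """Fraction of observations <= threshold; threshold must be a bound."""
    counts = cumulative_counts(latencies, bounds)
    return counts[bounds.index(threshold)] / counts[-1]


# Invented sample data for illustration only.
sample = [0.04, 0.08, 0.12, 0.25, 0.3, 0.31, 0.45, 0.9, 1.4, 0.02]
ratio = fraction_within(sample, BUCKETS_S, SLO_THRESHOLD_S)
print(ratio)  # 0.6
```

If the threshold fell between two bounds (say 0.2 and 0.5), the same question could only be answered approximately, which is the argument for choosing bounds from the SLO rather than from a default list.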

Part B -- Logs (20 minutes)

Produce:

  • the schema of a log line (stable keys with types)
  • an event vocabulary (5+ named events)
  • a denylist of fields that must never appear in logs
  • one example log line in JSON for a failure case
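
The four deliverables above can be sketched together in a few lines. This is a minimal illustration using only the standard library, not a recommended logging framework; the key names, the event vocabulary, the denylist entries, and the example IDs are all hypothetical and assume the invented "orders-api" service.

```python
import json
from datetime import datetime, timezone

# Hypothetical denylist; extend it to match your own data-handling policy.
DENYLIST = {"password", "card_number", "authorization", "email"}

# Hypothetical event vocabulary for the invented "orders-api" service.
EVENTS = {
    "order.created",
    "order.validation_failed",
    "order.payment_failed",
    "order.fetched",
    "export.completed",
}


def log_line(event, level, **fields):
    """Build one JSON log line with stable keys, dropping denylisted fields."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),  # string, ISO 8601 UTC
        "level": level,                                # string
        "service": "orders-api",                       # string
        "event": event,                                # string, from EVENTS
    }
    for key, value in fields.items():
        if key.lower() not in DENYLIST:
            record[key] = value
    return json.dumps(record, sort_keys=True)


# Failure-case example; the IDs and error code are invented.
line = log_line(
    "order.payment_failed",
    "error",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    order_id="ord_123",
    error_code="card_declined",
    card_number="4111111111111111",  # denylisted, so it is dropped
)
print(line)
```

Rejecting unknown event names at write time is one possible design choice: it keeps the vocabulary closed, so queries and alerts can rely on exact event strings.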

Part C -- Traces (30 minutes)

For one endpoint:

  • list the spans you would emit
  • list the attributes per span using OpenTelemetry semantic conventions where possible
  • name the sampling strategy (head, tail, or combined) and explain why
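
When choosing a sampling strategy, it can help to see the two decisions side by side. The sketch below is plain Python, not the OpenTelemetry SDK or Collector API: the span dictionaries, field names, the 1% ratio, and the 300 ms threshold are assumptions for illustration. Head sampling decides at trace start from the trace ID alone; tail sampling decides after the trace has finished, when errors and latency are known.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head sampling: keep roughly `ratio` of traces, decided from the ID only."""
    # Treat the low 64 bits of the trace ID as a uniform number in [0, 2**64).
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < ratio * 2**64


def tail_sample(spans: list[dict], slow_ms: float) -> bool:
    """Tail sampling: keep a finished trace if it errored or was slow."""
    has_error = any(span["status"] == "error" for span in spans)
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    return has_error or (end - start) > slow_ms


def combined(trace_id_hex, spans, ratio=0.01, slow_ms=300.0):
    """Combined policy: always keep interesting traces, plus a small baseline."""
    return tail_sample(spans, slow_ms) or head_sample(trace_id_hex, ratio)


# Invented traces: one fast success, one containing a failed span.
fast_ok = [{"status": "ok", "start_ms": 0, "end_ms": 40}]
failed = [
    {"status": "ok", "start_ms": 0, "end_ms": 40},
    {"status": "error", "start_ms": 5, "end_ms": 30},
]

print(combined("f" * 32, fast_ok))  # False: not interesting, not in the 1%
print(combined("f" * 32, failed))   # True: kept because a span errored
```

The trade-off to explain in your answer: the head decision is cheap and made before any work happens, but it cannot know the outcome; the tail decision sees the outcome, but requires buffering complete traces somewhere (typically in a collector) before deciding.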

Part D -- Dashboard (20 minutes)

Design the primary dashboard:

  • name 3-6 questions it answers
  • one panel per question
  • pivot links on each panel (to traces / logs / upstream)
  • annotations (deploys, incidents)

Part E -- Alerts (20 minutes)

Produce 4 alerts:

  • at least 2 symptom-based with SLO reference and time window
  • at least 1 silent-runner / freshness alert
  • for every alert, name the runbook filename
  • for every alert, label it as symptom or cause
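
A sketch of how the alert list and the silent-runner check might be written down and self-checked. Everything here is hypothetical: the alert names, windows, runbook filenames, and the choice to classify the freshness alert as a symptom are assumptions for the invented "orders-api" service, and `is_stale` stands in for whatever rule your alerting tool would evaluate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    kind: str      # "symptom" or "cause"
    pages: bool    # True if it pages a human, False if it only opens a ticket
    window: str
    runbook: str


# All names, windows, and runbook filenames are invented for illustration.
ALERTS = [
    Alert("orders_availability_burn", "symptom", True, "1h", "orders-availability.md"),
    Alert("orders_latency_burn", "symptom", True, "1h", "orders-latency.md"),
    Alert("export_orders_stale", "symptom", True, "2h", "export-orders-stale.md"),
    Alert("orders_db_pool_saturated", "cause", False, "15m", "orders-db-pool.md"),
]


def is_stale(last_success_ts, now_ts, max_age_s):
    """Silent-runner check: fire when no success was recorded within max_age_s.

    last_success_ts is None when the job has never reported a success.
    """
    if last_success_ts is None:
        return True
    return (now_ts - last_success_ts) > max_age_s


def paging_causes(alerts):
    """Return the names of alerts that page on a cause rather than a symptom."""
    return [a.name for a in alerts if a.pages and a.kind != "symptom"]


print(paging_causes(ALERTS))         # []
print(is_stale(None, 1000.0, 7200))  # True: the job never reported success
```

The point of `is_stale` is that it keys on the absence of a success signal, so it still fires when the job crashes before emitting any error at all.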

Evidence Check

This page is complete when:

  • every metric has a defended cardinality budget
  • the trace spec uses semantic conventions where they exist
  • the dashboard has a named audience and specific questions
  • every paging alert is a symptom, not a cause
  • the silent-runner alert exists and has a clear absence-of-progress signal
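
To defend a cardinality budget, a worst-case estimate is usually enough: multiply the number of distinct values of each label. The sketch below assumes a classic Prometheus-style histogram, where each label combination yields one series per bucket plus a sum and a count; the label value counts and the budget of 1,000 series are invented for illustration.

```python
from math import prod

# Hypothetical per-metric budget; pick one that suits your own backend.
SERIES_BUDGET = 1_000


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case label combinations: product of distinct values per label."""
    return prod(label_values.values())


def histogram_series(label_values: dict[str, int], n_buckets: int) -> int:
    """Classic histogram: one series per bucket, plus _sum and _count."""
    return series_count(label_values) * (n_buckets + 2)


# Invented label sets for the "orders-api" example.
bounded = {"method": 3, "route": 3, "status_class": 5}
with_user = {**bounded, "user_id": 100_000}

print(series_count(bounded))                         # 45
print(histogram_series(bounded, 7))                  # 405
print(histogram_series(bounded, 7) <= SERIES_BUDGET) # True
print(series_count(with_user))                       # 4500000
```

The estimate is an upper bound, since not every combination occurs in practice, but a label whose value count grows with traffic or user base will break any fixed budget.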