A Dashboard That Answers 3 Specific Questions

What This Concept Is

A dashboard is not a wall of metrics; it is the answer to a small set of pre-declared questions. For a capstone, there is exactly one dashboard worth building, and it must answer these three questions, in this order, within ten seconds of opening it:

  1. Is it healthy right now? -- are we meeting our SLOs in the current window?
  2. Is it slow right now? -- is p95 / p99 latency within our latency SLI target?
  3. Is it failing whom? -- which users, tenants, endpoints, or dependencies are bearing the pain?

If the dashboard cannot answer those three in order, it is not a capstone dashboard -- it is a museum.

Every panel on the dashboard should map to one of those three questions. Anything that maps to none of them does not belong on this dashboard; it can live on a "deep-dive" dashboard that nobody looks at until needed. The three-question discipline is a compressed form of Google SRE's four golden signals (latency, traffic, errors, saturation). Errors feed the health row and latency feeds the speed row; traffic and saturation are mostly capacity metrics, so for operations under pressure they fold into those two rows. The third row (failing whom) is what Grafana calls dimensional analysis: the same signals broken down by user, tenant, endpoint, or dependency.
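The mapping rule can be checked mechanically. The sketch below is a minimal illustration, not part of any dashboarding tool: the `Panel` class, the `question` tag, and the sample panel titles are all hypothetical names introduced here. It splits a panel list into those that stay on the main dashboard and those that move to the deep-dive dashboard.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# The three questions, in the order the dashboard must answer them.
QUESTIONS = ("healthy", "slow", "failing_whom")


@dataclass(frozen=True)
class Panel:
    title: str
    question: Optional[str]  # one of QUESTIONS, or None if it answers none of them


def split_panels(panels: List[Panel]) -> Tuple[List[Panel], List[Panel]]:
    """Return (keep, move_to_deep_dive) according to the three-question rule."""
    keep = [p for p in panels if p.question in QUESTIONS]
    move = [p for p in panels if p.question not in QUESTIONS]
    return keep, move


# Hypothetical panel inventory for illustration only.
panels = [
    Panel("availability SLI (1h)", "healthy"),
    Panel("p95 request latency", "slow"),
    Panel("top 10 endpoints by error rate", "failing_whom"),
    Panel("host CPU", None),
]

keep, move = split_panels(panels)
print("keep:", [p.title for p in keep])
print("move to deep-dive:", [p.title for p in move])
```

Tagging each panel with its question up front makes the "does this belong here?" decision a lookup rather than a debate.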

Dashboards are opinionated, not neutral. Every panel you add is a claim that, at 2 a.m., the answer to one of three questions will come faster because of it. Every panel that fails that claim is slowing you down, even if the data is useful elsewhere.

Why It Matters Here (In the Capstone)

At 2 a.m., you do not need every metric. You need the exact picture that tells you whether the fast-burn alert was real (question 1), what dimension the pain is spread across (question 3), and how close to the latency budget you are (question 2).

You are also the audience. A dashboard that takes 40 seconds of scrolling before you see anything actionable will be skipped under stress, and then its existence is a lie. The runbooks in cluster 5 all cite this dashboard in their "Checks" section -- if the dashboard is unreliable, so are the runbooks.

Concrete Example(s) -- from a real capstone

For the webhook-handler capstone, the single dashboard capstone-live has three rows:

Row 1 -- Healthy? (question 1)

  • big-number tile: availability SLI in last 1h (target: 99.5%, green >= 99.5, yellow 99.0-99.5, red <99.0)
  • big-number tile: error budget remaining (% of 30-day budget; tiers from concept 2)
  • big-number tile: current burn-rate multiplier (x1 on target, >= x6 slow-burn, >= x14.4 fast-burn)
  • sparkline: error rate (5xx / total) over last 6h, with SLO threshold overlay
  • tile: status of top 3 external dependencies (notification API, DB, queue -- last health check)
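The arithmetic behind the burn-rate and budget tiles can be sketched as follows. This assumes the 99.5% availability target and the x6 / x14.4 tiers listed above, and treats burn rate as the observed error fraction divided by the allowed error fraction. The function names and the sample request counts are hypothetical.

```python
SLO_TARGET_PCT = 99.5                          # availability target from the tile above
ERROR_BUDGET = (100.0 - SLO_TARGET_PCT) / 100  # allowed failure fraction (0.005)


def burn_rate(errors: int, total: int) -> float:
    """Observed error fraction divided by the allowed error fraction."""
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET


def classify(rate: float) -> str:
    """Map a burn-rate multiplier onto the tiers used by the tile."""
    if rate >= 14.4:
        return "fast-burn"
    if rate >= 6.0:
        return "slow-burn"
    return "ok"


def budget_remaining_pct(errors_30d: int, total_30d: int) -> float:
    """Percentage of the 30-day error budget still unspent (floored at 0)."""
    if total_30d == 0:
        return 100.0
    allowed_errors = total_30d * ERROR_BUDGET
    return max(0.0, 100.0 * (1.0 - errors_30d / allowed_errors))


# Hypothetical counts: 800 failed requests out of 10,000 in the current window.
rate = burn_rate(800, 10_000)
print(f"burn rate x{rate:.1f} -> {classify(rate)}")
print(f"budget remaining: {budget_remaining_pct(1_000, 1_000_000):.0f}%")
```

A multiplier of 1 means the budget would last exactly the 30-day window; anything above it spends the budget early.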

Row 2 -- Slow? (question 2)

  • line chart: p50 / p95 / p99 request latency (last 6h), with SLO threshold line at p95 = 300 ms
  • line chart: queue publish latency (last 6h)
  • line chart: downstream notification API latency (last 6h)
  • tile: saturation proxy (active worker count vs max; connection-pool utilization)
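To make the percentile lines concrete, here is a minimal sketch that computes p50 / p95 / p99 from raw latency samples using the nearest-rank method and compares p95 with the 300 ms threshold. A metrics backend will usually estimate percentiles from histograms rather than raw samples, so treat this as an illustration of the definition only. The sample latencies are made up.

```python
import math
from typing import Sequence

P95_SLO_MS = 300  # threshold line from the latency chart above


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: mostly fast requests with a slow tail.
latencies_ms = [120] * 90 + [280] * 6 + [900] * 4

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms")
print("p95 within SLO:", p95 <= P95_SLO_MS)
```

In this made-up window p95 stays under the threshold while p99 does not, which is why the chart plots all three lines rather than p95 alone.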

Row 3 -- Failing whom? (question 3)

  • table: top 10 endpoints by error rate (last 1h), with count
  • table: top 10 providers / tenants by error rate (last 1h), with count
  • bar chart: errors by reason code (last 1h), sourced from the reason field in structured logs
  • table: top 10 traces over latency SLI threshold (last 1h), linked to the tracing UI
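The "top 10 by error rate" tables and the reason-code bar chart are group-and-count queries over request records. The sketch below shows that aggregation in plain Python. The record shape (`endpoint`, `tenant`, `status`, `reason`) is an assumption made for illustration, and treating status >= 500 as an error follows the 5xx definition used in Row 1.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_by_error_rate(events: List[Dict], key: str, n: int = 10) -> List[Tuple[str, float, int]]:
    """Return up to n (value, error_rate, error_count) rows, worst first."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for event in events:
        totals[event[key]] += 1
        if event["status"] >= 500:
            errors[event[key]] += 1
    rows = [(k, errors[k] / totals[k], errors[k]) for k in totals if errors[k]]
    # Sort by rate, then by count, then by name so the order is stable.
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows[:n]


# Hypothetical structured-log records.
events = [
    {"endpoint": "/hooks/a", "tenant": "t1", "status": 500, "reason": "timeout"},
    {"endpoint": "/hooks/a", "tenant": "t1", "status": 200, "reason": ""},
    {"endpoint": "/hooks/b", "tenant": "t2", "status": 503, "reason": "timeout"},
    {"endpoint": "/hooks/b", "tenant": "t2", "status": 502, "reason": "upstream"},
    {"endpoint": "/hooks/c", "tenant": "t1", "status": 200, "reason": ""},
]

by_endpoint = top_by_error_rate(events, "endpoint")
by_tenant = top_by_error_rate(events, "tenant")
errors_by_reason = Counter(e["reason"] for e in events if e["status"] >= 500)

print(by_endpoint)
print(by_tenant)
print(errors_by_reason.most_common())
```

Showing the count next to the rate matters: an endpoint at 100% errors with two requests is a different problem from one at 5% with twenty thousand.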

Thirteen panels total across three rows (5 + 4 + 4). No CPU graph, no memory graph, no "infrastructure overview." Not because those are useless -- they are not -- but because they do not answer the three questions this dashboard is for.

Rendering spec (opinionated):

  • Top row always visible without scrolling on a 14" laptop. No exceptions.
  • Red/yellow/green thresholds on every SLI tile so the answer is "green" before a human has to compute it.
  • Time range control defaults to "last 1 hour"; once you click into a problem, the 24h and 7d toggles give you context.
  • Every panel has a one-sentence description of the question it answers. Grafana supports panel descriptions; use them.
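The threshold rule for an SLI tile reduces to a small pure function. The sketch below uses the availability bands from Row 1 (green >= 99.5, yellow 99.0-99.5, red < 99.0); the function name is hypothetical, and in practice the same bands would be entered in your dashboarding tool's threshold settings rather than computed in code.

```python
def sli_colour(availability_pct: float,
               green_at: float = 99.5,
               yellow_at: float = 99.0) -> str:
    """Return the tile colour for an availability SLI value in percent."""
    if availability_pct >= green_at:
        return "green"
    if availability_pct >= yellow_at:
        return "yellow"
    return "red"


for value in (99.7, 99.5, 99.2, 98.9):
    print(value, sli_colour(value))
```

Keeping the band edges as parameters means the colours stay anchored to the SLO document: change the target there, change the defaults here.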

Common Confusion / Misconceptions

"More panels = more visibility." More panels = more time before you find the answer. The cost of a panel is attention, not pixels. Every panel you add to a dashboard slows the others down. The "Service Health -- Golden Signals" template on Grafana's dashboard library fits on one screen intentionally.

"We need a CPU graph on the main dashboard." If CPU affects user experience, it shows up in question 2 (latency) or question 1 (errors). If it does not, it is a capacity-planning metric, not a live-ops one. Put it on a separate "capacity" dashboard.

"The dashboard is for leadership." That is a different dashboard. This one is for you, at 2 a.m., with one monitor open. Do not conflate the two. An executive dashboard shows trend and spend; an operations dashboard shows now.

"We'll build it later when we know what to watch." You know now. The three questions above are the three questions every on-call operator asks. Build the dashboard, then adjust the panels as you learn.

"One dashboard per service." That scales badly past two services. A capstone dashboard is one page that covers the whole SLO path, not one page per box on the architecture diagram. Drill-down dashboards for individual services are a second tier; they are opened from the main dashboard via panel links, not instead of it.

"We'll just look at the underlying metrics during incidents." You will not. At 2 a.m. you open the dashboard you open every day. If that dashboard is not the one that answers the three questions, the three questions will not get answered.

How To Use It (In Your Capstone)

  1. Open a blank dashboard. Add three row headers: "Healthy?", "Slow?", "Failing whom?"
  2. For each row, write the single question it must answer in the header description.
  3. Add panels only when they directly answer the header question, starting with the SLO-linked tiles from concept 1.
  4. Hide or delete any panel that cannot be defended against a header question.
  5. Add explicit thresholds and colour coding to every SLI tile, so the answer is visible before interpretation.
  6. Open the dashboard at 3 a.m. once (dry-run, not a real incident). If any answer takes more than 10 seconds, rework the row.
  7. Link the dashboard URL from library/raw/slo.md, from each runbook, and from the PRR checklist. A dashboard without cross-links is unused by design.

A worked "dry-run" at 3 a.m.:

Step  Panel I look at                         Decision
0s    Row 1, tile 1 (availability SLI, 1h)    If green: "am I sure the pager was right?"
3s    Row 1, burn-rate tile                   If >6x and sustained: real, continue. If spike that already recovered: ticket not page.
5s    Row 1, dependency status tile           If a dependency is red: prioritise that runbook.
7s    Row 3, top errors table                 If a single reason=X dominates: that is the incident.
10s   Open the linked runbook from the panel  Begin runbook; stabilise first, diagnose second.

That is the ten-second budget, annotated. Print the sequence into the runbook itself so a half-awake operator does not have to remember it.
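The same sequence can be written down as a decision function, which is one way to check that the steps are unambiguous before printing them into a runbook. This is a sketch only: the function name, its parameters, and the wording of the returned strings are hypothetical, and it simplifies the second step by treating any burn that is not both above 6x and sustained as "ticket, not page".

```python
from typing import List, Optional


def triage(availability_green: bool,
           burn_rate: float,
           burn_sustained: bool,
           red_dependency: Optional[str] = None,
           dominant_reason: Optional[str] = None) -> List[str]:
    """Return the ordered decisions for the ten-second dry-run."""
    decisions = []
    # 0s: availability tile
    if availability_green:
        decisions.append("availability green: confirm the pager was right")
    # 3s: burn-rate tile (simplified: anything not >6x and sustained is a ticket)
    if not (burn_rate > 6 and burn_sustained):
        decisions.append("spike already recovered: ticket, not page")
        return decisions
    decisions.append("burn is real: continue")
    # 5s: dependency status tile
    if red_dependency is not None:
        decisions.append(f"prioritise the runbook for {red_dependency}")
    # 7s: top errors table
    if dominant_reason is not None:
        decisions.append(f"incident is reason={dominant_reason}")
    # 10s: hand over to the runbook
    decisions.append("open the linked runbook: stabilise first, diagnose second")
    return decisions


for step in triage(False, 15.0, True, red_dependency="queue", dominant_reason="timeout"):
    print(step)
```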

Check Yourself

  1. Why does the dashboard start with the SLO panel, not the CPU panel?
  2. Which of the three questions would a "top 10 slowest tenants" panel answer?
  3. Give one panel you have on your current dashboard that does not answer any of the three questions. What should happen to it?
  4. Why is a colour-coded SLI tile more valuable than an uncoloured chart showing the same number?
  5. What is the difference between this dashboard and the one you would hand to a product executive?
  6. Why must each panel carry a one-sentence description of the question it answers?

Mini Drill or Application (Capstone-scoped)

  1. Set aside 30 minutes for the whole drill. Open your dashboarding tool and create a new dashboard called capstone-live.
  2. Add three rows with exactly the three question headers above.
  3. Populate each row with no more than five panels. Each panel must have a one-line description of which question it answers.
  4. Add red/yellow/green thresholds to every SLI tile, anchored to the targets in library/raw/slo.md.
  5. Export or screenshot the dashboard and link it from the SLO document, the runbooks (cluster 5), and the PRR checklist (concept 15).
  6. Delete or move every panel on your existing dashboard that does not answer one of the three questions. Record the deletions in a one-line CHANGELOG entry.
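The structural rules in this drill (three named rows, a small panel count per row, a description on every panel) can be linted before you link the dashboard anywhere. The sketch below runs against a simplified, hypothetical dictionary layout, not any tool's real export format. Adapt the field names to whatever your dashboarding tool actually exports.

```python
from typing import Dict, List

EXPECTED_ROWS = ["Healthy?", "Slow?", "Failing whom?"]
MAX_PANELS_PER_ROW = 5


def lint_dashboard(dashboard: Dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the drill rules pass."""
    problems = []
    titles = [row["title"] for row in dashboard["rows"]]
    if titles != EXPECTED_ROWS:
        problems.append(f"rows are {titles}, expected {EXPECTED_ROWS}")
    for row in dashboard["rows"]:
        if len(row["panels"]) > MAX_PANELS_PER_ROW:
            problems.append(f"row '{row['title']}' has {len(row['panels'])} panels")
        for panel in row["panels"]:
            if not panel.get("description", "").strip():
                problems.append(f"panel '{panel['title']}' has no description")
    return problems


# Hypothetical, simplified dashboard layout for illustration.
dashboard = {
    "title": "capstone-live",
    "rows": [
        {"title": "Healthy?", "panels": [
            {"title": "availability SLI (1h)", "description": "Are we meeting the SLO right now?"},
        ]},
        {"title": "Slow?", "panels": [
            {"title": "p95 latency", "description": ""},
        ]},
        {"title": "Failing whom?", "panels": [
            {"title": "top endpoints by error rate", "description": "Which endpoints bear the pain?"},
        ]},
    ],
}

for problem in lint_dashboard(dashboard):
    print(problem)
```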

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.