Dashboards That Answer Questions, Not Decorations
What This Concept Is
A dashboard is a saved view of signals from the three pillars (metrics, logs, traces) arranged to answer specific questions about a system. The critical word is "questions". A dashboard that shows every metric you have is a wall decoration. A dashboard that answers "is my service healthy right now and, if not, where is the failure?" is an operational tool.
Every useful dashboard has three properties:
- A named audience. Service owner, on-call, incident commander, leadership. The same data often needs different dashboards.
- A small set of questions. Three to six, written down. No panel should exist unless it answers one of them.
- A direction of pivot. Each panel should suggest a next step: "click here to see the slow trace", "open this log query", "compare with the other region".
The Google SRE book's monitoring chapter reinforces this: the purpose of a monitoring system is to help humans decide something -- is this worth waking up for, is this release healthy, which subsystem is the bottleneck? A dashboard that does not lead to a decision is a distraction.
Why It Matters Here
In an incident, the dashboard is read by a tired human under time pressure. Every visual element is competing for attention. Decoration -- pretty gauges, redundant panels, low-signal charts -- actively costs you. Dashboards that were fun to build are often miserable to use.
Dashboards also encode institutional memory. A new on-call engineer who can open "is the service healthy" and answer yes/no in 30 seconds is onboarding in real time. A dashboard that requires a senior engineer to interpret every panel is not actually shared knowledge.
Concrete Example
Bad dashboard (real shape, anonymized):
- 24 panels, 6 rows
- CPU, memory, disk, network for every pod individually
- a big single-stat showing "total requests today" (irrelevant during an incident)
- a tag cloud of error messages
- three identical latency percentile panels for three environments stacked vertically
- no SLO reference lines
Nobody knows what "healthy" looks like. Every incident turns into a tour.
Good dashboard: "Checkout service -- is it healthy for users right now?"
Six panels, each tied to a question:
- User-visible success rate (last 1h) -- percentage of POST /checkout requests returning 2xx, with a reference line at the SLO (e.g. 99.5%). One number, big. Green/red. PromQL:
    sum(rate(http_requests_total{route="/checkout", status_class="2xx"}[5m])) / sum(rate(http_requests_total{route="/checkout"}[5m]))
- User-visible latency -- p50/p95/p99 over 1h, with exemplars attached to the p99 bucket for click-through to a slow trace. PromQL:
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])))
- Error rate by class (4xx vs 5xx) -- fast read on "is this us or them?"
- Saturation -- one of inflight-requests/CPU/queue-depth, whichever is the known bottleneck, with a threshold line.
- Upstream dependency status -- payments, inventory, orders -- each with its own success rate and a link to its own dashboard.
- Active incidents and recent deploys -- annotations on the time range showing deploys and known incidents.
That dashboard answers: healthy yes/no; if no, is it latency or errors; if errors, 4xx or 5xx; where is the user pain; who upstream is breaking; and is this correlated with a deploy.
Layout discipline: SLO-linked headline at the top, error-class and latency below, saturation and upstream side-by-side, annotations timeline at the bottom. A tired operator scans top-to-bottom and stops reading the moment the top panel is green.
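The latency panel in this example depends on histogram_quantile, which estimates a percentile from bucket counts rather than from raw request durations. A minimal Python sketch of that estimate follows. It assumes cumulative buckets sorted by upper bound and ending in +Inf; the bucket boundaries and counts are made-up illustration values, and the function is a simplification of what Prometheus does, not its implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label on a classic Prometheus histogram.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if bound == math.inf or in_bucket == 0:
                # Overflow (or empty) bucket: the best available answer
                # is the highest bound already passed.
                return prev_bound
            # Linear interpolation inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical checkout latency buckets in seconds: (le, cumulative count).
SAMPLE_BUCKETS = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, SAMPLE_BUCKETS)
```

The estimate is only as precise as the bucket layout. Here the 99th percentile falls somewhere between 0.5s and 1.0s and the panel shows an interpolated value, so it helps to place a bucket boundary at the latency the SLO cares about.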
Common Confusion / Misconception
"One dashboard per service." You need a dashboard for every question. Sometimes a fleet dashboard spans many services; sometimes a feature dashboard spans the critical-path services for one user flow (checkout crosses cart, inventory, payments, orders -- one dashboard, one question).
"Dashboards replace SLOs." A dashboard visualizes current state and trends; an SLO is an objective with a budget over a long window (28 days is common). Dashboards should reference SLOs (threshold lines, remaining-error-budget panels, burn-rate gauges) but they are not a replacement. The Google SRE Workbook chapter on implementing SLOs is the canonical treatment.
"Put everything on one dashboard." At some point that dashboard becomes 4 screens tall, and nobody reads below the fold. Prefer a small primary dashboard ("is this healthy?") and linked deeper dashboards for drill-downs ("per-route latency", "per-dependency error rate"). Grafana's own best-practices guide is explicit about this tree-of-dashboards pattern.
"Dashboardize first, alert later." Do not dashboardize metrics you do not also alert on. If the chart is red and no page fires, you are relying on someone staring at the screen. If the chart is green but an alert says the system is broken, the chart is misleading. Alerts and panels should be derived from the same SLI definitions.
"Pretty visuals beat ugly truth." Gauges with red/yellow/green are enjoyable to build and miserable to read at 3 a.m. Prefer single large numbers (success rate %), sparklines for trend, and threshold lines for context. Reserve bar charts for comparison across dimensions (per-region, per-route); use time-series for time-based behavior.
"Annotations are decoration." Annotations (deploy markers, incident ranges, feature-flag changes) are how on-call correlates "when did this start" with "what changed". A dashboard without deploy annotations is missing the single most useful incident-diagnosis signal.
How To Use It
When you design a dashboard:
- Write the questions first. Three to six. Specific: "is the checkout success rate above SLO in the last hour?" not "health".
- Write the intended audience next to it.
- Pick one panel per question. If you cannot build the panel with existing metrics, file a work item for instrumentation (Concept 10).
- Add pivot links: trace exemplars on latency, log query on errors, upstream dashboard link on dependency panels.
- Annotate with deploys and incidents so that time correlation is visible.
- Do a dry run: have a new engineer open the dashboard during a staged failure and see whether they can diagnose in under five minutes.
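The first four steps above can be checked mechanically if the dashboard is described as data before it is built. The Python sketch below is a hypothetical design-review aid, not a feature of any dashboard tool: the Panel and Dashboard classes and the lint function are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Panel:
    title: str
    answers: str          # must match one entry in Dashboard.questions
    pivot_link: str = ""  # where the reader goes next; empty means none


@dataclass
class Dashboard:
    title: str
    audience: str
    questions: list = field(default_factory=list)
    panels: list = field(default_factory=list)


def lint(dashboard):
    """Return a list of design problems; an empty list means the spec passes."""
    problems = []
    if not dashboard.audience.strip():
        problems.append("no named audience")
    if not 3 <= len(dashboard.questions) <= 6:
        problems.append("expected three to six questions")
    for panel in dashboard.panels:
        if panel.answers not in dashboard.questions:
            problems.append(f"panel '{panel.title}' answers no listed question")
        if not panel.pivot_link:
            problems.append(f"panel '{panel.title}' has no pivot link")
    answered = {panel.answers for panel in dashboard.panels}
    for question in dashboard.questions:
        if question not in answered:
            problems.append(f"no panel answers: {question}")
    return problems
```

A spec containing a "total requests today" panel with no question attached fails this check, which is the point: the panel has to justify itself before anyone builds it. The dry run in the last step still needs a human; a lint pass cannot tell whether the questions are the right ones.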
Check Yourself
- Why is "total requests today" usually the wrong panel during an incident?
- What is the difference between a dashboard and an SLO, and how should they relate?
- How do exemplars change what a latency panel is worth?
Mini Drill or Application
Pick a service you operate. Write the three questions its primary dashboard must answer. Then sketch six panels max and say, for each, which question it answers. If a panel does not answer a question, delete it.
See also (external)
- Google SRE Book: Monitoring Distributed Systems -- the four golden signals (latency, traffic, errors, saturation) and the design principles that make monitoring useful.
- Google SRE Workbook: Implementing SLOs -- how SLIs, SLOs, and error budgets become the reference lines on every dashboard worth building.
- Grafana: Best practices for creating dashboards -- practical panel-count, layout, variable, and link guidance from the tool's maintainers.
- Honeycomb: Observability Glossary / 101 -- working definitions of alerts, dashboards, SLOs and how they compose.
- Grafana Labs: The RED Method -- Tom Wilkie's canonical article on RED dashboarding, including the 3-panel template.
- Building Secure and Reliable Systems, Ch. 16: Disaster Planning -- how dashboards feed into incident command and decision-making under pressure.
Depth Path
Source Backbone
Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.
- Building Secure and Reliable Systems -- primary book backbone for security/reliability tradeoffs.
- Software Engineering at Google -- support for operational engineering and process.
- The Linux Command Line -- support for operational investigation and automation.