Alerting on Symptoms, Not Causes; the Silent Runner Problem

What This Concept Is

A symptom alert fires when something is wrong from the user's perspective: success rate dropped below SLO, latency breached, queue is backing up faster than it can drain, data pipeline is behind by more than an hour.

A cause alert fires on an internal proxy for a problem: CPU above 90%, memory above threshold, "restart loop detected", "individual pod went unhealthy", "one of N replicas is down".

Both have a place. The discipline is: symptom alerts are what page on-call; cause alerts are diagnostic signals used during investigation or for trend and capacity planning. The Google SRE book's monitoring chapter ("Monitoring Distributed Systems") is the canonical statement of this; its design push toward "simpler and more robust" alerts and its "Four Golden Signals" (latency, traffic, errors, saturation) map directly onto symptom-first alerting.

The silent runner problem is the complement. A system can be "silently" broken: jobs not running, batch not producing output, queue not draining, scheduled task skipped. CPU is low, memory is fine, no 5xxs, and the work is not getting done. The only way to catch silent runners is to alert on the absence of progress -- a heartbeat missed, a counter that failed to advance, a data freshness SLA exceeded.
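
Absence-of-progress detection can be sketched in a few lines of Python. This is an illustrative sketch rather than any monitoring product's API: the names FreshnessExpectation and is_silent are hypothetical, and it assumes the job records a Unix timestamp each time it completes useful work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FreshnessExpectation:
    """How old the last sign of progress may be before we alert (hypothetical type)."""
    name: str
    max_age_seconds: float


def is_silent(expectation: FreshnessExpectation,
              last_progress_ts: Optional[float],
              now: float) -> bool:
    """Return True when no progress has been seen recently enough.

    A missing timestamp (the job has never reported) counts as silent:
    the absence of the signal is itself the failure we want to catch.
    """
    if last_progress_ts is None:
        return True
    return (now - last_progress_ts) > expectation.max_age_seconds
```

Under these assumptions, a job with a 60-minute expectation whose last batch landed 70 minutes ago is reported as silent even though it never raised an error, and so is a job that has never reported at all.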

Why It Matters Here

Alert quality decides whether on-call is sustainable. The SRE book's monitoring chapter is blunt about this: paging engineers is expensive, and alert fatigue is a real cost. Every noisy alert makes the next real one less likely to be handled correctly. Every quiet failure (silent runner) teaches the team to distrust the system's green state.

Alert design is also where security and reliability meet: the payment_declined spike, the anomalous egress volume, and the CSPM CRITICAL finding are all symptom alerts -- they describe user/business impact, not an internal cause.

Concrete Example: 2 Bad and 2 Good Alerts

Bad alert 1: "CPU > 80% for 5 minutes on any pod."

Why it is bad: it is a cause, not a symptom, and CPU high is not the same as user harm. Modern services sometimes should run hot. Pages from this alert train the on-call to ignore it. When the real incident comes, nobody rushes to a CPU alert.

Better: page on success_rate < SLO or p99_latency > SLO for the service those pods belong to. Let CPU be a diagnostic panel, not a pager.

Bad alert 2: "One of 5 replicas is unhealthy."

Why it is bad: the system is designed for replica failure. An alert at the first replica loss is a cause, and a loud one. If the service keeps serving, there is no user symptom yet.

Better: (replicas_available < min_viable_count AND user_success_rate degraded) OR no_healthy_replicas. Replica health is the cause; whether users see errors is the symptom.
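
One reading of that replica rule, written out as a small Python predicate. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and "degraded" is interpreted here as the success rate falling below the SLO target.

```python
def should_page_on_replicas(available: int,
                            min_viable: int,
                            success_rate: float,
                            slo_target: float) -> bool:
    """Decide whether replica loss deserves a page.

    Pages when nothing healthy is left, or when capacity is below the
    minimum viable count AND users are already seeing a degraded success rate.
    Losing a single replica with healthy user traffic does not page.
    """
    if available == 0:
        return True  # no healthy replicas: users cannot be served
    return available < min_viable and success_rate < slo_target
```

With five replicas and a minimum viable count of two, losing one replica while the success rate stays at 99.9% returns False; dropping to one replica with the success rate at 98% against a 99.5% target returns True.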

Good alert 1: "POST /checkout success rate < 99.5% over the last 10 minutes."

Why it is good: directly user-visible, tied to an SLO, includes a time window that suppresses single-blip noise, scoped to an endpoint that matters.

Good alert 2: "Order export pipeline has produced no new batch in the last 70 minutes (SLA: 60 minutes)."

Why it is good: catches the silent-runner case. The job was scheduled; it did not fail loudly; it simply did not run. The alert fires on absence of progress, which no CPU/memory alert can detect.

As Prometheus alerting rules:

groups:
- name: checkout.symptom-alerts
  rules:
  - alert: CheckoutSuccessRateBelowSLO
    expr: |
      sum(rate(http_requests_total{route="/checkout",status_class="2xx"}[10m]))
        /
      sum(rate(http_requests_total{route="/checkout"}[10m]))
      < 0.995
    for: 10m
    labels:
      severity: page
      team: checkout
    annotations:
      summary: "Checkout success rate < 99.5% for 10m"
      runbook_url: "https://runbooks/checkout-success-rate"

  - alert: ExportBatchStale
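    # Caveat: if the metric vanishes entirely, this expression returns no series
    # and the alert never fires; pair it with an absent() check on the same metric.
    expr: time() - max(export_last_batch_timestamp_seconds) > 4200  # 70m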
    for: 5m
    labels:
      severity: page
      team: data-platform
    annotations:
      summary: "Export pipeline has produced no new batch in >70m (SLA 60m)"
      runbook_url: "https://runbooks/export-freshness"

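The 10m rate window smooths single-request blips, and for: 10m additionally requires the condition to hold continuously before the page fires. The runbook_url annotation is enforced by a fitness function (a CI check) that rejects rules missing it.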
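
A fitness function of that kind might look like the following Python sketch. It is not the author's actual check: the function name rules_missing_runbook is hypothetical, and it assumes the rule file has already been parsed (with whatever YAML parser the CI job uses) into nested dicts and lists shaped like the rules above.

```python
from typing import List


def rules_missing_runbook(rule_file: dict) -> List[str]:
    """Return "group/alert" names for alerting rules without a runbook_url annotation.

    Recording rules (entries with no "alert" key) are skipped because they never page.
    """
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rule, not an alert
            annotations = rule.get("annotations") or {}
            url = annotations.get("runbook_url") or ""
            if not str(url).strip():
                missing.append(f'{group.get("name", "<unnamed>")}/{rule["alert"]}')
    return missing
```

In CI, the build would fail whenever the returned list is non-empty, with the list itself serving as the error message.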

Common Confusion / Misconception

"More alerts = more safety." The opposite. Past a small number of alerts per person per week, on-call starts triaging by ignoring. The Google SRE book explicitly calls alert volume a cost to control, not a coverage metric. A healthy rotation has single-digit pages per week per person.

"Symptom vs cause is absolute." It is context-dependent. A disk-full alert is a cause for the application, but a symptom for the storage team. The question is always "who pages, and does this audience care whether users are being harmed right now?"

"Every symptom alert needs a runbook." Yes -- and that is a hard rule, not a best effort. An alert that wakes someone up and does not tell them what to do is just noise with human cost. The runbook_url annotation on the alert is enforced by CI (a fitness function that rejects rules without one). This is the subject of Concept 15.

"Silent-runner alerts are hard because they are exotic." They are hard because they require expected-behavior modeling: "a batch should land every hour", "a heartbeat metric should advance at least every 60s", "this DAG should complete in under 20 min". If you do not encode the expectation somewhere (a last_batch_timestamp gauge, a heartbeat_total counter), the alert cannot exist. Emit the heartbeat from the job; alert on its age.

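The "counter should advance" expectation can be sketched in Python as follows. This is a minimal illustration, not a Prometheus feature: counter_stalled is a hypothetical helper, and it assumes you have recent (timestamp, value) samples of a heartbeat counter sorted by time.

```python
from typing import List, Tuple


def counter_stalled(samples: List[Tuple[float, float]],
                    now: float,
                    max_gap_seconds: float) -> bool:
    """Return True if the counter has not changed within max_gap_seconds.

    samples: (unix_timestamp, counter_value) pairs, oldest first.
    No samples at all is treated as stalled. Any change in value, including
    a reset after a restart, is treated as a sign of life.
    """
    if not samples:
        return True
    last_change_ts = samples[0][0]  # best known time of progress so far
    for (_, prev_val), (ts, val) in zip(samples, samples[1:]):
        if val != prev_val:
            last_change_ts = ts
    return (now - last_change_ts) > max_gap_seconds
```

A counter that is still being scraped but stuck at the same value is the silent-runner case: the process is up, the metric exists, and nothing is getting done.
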
"Multi-window burn-rate alerts are over-engineering." The SRE Workbook's alerting-on-SLOs chapter makes the case: a single-threshold alert over a single window is either too noisy (short window) or too slow (long window). A two-window burn-rate alert (e.g. "burn rate > 14.4 over 1h and > 14.4 over 5m") gets both fast detection and low false-positive rate. Worth the complexity for user-facing services.

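The two-window check can be worked through in Python. This sketch assumes the usual definition of burn rate (observed error ratio divided by the error budget, 1 - SLO) and a 30-day SLO period; the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn faster than the threshold.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


def budget_fraction_consumed(rate: float,
                             window_hours: float,
                             slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent at this burn rate over the window."""
    return rate * window_hours / slo_period_hours
```

Under the 30-day assumption, a burn rate of 14.4 sustained for one hour spends 14.4 / 720 = 2% of the budget, which is where the threshold comes from. For a 99.5% SLO the page fires only when the error ratio exceeds 7.2% in both windows, so an incident that has already recovered (long window high, short window low) does not page.
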
"Security alerts follow different rules." The principles are the same. A CWPP alert on "shell spawned in container" is a symptom of a possible compromise; it should page, carry a runbook, and pivot to an investigation playbook. Resource-usage signals from security tooling (CPU spikes on an agent) are causes, not pages.

How To Use It

For every service, fill out an alert sheet:

Alert name	Symptom or cause	User-visible effect	Window	Runbook
Checkout success rate SLO	Symptom	Users cannot buy	10 min	`/runbooks/checkout-success-rate.md`
Checkout p99 latency SLO	Symptom	Users see slow checkout	15 min	`/runbooks/checkout-latency.md`
Export batch freshness	Symptom (silent runner)	Downstream reports stale	70 min	`/runbooks/export-freshness.md`
1 of 5 replicas unhealthy	Cause (no page)	None yet	5 min	Diagnostic only, no page

Every row that pages should be a symptom with a runbook; cause rows stay diagnostic only.
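
The sheet can also be checked mechanically. The following Python sketch is one possible shape for that check; AlertRow and sheet_violations are hypothetical names, and the "pages" flag is an assumed extra column recording whether the alert is routed to the pager.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlertRow:
    name: str
    kind: str                      # "symptom" or "cause"
    pages: bool                    # assumed column: does this alert page on-call?
    runbook: Optional[str] = None  # path to the runbook, if any


def sheet_violations(rows: List[AlertRow]) -> List[str]:
    """List every way the sheet breaks the rule: only symptoms page, and pages have runbooks."""
    problems = []
    for row in rows:
        if row.pages and row.kind != "symptom":
            problems.append(f"{row.name}: cause alerts must not page")
        if row.pages and not row.runbook:
            problems.append(f"{row.name}: paging alert has no runbook")
    return problems
```

An empty result means the sheet is consistent; a cause alert wired to the pager shows up by name.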

Check Yourself

Give one alert from a system you have seen that was a cause pretending to be a symptom. What would the symptom alert have been instead?
What is the silent-runner problem and what signal detects it?
Why is there a time window on a symptom alert, and how do you pick it?

Mini Drill or Application

For a service you know, write 4 alerts: 2 must be symptoms (with SLO reference and a window), at least 1 must catch a silent-runner case (freshness or heartbeat), and each must name the runbook file that will exist in the next concept's output.

Depth Path

Source Backbone

Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.

Building Secure and Reliable Systems - primary book backbone for security/reliability tradeoffs.
Software Engineering at Google - support for operational engineering and process.
The Linux Command Line - support for operational investigation and automation.
