Incident Lifecycle: Detect, Triage, Mitigate, Resolve, Review

What This Concept Is

An incident moves through five phases. Naming them prevents the common mistake of trying to do them all at once.

  1. Detect: a signal indicates something is wrong. Usually an SLO-burn alert or a customer report. Clock starts.
  2. Triage: assess scope, severity, and who needs to be involved. Decide whether to page a broader team, open a war room, or handle as a single responder.
  3. Mitigate: make the user-visible impact stop. This is explicitly not the same as "understand and fix." Roll back the deploy, flip the feature flag, shed traffic, fail over the region. Restore service now; understand later.
  4. Resolve: the root condition is fixed (not just masked). The temporary feature-flag override is removed; a corrected version of the bad deploy is rolled forward; the under-provisioned cluster is resized.
  5. Review: postmortem (next concept). The incident becomes learning material.
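
One way to make the phase boundaries explicit is to model them as a small state machine. The sketch below is illustrative only: the `Phase` and `Incident` names are hypothetical, and it assumes a strictly linear lifecycle, whereas a real incident may loop back to triage when a mitigation fails.

```python
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    REVIEW = 5


class Incident:
    """Tracks which lifecycle phase an incident is in (hypothetical helper)."""

    def __init__(self) -> None:
        # Every incident starts when a signal is detected.
        self.phase = Phase.DETECT

    def advance(self, target: Phase) -> None:
        # Assumption: only the immediately following phase is allowed, so
        # nobody jumps from triage to the permanent fix or skips the review.
        if target.value != self.phase.value + 1:
            raise ValueError(
                f"cannot go from {self.phase.name} to {target.name}"
            )
        self.phase = target
```

Calling `advance(Phase.RESOLVE)` while still in triage raises an error, which mirrors the rule that mitigation comes before the permanent fix.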

Two clocks matter:

  • MTTD (Mean Time To Detect): from cause to detection. Ideally minutes.
  • MTTR (Mean Time To Recover): from detection to user-visible mitigation. Ideally minutes-to-hours, not hours-to-days.

Incident Commander (IC) is a role, not a seniority level. The IC coordinates: keeps a running timeline, asks "what are we doing now and what is the next decision," assigns sub-tasks, and keeps the team from tunneling on the fix.

Why It Matters Here

Many unnecessarily long outages trace back to confused phase boundaries. Teams try to understand before mitigating (users wait). Teams attempt a forward fix before rolling back (impact continues and the bug can get worse). Teams skip review (the same incident recurs six weeks later).

Mitigation is a distinct skill: "what is the fastest action that stops user harm with acceptable side effects?" The answer is almost never "deploy a code fix." It is almost always "revert, failover, or shed."

The Google SRE Book dedicates two chapters to this: one on managing incidents, one on emergency response. The key insight: the discipline of naming a phase keeps people focused on the next decision, not the whole problem.

Concrete Example

14:07 UTC, a latency alert fires. Checkout service p99 has crossed 600ms (SLO is 500ms).

Detect (14:07): the alert pages the on-call engineer. Time to detect is unknown at this point; it can only be computed once the cause is found.

Triage (14:08-14:12): on-call opens the dashboard, confirms impact (5% of users affected), opens an incident channel, declares a Sev-2, appoints themselves IC. Pages a second engineer for support. Notes that a fraud-check service deploy happened at 14:05.

Mitigate (14:12-14:20): IC decides to roll back the fraud-check deploy. Rollback succeeds at 14:18. p99 returns to normal by 14:20. User-visible impact stops here. Time to recover ≈ 13 minutes from detection; total elapsed ≈ 15 minutes from the actual cause (the 14:05 deploy).

Resolve (14:20-16:00): team inspects the rolled-back commit. Finds a 5x slowdown in the fraud-check's new algorithm under real traffic patterns (did not appear in staging). A fix is written, reviewed, and tested with synthetic load; deployed to a canary, validated, and rolled forward by 16:00.

Review (next day): blameless postmortem. Timeline reconstructed, causes identified, action items filed. Not "who deployed the bug" but "why did our pre-deploy load test not catch a 5x slowdown, and what signal would have?"
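
The clocks for this example can be worked out directly from the timeline. The sketch below uses the timestamps given above; the calendar date is an arbitrary placeholder, and the helper names `at` and `minutes` are hypothetical.

```python
from datetime import datetime, timedelta


def at(hhmm: str) -> datetime:
    """Parse an HH:MM time on a placeholder date (the date itself is arbitrary)."""
    return datetime.strptime(f"2024-01-01 {hhmm}", "%Y-%m-%d %H:%M")


def minutes(delta: timedelta) -> int:
    """Whole minutes in a time difference."""
    return int(delta.total_seconds() // 60)


cause = at("14:05")      # fraud-check deploy
detected = at("14:07")   # latency alert pages on-call
mitigated = at("14:20")  # p99 back to normal
resolved = at("16:00")   # corrected version rolled forward

time_to_detect = minutes(detected - cause)       # 2
time_to_recover = minutes(mitigated - detected)  # 13
user_impact = minutes(mitigated - cause)         # 15
time_to_resolve = minutes(resolved - mitigated)  # 100

print(f"detect={time_to_detect}m recover={time_to_recover}m "
      f"impact={user_impact}m resolve={time_to_resolve}m")
```

Users were affected for roughly 15 minutes, while the permanent fix took another 100 minutes after impact had already stopped. That gap is the payoff of mitigating first.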

Common Confusion / Misconception

"We need to understand the root cause before we fix it." False when users are affected. Mitigation exists to stop user pain. Understanding comes after. The pilot does not diagnose the engine in flight; they land the plane.

"The Incident Commander is the person who fixes the bug." No. The IC coordinates. If the IC is also hands-on-keyboard, they are not commanding; they are debugging. For any incident lasting more than a few minutes or involving more than one person, split the roles.

"We resolved it - no need for a postmortem." Every incident above a threshold (Sev-2 and up, typically) gets a postmortem. Even "non-user-facing" near-misses merit a lightweight review. The review is how individual debugging becomes organizational learning.

"Paging on causes is fine." Paging on causes produces alert fatigue and misses real user impact. Page on symptoms (SLO burn, error rate, user-visible latency). Use causes as investigation inputs once a page has fired.

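A symptom-based page can be as simple as a burn-rate check on the error budget. The sketch below is a minimal illustration, not a production alerting rule: the function names are hypothetical, and the threshold of 10 is an assumed value to be tuned per service.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A value of 1.0 means the budget lasts exactly the SLO window;
    higher values mean it runs out proportionally sooner.
    """
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratio: float, slo_target: float,
                threshold: float = 10.0) -> bool:
    """Page only when the symptom (budget burn) is severe enough.

    The default threshold is an illustrative assumption.
    """
    return burn_rate(error_ratio, slo_target) >= threshold


# With a 99.9% SLO, a 2% error ratio burns the budget about 20x too fast.
print(should_page(error_ratio=0.02, slo_target=0.999))
```

The check says nothing about why errors are rising; the cause (a deploy, a full disk, a bad dependency) is something the responder looks up after the page fires.
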
How To Use It

For every service:

  1. Define severity levels. Sev-1: major user impact, all-hands; Sev-2: significant impact, on-call + support; Sev-3: limited impact, business-hours. Document them.
  2. Define paging policy. Only symptoms. Only actionable. Only to people who can actually help in the next 10 minutes.
  3. Name the IC role explicitly. First responder declares themselves IC or hands off. The IC tracks decisions, time, and hands out tasks.
  4. Practice mitigation. Feature flags, safe rollbacks, region-failover runbooks, traffic shedding. Rehearse them (see chaos engineering). A mitigation you have never run is a mitigation you will not run at 3am.
  5. Close the loop. Every Sev-1 and Sev-2 incident gets a postmortem within a week (next concept). Action items get owners and dates.
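
Documenting severity levels works best when the definitions live somewhere tooling can read them. The sketch below encodes the three levels from step 1 as data. The `SeverityPolicy` structure and its field names are hypothetical, and the `page_immediately` values are an assumption inferred from "business-hours" handling for Sev-3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    responders: str
    page_immediately: bool     # assumption: Sev-3 waits for business hours
    postmortem_required: bool  # Sev-2 and up get a postmortem


POLICIES = {
    1: SeverityPolicy("major user impact", "all hands", True, True),
    2: SeverityPolicy("significant impact", "on-call + support", True, True),
    3: SeverityPolicy("limited impact", "business hours", False, False),
}


def policy_for(sev: int) -> SeverityPolicy:
    """Look up the documented response for a severity level."""
    try:
        return POLICIES[sev]
    except KeyError:
        raise ValueError(f"unknown severity: Sev-{sev}") from None
```

For example, `policy_for(2).postmortem_required` is `True`, so the Sev-2 from the checkout incident would get a review.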

Check Yourself

  1. Why is "mitigate before you understand" the right default during active user impact?
  2. What is the difference between MTTD and MTTR, and why do you optimize them separately?
  3. Why should the Incident Commander usually not be the person with hands on the keyboard?

Mini Drill or Application

Pick any public outage postmortem. Annotate it with the five phases: mark when each started, what the transition signal was, and how long each lasted. Identify at least one phase that took longer than necessary and propose a change that would have shortened it.

Read This Only If Stuck