Blameless Postmortems and Learning from Incidents
What This Concept Is
A postmortem is the written record of an incident: what happened, when, why, and what we will change. A blameless postmortem is one whose explicit contract is that no individual is named as the cause. Humans make locally reasonable decisions; incidents happen because systems let those decisions lead to failure.
A typical postmortem template (used in similar forms at Google SRE, Etsy, Facebook, and many others) has these sections:
- Summary: what broke, who was affected, for how long, and the SLO or business impact.
- Timeline: minute-by-minute narrative from trigger through resolution, with timestamps.
- Impact: users affected, requests failed, revenue lost, SLO budget consumed.
- Root cause and contributing factors: the proximate trigger and the structural conditions that made it possible.
- What went well / what went poorly: honest reflection on response, not just the technical cause.
- Lessons learned: what the incident teaches about the system, the team, or the process.
- Action items: specific, owned, dated changes that would prevent or shorten the next instance.
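One way to keep every section honest is to treat the template as a record and flag the sections still left blank. The following is a minimal sketch, not a standard schema: the class name, field names, and sample text are assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Postmortem:
    """One field per template section. Field names are illustrative."""
    summary: str = ""
    timeline: list[str] = field(default_factory=list)
    impact: str = ""
    root_cause_and_contributing_factors: list[str] = field(default_factory=list)
    went_well_went_poorly: str = ""
    lessons_learned: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)


def empty_sections(pm: Postmortem) -> list[str]:
    """Return the names of sections that are blank or empty, in template order."""
    missing = []
    for f in fields(pm):
        value = getattr(pm, f.name)
        # Strings count as empty when they contain only whitespace;
        # lists count as empty when they have no entries.
        is_empty = not value.strip() if isinstance(value, str) else not value
        if is_empty:
            missing.append(f.name)
    return missing


# Placeholder draft with only two sections filled in.
draft = Postmortem(
    summary="Checkout degraded after fraud-check deploy.",
    timeline=["deploy started", "alert fired", "rollback complete"],
)
print(empty_sections(draft))
```

A check like this only shows that a section has text in it. Whether the text reflects real thinking still needs a human reviewer.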
Richard Cook's How Complex Systems Fail makes this argument in point 7, on post-accident attribution to a "root cause": because overt failure requires multiple faults, the practitioner actions labeled "root causes" after an accident are only the proximate events in a long chain of causation. A blameless postmortem resists the temptation to stop at the human.
Why It Matters Here
Incidents are expensive. The postmortem is how the organization gets a return on that cost. Without it, the organization pays for the outage twice: once when it happens, and again when the same class of outage happens next quarter.
Blamelessness is not about being polite. It is about getting the truth. If naming the person who pushed the bad config gets them fired, nobody will tell you what the config actually was. If the review is about shared learning, people write down the details that are actually useful. The blameless contract is a truth-optimization mechanism.
This is a supporting concept: it builds on the incident-lifecycle concept that precedes it. Even so, the quality of the postmortems determines whether the rest of the reliability program compounds.
Concrete Example
Consider the outage from the previous concept (fraud-check deploy, 13-minute MTTR).
Bad postmortem (blameful): "Engineer X deployed a change that made fraud-check 5x slower. This caused a 13-minute outage. Engineer X should be more careful during deploys."
Blameless postmortem (systemic):
- Root cause: new fraud-check algorithm O(n^2) in request batch size; manifests at >100 items per batch, which staging did not test.
- Contributing factors:
  - load tests did not exercise real-world batch size distribution
  - canary duration (5 minutes) was shorter than the metric window that would reveal the slowdown (10 minutes)
  - p99 dashboard's SLO alert threshold was set one point above the actual SLO, delaying detection by 90 seconds
  - rollback runbook required 3 manual steps, any of which could fail
- Lessons: our pre-deploy validation is not representative of production traffic shape; our canary is too short; our rollback is not one click.
- Action items:
  - AI-1: Add batch-size distribution replay to load-test harness. Owner: A. Due: 2 weeks.
  - AI-2: Extend canary window to 15 minutes minimum. Owner: B. Due: 1 week.
  - AI-3: One-click rollback script for service X. Owner: C. Due: 3 weeks.
  - AI-4: Review p99 alert threshold alignment for all tier-1 services. Owner: D. Due: 1 month.
The blameful version produces shame and no change. The blameless version produces four improvements.
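The canary factor in this example is the kind of condition a systemic fix can check mechanically. A minimal sketch, assuming a pre-deploy check that compares the two durations; the function name is hypothetical and the values are the ones quoted in the example:

```python
from datetime import timedelta


def canary_covers_metric_window(canary: timedelta, metric_window: timedelta) -> bool:
    """A canary can only reveal a regression if it runs at least as long
    as the window over which the deciding metric is computed."""
    return canary >= metric_window


metric_window = timedelta(minutes=10)  # window that would reveal the slowdown
old_canary = timedelta(minutes=5)      # canary duration at the time of the incident
new_canary = timedelta(minutes=15)     # minimum proposed in AI-2

print(canary_covers_metric_window(old_canary, metric_window))  # False
print(canary_covers_metric_window(new_canary, metric_window))  # True
```

Run as part of deploy configuration review, a check like this turns a lesson from the postmortem into a guard that does not depend on anyone remembering the incident.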
Common Confusion / Misconception
"Blameless means we ignore human actions." No. It means we examine human actions as inputs to a system that should have caught them. The question shifts from "who did the wrong thing" to "what made the wrong thing possible or easy." The action items should make the wrong thing harder to do next time.
"We did the postmortem, so we're done." The postmortem without action items, or with action items that never ship, is theater. Track action items like any other work. If you are consistently closing postmortems without implementing the actions, the review is cargo-cult.
"The root cause was X." Usually there is no single root cause; there are contributing factors. Name them all. The How Complex Systems Fail view is that catastrophes always require multiple small failures to line up; the review should find and label each.
"Only big incidents get postmortems." Near-misses (the alert fired but mitigation worked fast) and chaos-experiment failures also merit lightweight reviews. Each one is cheap learning material.
How To Use It
- Schedule the review fast. Within 3-5 days of resolution. Memory decays.
- Use the template. Fill every section. Empty sections usually mean skipped thinking.
- Build the timeline from data, not from memory. Chat logs, metrics, dashboards, deploy history. Attach screenshots.
- Distinguish root cause from contributing factors. List several factors; do not stop at the first human action.
- Action items are specific, owned, dated. "We should improve monitoring" is not an action item. "Add synthetic probe for checkout with 1m alert threshold, owned by A, due 2025-02-15" is.
- Publish widely. The review's value multiplies when teams who were not involved can learn from it. Internal postmortem archives are the organization's reliability memory.
- Close the loop. Review action-item completion quarterly. An action item open for six months is either wrong or de-prioritized; mark it one way or the other.
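The action-item steps above can be sketched as a small tracker. This is an illustration under stated assumptions, not a prescribed tool: the `ActionItem` fields, the helper names, the 182-day stand-in for "six months", and every ID and date except the 2025-02-15 due date are invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

SIX_MONTHS = timedelta(days=182)  # rough stand-in for "six months" (assumption)


@dataclass
class ActionItem:
    item_id: str
    description: str                  # should name a concrete change
    owner: str                        # "owned": required, so it cannot be omitted
    opened: date
    due: date                         # "dated": required, so it cannot be omitted
    closed_on: Optional[date] = None

    @property
    def is_open(self) -> bool:
        return self.closed_on is None


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if i.is_open and i.due < today]


def stale(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items that have stayed open for more than six months."""
    return [i for i in items if i.is_open and today - i.opened > SIX_MONTHS]


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of items that have been closed (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(not i.is_open for i in items) / len(items)


# Hypothetical review data: IDs, opened dates, and most due dates are invented.
today = date(2025, 3, 1)
items = [
    ActionItem("AI-1", "Add synthetic probe for checkout with 1m alert threshold",
               "A", date(2025, 1, 15), date(2025, 2, 15), closed_on=date(2025, 2, 10)),
    ActionItem("AI-2", "Extend canary window", "B",
               date(2025, 1, 15), date(2025, 1, 22)),
    ActionItem("AI-3", "One-click rollback script", "C",
               date(2024, 6, 1), date(2024, 6, 22)),
    ActionItem("AI-4", "Review p99 alert thresholds", "D",
               date(2025, 1, 15), date(2025, 4, 15)),
]

print([i.item_id for i in overdue(items, today)])  # ['AI-2', 'AI-3']
print([i.item_id for i in stale(items, today)])    # ['AI-3']
print(completion_rate(items))                      # 0.25
```

Making `owner` and `due` required fields enforces "owned" and "dated" structurally. Whether an item is specific still has to be judged by a reviewer reading the description.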
Check Yourself
- Why does a blameful postmortem make the organization less reliable over time?
- What is the difference between a root cause and a contributing factor?
- Name one specific way an action item can be well-formed and one way it can be vague.
Mini Drill or Application
Pick a public incident (Cloudflare, AWS, GitHub, GitLab, and Stripe have all published postmortems). Draft a one-page blameless postmortem for it using the template above. Identify at least three contributing factors and propose at least three specific, owned, dated action items.
Read This Only If Stuck
- Fundamentals of Software Architecture: Operations and DevOps
- Fundamentals of Software Architecture: Implicit Characteristics (reliability)
- Google SRE Book: Postmortem Culture - Learning from Failure
- Richard Cook: "How Complex Systems Fail"
- Etsy: "Blameless Post-Mortems and a Just Culture" (John Allspaw)