Postmortem — \<INCIDENT-TITLE>¶
Blameless. Action-item-driven. Published widely. The goal is that the same mistake does not happen twice. This template is the fuller version of
../runbooks/postmortem-template.mdwith explicit 5 Whys + structured action-item tracking, used for any P0/P1 incident per the severity matrix.
Metadata¶
- Incident ID: INC-YYYY-MM-DD-NNN
- Title: \<one line; the headline a non-engineer can read>
- Date / Time (UTC):
- First alert: YYYY-MM-DD HH:MM
- All-clear: YYYY-MM-DD HH:MM
- Duration of customer-visible impact: \<HH:MM>
- Severity: P0 | P1
- Affected tenants: \<list or "all">
- Affected regions: \<list>
- Affected services: \<list>
- Detection method: automated alert | customer report | internal observation | DR drill discovery
- Incident Commander (IC): \<name> @ \<team>
- On-call primary at time of incident: \<name>
- On-call secondary at time of incident: \<name>
- Recorder (timeline scribe): \<name>
- Author of this postmortem: \<name>
Summary (1 paragraph)¶
A 2-4 sentence executive summary. What happened, who was affected, what the root cause was, how we mitigated it. Read it in 30 seconds. The detail lives below. A non-engineer should be able to follow this paragraph.
Example shape:
Between HH:MM and HH:MM UTC on YYYY-MM-DD, \<feature> was \<degraded / unavailable> for \<affected scope>. Root cause was \<technical cause in plain language>. Customer impact: \<concrete numbers>. We mitigated by \<action> and prevented recurrence by \<action item>.
Customer impact¶
- Number of customers affected: \<exact count if known; estimate otherwise>.
- Tenants affected (named): \<list; mark which got proactive comm>.
- Revenue impact: \<estimated $ — orders lost, refunds issued, SLA credits owed>.
- Data impact: \<none | corrupted | lost — be specific; the postmortem reader needs to know if data is OK>.
- SLO impact: \<error budget consumed: how much of the monthly budget>.
- Status-page entry: \<URL>.
- Customer-comm artifacts: \<email IDs, status page updates, support tickets>.
Timeline (UTC)¶
The IC or recorder maintains this DURING the incident. Every entry has a UTC timestamp. Cite specifics: command output, Slack message links, PagerDuty incident ID. A postmortem written from a sparse timeline is guesswork; a postmortem written from a dense timeline writes itself.
| Time | Event | Source / link |
|---|---|---|
| HH:MM | \<first observation; what fired; who saw it first> | \<link> |
| HH:MM | \<PagerDuty page sent; receiver> | \<PD incident ID> |
| HH:MM | \<primary acked> | \<PD> |
| HH:MM | \<runbook opened; which one> | \<runbook link> |
| HH:MM | \<Step 1 (Check) — what was observed; PASS/FAIL> | |
| HH:MM | \<Step 2 (Diagnose) — branching decision; chosen path> | |
| HH:MM | \<incident escalated (if applicable); to whom; why> | |
| HH:MM | \<IC announced; war room opened (URL)> | |
| HH:MM | \<first status-page update posted> | \<status-page link> |
| HH:MM | \<Step 3 (Mitigate) — what was tried; outcome> | |
| HH:MM | \<additional mitigation tried; outcome> | |
| HH:MM | \<root cause identified (working hypothesis)> | |
| HH:MM | \<customer-visible impact ended (metrics returned to green)> | \<dashboard> |
| HH:MM | \<status-page update: monitoring> | |
| HH:MM | \<status-page update: resolved (after 15-min clean window)> | |
| HH:MM | \<all-clear declared by IC> | |
| HH:MM | \<incident channel closed; recorder finalized timeline> |
Root cause (5 Whys)¶
The technical chain of events that allowed the incident. We use 5 Whys because the surface explanation almost never names the system-level cause. Each "why" drives one level deeper. Stop when the answer is "we made a deliberate trade-off" or "the system was designed this way" — that is the true root.
Example application of 5 Whys:
Problem: Checkout returned 5xx for 28 minutes.
Why 1 — Why did checkout return 5xx? The payments-gateway pod was OOMKilled by the kernel.
Why 2 — Why was it OOMKilled? Memory usage spiked when a customer submitted a 50 MB cart payload that exceeded the gateway's in-memory parser buffer.
Why 3 — Why was there no size limit? The gateway's request-size limit was set to 64 MB (default) instead of a tighter 1 MB; the 1 MB limit existed in the storefront but was not propagated to the gateway.
Why 4 — Why was the limit not propagated? The two services use independent configurations; the limits drifted when the gateway was rewritten in Q1 and the storefront's limit was never copied over.
Why 5 — Why didn't we catch the drift? We have no configuration-consistency check across services with the same logical contract. Code reviews focus on per-service changes; cross-service invariants are invisible.
Root cause: No mechanism to enforce cross-service configuration invariants. The OOMKill is a symptom; the missing invariant check is the system-level cause.
Use this structure. Stop at the deepest "why" that admits an action item.
-
Why 1 — \<surface question> \<answer>
-
Why 2 — \<deeper question> \<answer>
-
Why 3 — ...
-
Why 4 — ...
-
Why 5 — ...
-
Root cause statement: \<one or two sentences naming the system-level cause>
Contributing factors¶
The 5-Whys identifies the primary cause. Contributing factors are the secondary system gaps that allowed the primary cause to become an incident.
- Contributing factor 1: \<e.g. monitoring gap — the alert that should have fired earlier didn't, because…>.
- Contributing factor 2: \<e.g. runbook gap — the runbook for this alert was last updated 18 months ago and the dashboard panel had moved>.
- Contributing factor 3: \<e.g. communication gap — the IC was not named for the first 12 minutes>.
Why we did not catch this earlier¶
This is the most important section. Every incident is a monitoring failure. If the alert had fired sooner, the incident would have been shorter or smaller. Name the monitoring gap:
- The alert that should have caught this: \<alert name, why it didn't fire — wrong query, wrong threshold, wrong service, no alert exists>.
- The dashboard panel that would have shown this: \<which panel, why it wasn't checked — not in the runbook, not on the on-call's standard dashboard set>.
- The chaos experiment that would have surfaced this: \<which experiment, why we don't run it — not in the workflow, scope too narrow>.
What went right¶
Specific things to celebrate. Public acknowledgment of what worked is as important as naming what didn't — it reinforces the resilience patterns the team should keep investing in.
- \<the runbook step that resolved the page in 2 minutes>.
- \<the alert that fired correctly within 30 seconds of the threshold breach>.
- \<the chaos game-day rehearsal three weeks ago that meant the team knew exactly which command to run>.
- \<the PDB / replicaCount / circuit breaker that contained the blast radius>.
What went wrong¶
Specific gaps. Systems, not people. Every human action a responder took was the rational choice given the information available at the time. The question is: "what system let that choice be the most rational one?"
- \<the runbook did not cover the case where X>.
- \<the alert fired but routed to the wrong PagerDuty service>.
- \<the dashboard panel was broken — broken for 6 weeks, nobody noticed>.
- \<the IC was not named for 12 minutes; the on-call primary tried to debug AND communicate AND coordinate, doing all three poorly>.
Action items¶
Each item: a clear deliverable + an explicit owner + a tracking ticket + a due date. No item without all four. Items missing any field are the postmortem reviewer's job to fix before the doc is published.
| ID | Description | Owner | Ticket | Due | Priority | Status |
|---|---|---|---|---|---|---|
| A1 | \<add the alert that should have caught this> | \<eng-name> | \<JIRA-PROJ-NNN> | YYYY-MM-DD | P1 | open |
| A2 | \<update runbook X step 3 to clarify command> | \<eng-name> | \<ticket> | YYYY-MM-DD | P2 | open |
| A3 | \<add chaos experiment for this failure mode> | \<eng-name> | \<ticket> | YYYY-MM-DD | P3 | open |
| A4 | \<write a defender for the cross-service invariant> | \<eng-name> | \<ticket> | YYYY-MM-DD | P1 | open |
| A5 | \<add status-page automation to reduce IC overhead> | \<eng-name> | \<ticket> | YYYY-MM-DD | P2 | open |
Priority guide: - P1 action item: prevents recurrence; ship this sprint. - P2 action item: reduces the impact of recurrence; ship next sprint. - P3 action item: monitoring / runbook / documentation improvement; ship this quarter.
Target: > 80 % action-item closure rate by the postmortem's quarterly review (see ch.15.11 — postmortem review). The platform lead tracks the closure rate as a team metric.
Lessons¶
A short list of takeaways. NOT prescriptive ("we must do X") — those are in the action items. The lessons section is for the insights that emerged from the incident.
- \<the PDB held; the resilience pattern worked as designed>.
- \<the runbook's diagnose tree is missing the case where Stripe returns 502 instead of 500>.
- \<we do not have a chaos experiment for the saga-compensation flow under partial-failure>.
- \<the IC role wasn't claimed for 12 minutes; we need an automated "you are the IC" Slack ping when a P0 fires>.
Related artefacts¶
- Runbooks consulted: \<links to the runbook files used during triage>.
- Dashboards consulted: \<links to Grafana dashboards>.
- Alerts that fired: \<list of alert names + their PrometheusRule paths>.
- PagerDuty incident: \<PD URL>.
- Incident Slack channel: \<#inc-YYYY-MM-DD-NNN URL>.
- War-room Zoom recording: \<URL if recorded>.
- Commits / PRs related: \<the change that triggered the incident; the fix; the action-item PRs>.
- Chaos experiments related: \<if this incident inspired or relates to a chaos experiment>.
Customer communication¶
- Status-page entries:
- \<URL of initial entry>
- \<URL of update 1>
- \<URL of resolution>
- Email to affected tenants: \<draft + final + recipients + send time>.
- Support tickets opened by customers: \<count + IDs>.
- Public-facing postmortem published? Y / N — link if yes.
Publication checklist¶
- Posted in
#bookstore-platform-postmortemsSlack channel. - Linked in the platform's GitHub Wiki /
docs/postmortems/. - Filed in the platform's GitHub Project board (postmortem-tracking).
- Action items have owners + due dates + tracking tickets.
- Scheduled for discussion at the next platform all-hands.
- Public-facing version (sanitized) posted to status page if P0.
- Customer-success notified if any tenant requires direct comm.
Sign-off¶
| Role | Name | Date |
|---|---|---|
| Incident Commander | \<name> | YYYY-MM-DD |
| Platform Lead | \<name> | YYYY-MM-DD |
| Eng Director | \<name> | YYYY-MM-DD |
| (P0 only) CTO | \<name> | YYYY-MM-DD |
A note on blamelessness¶
This document names systems, not people. Every action a human took was the rational choice given the information they had at the time. If a human action contributed to the incident, the question is "what system let that choice be the most rational one?" — never "who do we blame?".
The action items are about the system. Never about a person. A postmortem that recommends "Alice should be more careful" is a broken postmortem; the corresponding fix is "Alice should have a tool / check / review that makes carelessness impossible."
A note on the 5 Whys¶
The 5 Whys is a tool, not a rule. Sometimes you stop at 3 Whys because the root is clear. Sometimes you go to 7 Whys because the chain is deep. The discipline is: keep asking until the answer is a system-level truth, not a surface symptom. The number "5" comes from Toyota's original method; the spirit is "deep enough to act on, not so deep that the action item is too abstract to ship."