incident/ - Incident response artefacts
The files in this directory operationalize the incident-response lifecycle: detect → triage → resolve → learn. Each file maps to one phase and the role that owns it. Chapters 15.10 and 15.11 are the walkthroughs; the files here are the artefacts you run.
The four phases
The lifecycle has four phases. Confusing them is the most common incident-response failure mode in our experience.
| Phase | Goal | Owner | Time scale | Artefact in this dir |
|---|---|---|---|---|
| Detect | Discover the issue exists | Alertmanager + PD | seconds | pagerduty-integration.yaml |
| Triage | Decide severity + claim IC | On-call primary + IC | first 5 min | severity-matrix.md |
| Resolve | Stop the customer-visible impact | On-call + responders | 5 min - 4 hrs | incident-channel-bot-config.md |
| Learn | Prevent the same incident recurring | IC + author + team | days to weeks | postmortem-template.md + sample-postmortem-2026-04-15.md |
Detect is automated; humans don't show up here unless the page fires. Triage is where the on-call enters the loop. Resolve is where the team enters the loop. Learn is where the organization enters the loop.
The most common mistake: jumping from Detect straight to Resolve, skipping Triage. The page fires; the on-call starts debugging; nobody declares the severity, nobody opens a status-page entry, nobody claims the IC role. By the time anyone realizes "this is a P0," 20 minutes have passed and the customer-comm SLA is breached. Triage is 5 minutes; do not skip it.
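The skipped-Triage failure can be made checkable. A minimal sketch, assuming the three triage steps named above (severity declared, IC claimed, status-page entry opened) each leave a timestamp somewhere; the `IncidentTimeline` type and its field names are hypothetical and not part of any file in this directory:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# "first 5 min" from the phase table above.
TRIAGE_WINDOW = timedelta(minutes=5)


@dataclass
class IncidentTimeline:
    """Hypothetical record of when each triage step happened (None = not yet)."""
    paged_at: datetime
    severity_declared_at: Optional[datetime] = None
    ic_claimed_at: Optional[datetime] = None
    status_page_opened_at: Optional[datetime] = None


def missing_triage_steps(t: IncidentTimeline, now: datetime) -> list:
    """Return the triage steps that were not done within the window after the page."""
    deadline = t.paged_at + TRIAGE_WINDOW
    if now < deadline:
        return []  # still inside the window; nothing is overdue yet
    steps = {
        "declare severity": t.severity_declared_at,
        "claim IC": t.ic_claimed_at,
        "open status-page entry": t.status_page_opened_at,
    }
    return [name for name, done_at in steps.items()
            if done_at is None or done_at > deadline]
```

A bot could run a check like this at the 5-minute mark and post whatever is still missing into the incident channel.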
The files
incident/
├── README.md # this file
├── severity-matrix.md # P0/P1/P2/P3 + response SLAs
├── pagerduty-integration.yaml # Alertmanager + AlertmanagerConfig
├── incident-channel-bot-config.md # PD -> Slack -> Zoom -> Statuspage
├── oncall-handoff-template.md # weekly Monday handoff doc
├── postmortem-template.md # blameless postmortem (with 5 Whys)
└── sample-postmortem-2026-04-15.md # fully worked example
severity-matrix.md (Triage)
The P0/P1/P2/P3 ladder. Defines the response-time SLA, customer-comm requirements, and postmortem rules for each severity. The question the on-call must answer in the first 5 minutes is "what severity is this?"; this doc has the answer.
pagerduty-integration.yaml (Detect)
The Alertmanager AlertmanagerConfig that routes alerts from Prometheus to PagerDuty. Splits P0/P1/P2/P3 onto separate PagerDuty services with different escalation policies. Includes:
- Route tree (severity-based)
- Inhibition rules (suppress lower-sev when higher-sev fires)
- Receiver configuration per severity
- A sample PrometheusRule showing the required labels + annotations every alert must have
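To make the routing and inhibition behaviour concrete, here is a small Python model of it. This is a sketch of the logic only, not the Alertmanager configuration itself. Assumptions: the severity label takes the values P0 to P3, inhibition only applies between alerts that share a `service` label, the receiver names are hypothetical, and the only required fields checked are the `severity` label and the `runbook_url` annotation (the YAML file is the authority on the full required set).

```python
SEVERITY_ORDER = ["P0", "P1", "P2", "P3"]  # highest severity first

# Hypothetical receiver names, one PagerDuty service per severity.
RECEIVERS = {sev: "pagerduty-" + sev.lower() for sev in SEVERITY_ORDER}


def validate(alert: dict) -> list:
    """Return problems with an alert's labels/annotations (empty list = OK)."""
    problems = []
    if alert.get("labels", {}).get("severity") not in SEVERITY_ORDER:
        problems.append("labels.severity must be one of P0-P3")
    if not alert.get("annotations", {}).get("runbook_url"):
        problems.append("annotations.runbook_url is required")
    return problems


def inhibited(alert: dict, firing: list) -> bool:
    """True if a higher-severity alert for the same service is also firing."""
    rank = SEVERITY_ORDER.index(alert["labels"]["severity"])
    service = alert["labels"].get("service")
    return any(
        SEVERITY_ORDER.index(other["labels"]["severity"]) < rank
        and other["labels"].get("service") == service
        for other in firing
    )


def route(firing: list) -> dict:
    """Map receiver name -> alert names, after applying inhibition."""
    routed = {}
    for alert in firing:
        if inhibited(alert, firing):
            continue  # suppressed by a higher-severity alert
        receiver = RECEIVERS[alert["labels"]["severity"]]
        routed.setdefault(receiver, []).append(alert["labels"]["alertname"])
    return routed
```

With a P0 and a P2 firing for the same service, only the P0 pages; a P2 on an unrelated service still reaches its own receiver.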
incident-channel-bot-config.md (Resolve)
How to wire PagerDuty -> Slack channel auto-creation -> Zoom bridge -> status-page update. Discusses three off-the-shelf platforms (Incident.io, FireHydrant, Rootly) without prescribing one, and describes the minimum-viable wire-up if you can't yet adopt a paid tool. The IC-claim flow is the most-overlooked piece; it gets its own section.
oncall-handoff-template.md (Resolve, continuity)
The Monday 10:00 UTC handoff doc. Closes the "the incoming on-call didn't know about the open issue from last week" anti-pattern. Structured sections: open incidents, recent postmortems, planned changes, known flapping alerts.
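If the handoff doc is generated rather than copied by hand, the schedule and the four sections are easy to encode. A minimal sketch: the function names and the plain-text output are assumptions, and the real template file remains the source of truth for layout.

```python
from datetime import datetime, timedelta, timezone

# The four structured sections named above.
SECTIONS = ["Open incidents", "Recent postmortems",
            "Planned changes", "Known flapping alerts"]


def next_handoff(now: datetime) -> datetime:
    """Return the next Monday 10:00 UTC strictly after `now` (timezone-aware)."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=10, minute=0, second=0, microsecond=0)
    # Monday is weekday 0; move forward to the coming Monday (0 days if today).
    candidate += timedelta(days=(7 - candidate.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate


def handoff_skeleton(when: datetime) -> str:
    """Render an empty handoff doc with one placeholder bullet per section."""
    lines = [f"On-call handoff {when:%Y-%m-%d %H:%M} UTC", ""]
    for title in SECTIONS:
        lines += [title, "- (none)", ""]
    return "\n".join(lines)
```

Rendering an explicit "(none)" under each section forces the outgoing on-call to state that a section is empty instead of silently skipping it.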
postmortem-template.md (Learn)
The blameless postmortem template. Sections: metadata, summary, customer impact, timeline, root cause (5 Whys), contributing factors, what went right, what went wrong, action items, lessons, related artefacts, customer communication, publication checklist, sign-off. The 5 Whys is explicit; the action-item table requires owner + ticket + due date + priority for every item.
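The four-field rule for action items is simple enough to lint before a postmortem is published. A minimal sketch, assuming action items have been parsed into objects; the `ActionItem` type is hypothetical, and only the four required field names come from the template description above.

```python
from dataclasses import dataclass
from typing import Optional

# Every action item needs all four of these populated.
REQUIRED_FIELDS = ("owner", "ticket", "due_date", "priority")


@dataclass
class ActionItem:
    """Hypothetical parsed row of the action-item table."""
    description: str
    owner: Optional[str] = None
    ticket: Optional[str] = None
    due_date: Optional[str] = None  # e.g. an ISO date string
    priority: Optional[str] = None


def missing_fields(item: ActionItem) -> list:
    """Return the required fields that are empty or unset on this item."""
    return [name for name in REQUIRED_FIELDS if not getattr(item, name)]


def incomplete_items(items: list) -> dict:
    """Map description -> missing fields, for items that fail the rule."""
    return {i.description: missing_fields(i) for i in items if missing_fields(i)}
```

An empty result from `incomplete_items` is a reasonable gate on the publication checklist.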
sample-postmortem-2026-04-15.md (Learn, by example)
A fully worked sample postmortem: 39-min P0 checkout outage during the Spring Sale flash promotion. Shows the discipline the template asks for — dense timeline, 5 Whys all the way to a system-level cause, 8 action items with all 4 fields populated, what-went-right as long as what-went-wrong. Teaching material; read it once before writing your first real postmortem.
How this fits with the rest of the platform
- Detect: alerts fire from PrometheusRules (which live in examples/bookstore-platform/observability/) and are routed by Alertmanager (config in pagerduty-integration.yaml here).
- Triage: the runbook for every alert lives in ../runbooks/. The on-call opens the runbook from the PagerDuty runbook_url annotation.
- Resolve: the incident channel is auto-created via the wire-up in incident-channel-bot-config.md. The on-call walks the runbook's 5-section structure (Alert / Check / Diagnose / Mitigate / Postmortem).
- Learn: the postmortem is written from the timeline and its action items are tracked. The platform lead reviews closure rate monthly (chapter 15.11).
- Handoff: the Monday handoff doc (template here) closes the weekly continuity gap.
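The 5-section runbook structure mentioned under Resolve can be checked mechanically. A minimal sketch, assuming each runbook is a markdown file whose section titles appear as `#`-style headings with exactly these names; the heading format is an assumption, so adjust the pattern to whatever ../runbooks/ actually uses.

```python
import re

# Section order the on-call walks during an incident.
RUNBOOK_SECTIONS = ["Alert", "Check", "Diagnose", "Mitigate", "Postmortem"]


def runbook_section_problems(markdown_text: str) -> list:
    """Return problems with a runbook's section structure (empty list = OK)."""
    headings = [m.group(1).strip()
                for m in re.finditer(r"^#{1,6}\s+(.+)$", markdown_text, re.MULTILINE)]
    found = [h for h in headings if h in RUNBOOK_SECTIONS]
    problems = ["missing section: " + s for s in RUNBOOK_SECTIONS if s not in found]
    if not problems and found != RUNBOOK_SECTIONS:
        problems.append("sections out of order or duplicated")
    return problems
```

Running this over every runbook in CI keeps the structure the on-call relies on from drifting.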
What this does NOT cover
This directory covers the incident side of operations. It does NOT cover:
- Day-to-day operational reviews (cost, capacity, scaling, on-call metrics): covered in chapter 15.11.
- Proactive practices (chaos game-days, DR drills): covered in ../runbooks/ and chapter 13.12.
- Change management (PR-to-prod lifecycle): covered in chapters 15.01-15.09.
The four phases in this directory are the reactive side of production operations; the proactive side is "do the work to make incidents rarer in the first place."
Maturity ladder
The four phases mature at different rates. A typical team progresses:
- First 30 days: Detect works (alerts route to PD); Triage is informal (Slack DM); Resolve is heroic (whoever's online); Learn is "we'll talk about it in the next standup."
- First 90 days: Detect well-tuned (alert hygiene review ongoing); Triage has the severity matrix; Resolve has a runbook per alert; Learn ships the first 3-5 postmortems.
- First 12 months: Detect runs in the green (page rate steady below 2/shift); Triage automated (incident channel auto-creates; IC claim prompt); Resolve has 80 % action-item closure; Learn is a habit (postmortems published within 5 days at >90 % rate).
- 2+ years: Detect has shadow-traffic + chaos coverage; Triage needs almost no human decisions (severity auto-assigned from the alert + service tier); Resolve is mostly auto-remediation for the common cases; Learn drives the platform roadmap (action items become the next quarter's features).
Counting the rungs above as levels 1 to 4, the bookstore platform v2 ships level 2 (the 90-day rung) and points at level 3 (the 12-month rung). Level 4 (2+ years) is where Netflix/Google live; we name it honestly as the graduation goal.
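The 12-month rung is the only one stated in numbers, so it can be checked directly. A minimal sketch, with assumptions: "5 days" is read as calendar days, "80 %" as at least 0.80, and the function names and input shapes are hypothetical.

```python
from datetime import date


def on_time_rate(postmortems: list, max_days: int = 5) -> float:
    """Fraction of postmortems published within `max_days` of the incident.

    `postmortems` is a list of (incident_date, published_date_or_None) pairs.
    """
    if not postmortems:
        return 0.0
    on_time = sum(
        1 for incident, published in postmortems
        if published is not None and (published - incident).days <= max_days
    )
    return on_time / len(postmortems)


def meets_twelve_month_rung(pages_per_shift: list,
                            closure_rate: float,
                            postmortems: list) -> bool:
    """Check the three numeric thresholds from the 12-month rung above."""
    if not pages_per_shift:
        return False  # no data; cannot claim the rung
    avg_pages = sum(pages_per_shift) / len(pages_per_shift)
    return (avg_pages < 2                              # page rate below 2/shift
            and closure_rate >= 0.80                   # 80 % action-item closure
            and on_time_rate(postmortems) > 0.90)      # >90 % within 5 days
```

The monthly review by the platform lead is a natural place to compute these numbers.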