`incident/`: Incident response artefacts

The files in this directory operationalize the incident-response lifecycle: detect → triage → resolve → learn. Each file maps to one phase and the role that owns it. Chapters 15.10 and 15.11 are the walkthroughs; the files here are the artefacts you run.

The four phases

The lifecycle has four phases. Confusing them is the most common incident-response failure mode in our experience.

| Phase | Goal | Owner | Time scale | Artefact in this dir |
|---|---|---|---|---|
| Detect | Discover the issue exists | Alertmanager + PD | seconds | `pagerduty-integration.yaml` |
| Triage | Decide severity + claim IC | On-call primary + IC | first 5 min | `severity-matrix.md` |
| Resolve | Stop the customer-visible impact | On-call + responders | 5 min to 4 hrs | `incident-channel-bot-config.md` |
| Learn | Prevent the same incident recurring | IC + author + team | days to weeks | `postmortem-template.md` + `sample-postmortem-2026-04-15.md` |

Detect is automated; no human is involved until the page fires. Triage is where the on-call enters the loop. Resolve is where the team enters the loop. Learn is where the organization enters the loop.

The most common mistake: jumping from Detect straight to Resolve, skipping Triage. The page fires; the on-call starts debugging; nobody declares the severity, nobody opens a status-page entry, nobody claims the IC role. By the time anyone realizes "this is a P0," 20 minutes have passed and the customer-comm SLA is breached. Triage is 5 minutes; do not skip it.
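The three things that get skipped (severity, IC, status page) can be tracked as an explicit checklist. The following is a minimal sketch, not part of the shipped artefacts: `TriageState` and `missing_triage_steps` are hypothetical names, and it assumes every incident needs a status-page entry, which the real `severity-matrix.md` may scope to higher severities only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TriageState:
    """What has been decided so far for one incident (hypothetical model)."""
    severity: Optional[str] = None            # e.g. "P0".."P3"
    incident_commander: Optional[str] = None  # who claimed the IC role
    status_page_opened: bool = False


def missing_triage_steps(state: TriageState) -> List[str]:
    """Return the triage steps that have not been completed yet."""
    missing = []
    if state.severity is None:
        missing.append("declare severity")
    if state.incident_commander is None:
        missing.append("claim IC")
    if not state.status_page_opened:
        missing.append("open status-page entry")
    return missing


if __name__ == "__main__":
    # The failure mode described above: debugging has started, triage has not.
    print(missing_triage_steps(TriageState()))
```

An empty result means triage is complete and the on-call can move on to Resolve.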

The files

incident/
├── README.md                            # this file
├── severity-matrix.md                   # P0/P1/P2/P3 + response SLAs
├── pagerduty-integration.yaml           # Alertmanager + AlertmanagerConfig
├── incident-channel-bot-config.md       # PD -> Slack -> Zoom -> Statuspage
├── oncall-handoff-template.md           # weekly Monday handoff doc
├── postmortem-template.md               # blameless postmortem (with 5 Whys)
└── sample-postmortem-2026-04-15.md      # fully worked example

`severity-matrix.md` (Triage)

The P0/P1/P2/P3 ladder. Defines the response-time SLA, customer-comm requirements, and postmortem rules for each severity. The first question the on-call must answer, within the first 5 minutes, is "what severity is this?"; this doc has the answer.

`pagerduty-integration.yaml` (Detect)

The Alertmanager `AlertmanagerConfig` that routes alerts from Prometheus to PagerDuty. It splits P0/P1/P2/P3 onto separate PagerDuty services with different escalation policies. Includes:

- Route tree (severity-based)
- Inhibition rules (suppress lower-sev when higher-sev fires)
- Receiver configuration per severity
- A sample `PrometheusRule` showing the required labels + annotations every alert must have
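To make the inhibition behaviour concrete, here is a small Python model of "suppress lower-sev when higher-sev fires". It is an illustration of the idea only, not Alertmanager's implementation and not the contents of the YAML file. The `severity` and `service` label names, and the choice to scope inhibition to alerts sharing the same `service`, are assumptions made for this sketch.

```python
from typing import Dict, List, Sequence

# Lower rank = more severe. Label values are assumed to be "P0".."P3".
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}

Alert = Dict[str, str]  # an alert is modelled as its label set


def is_inhibited(target: Alert, firing: Sequence[Alert],
                 equal: Sequence[str] = ("service",)) -> bool:
    """True if a more severe alert with the same `equal` labels is firing."""
    for source in firing:
        if source is target:
            continue
        same_scope = all(source.get(k) == target.get(k) for k in equal)
        more_severe = (SEVERITY_RANK[source["severity"]]
                       < SEVERITY_RANK[target["severity"]])
        if same_scope and more_severe:
            return True
    return False


def alerts_to_notify(firing: Sequence[Alert]) -> List[Alert]:
    """Drop every alert that is inhibited by a more severe one."""
    return [a for a in firing if not is_inhibited(a, firing)]


if __name__ == "__main__":
    firing = [
        {"alertname": "CheckoutDown", "severity": "P0", "service": "checkout"},
        {"alertname": "CheckoutLatencyHigh", "severity": "P2", "service": "checkout"},
        {"alertname": "SearchLatencyHigh", "severity": "P2", "service": "search"},
    ]
    for alert in alerts_to_notify(firing):
        print(alert["alertname"])
```

In this model the P2 latency alert on `checkout` is suppressed while the P0 on the same service fires, but the P2 on `search` still notifies because it belongs to a different service.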

`incident-channel-bot-config.md` (Resolve)

How to wire PagerDuty -> Slack channel auto-creation -> Zoom bridge -> status-page update. Discusses the three off-the-shelf platforms (Incident.io, FireHydrant, Rootly) without prescribing one, and describes the minimum-viable wire-up if you can't yet adopt a paid tool. The IC-claim flow is the most-overlooked piece; it gets its own section.

`oncall-handoff-template.md` (Resolve, continuity)

The Monday 10:00 UTC handoff doc. Closes the "the incoming on-call didn't know about the open issue from last week" anti-pattern. Structured sections: open incidents, recent postmortems, planned changes, known flapping alerts.

`postmortem-template.md` (Learn)

The blameless postmortem template. Sections: metadata, summary, customer impact, timeline, root cause (5 Whys), contributing factors, what went right, what went wrong, action items, lessons, related artefacts, customer communication, publication checklist, sign-off. The 5 Whys is explicit; the action-item table requires owner + ticket + due date + priority for every item.
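The four-field rule for action items is easy to check mechanically before a postmortem is published. A minimal sketch follows; `ActionItem`, `missing_fields`, and `incomplete_items` are hypothetical names for illustration, and the ticket ID and date in the example are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

# The four fields the template requires for every action item.
REQUIRED_FIELDS = ("owner", "ticket", "due_date", "priority")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    ticket: Optional[str] = None
    due_date: Optional[str] = None   # ISO date string, e.g. "2026-05-01"
    priority: Optional[str] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Names of required fields that are empty or unset on this item."""
    return [name for name in REQUIRED_FIELDS if not getattr(item, name)]


def incomplete_items(items: Sequence[ActionItem]) -> Dict[str, List[str]]:
    """Map each incomplete item's description to its missing fields."""
    report = {}
    for item in items:
        missing = missing_fields(item)
        if missing:
            report[item.description] = missing
    return report


if __name__ == "__main__":
    items = [
        ActionItem("Add load test for promo traffic", owner="dana",
                   ticket="OPS-0000", due_date="2026-05-01", priority="high"),
        ActionItem("Review connection pool limits", owner="sam"),
    ]
    print(incomplete_items(items))
```

An empty report is the bar the publication checklist is asking for: no action item leaves review without an owner, a ticket, a due date, and a priority.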

`sample-postmortem-2026-04-15.md` (Learn, by example)

A fully worked sample postmortem: 39-min P0 checkout outage during the Spring Sale flash promotion. Shows the discipline the template asks for — dense timeline, 5 Whys all the way to a system-level cause, 8 action items with all 4 fields populated, what-went-right as long as what-went-wrong. Teaching material; read it once before writing your first real postmortem.

How this fits with the rest of the platform

- Detect: alerts fire from PrometheusRules (which live in `examples/bookstore-platform/observability/`) and are routed by Alertmanager (config in `pagerduty-integration.yaml` here).
- Triage: the runbook for every alert lives in `../runbooks/`. The on-call opens the runbook from the `runbook_url` annotation shown in PagerDuty.
- Resolve: the incident channel is auto-created via the wire-up in `incident-channel-bot-config.md`. The on-call walks the runbook's 5-section structure (Alert / Check / Diagnose / Mitigate / Postmortem).
- Learn: the postmortem is written from the timeline, and action items are tracked. The platform lead reviews closure rate monthly (chapter 15.11).
- Handoff: the Monday handoff doc (template here) closes the weekly continuity gap.
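Because Triage depends on the `runbook_url` annotation and routing depends on severity, it can help to lint alert rules before they ship. The sketch below is an assumption-laden illustration: the authoritative list of required labels and annotations is the sample `PrometheusRule` in `pagerduty-integration.yaml`, and this example only assumes a `severity` label and a `runbook_url` annotation. The rule is modelled as a plain dict, and the URL is a placeholder.

```python
from typing import Any, Dict, List

# Assumed minimum set; extend to match the sample PrometheusRule.
REQUIRED_LABELS = ("severity",)
REQUIRED_ANNOTATIONS = ("runbook_url",)


def rule_problems(rule: Dict[str, Any]) -> List[str]:
    """List the required labels/annotations missing from one alert rule."""
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}
    problems = []
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing label: {key}")
    for key in REQUIRED_ANNOTATIONS:
        if not annotations.get(key):
            problems.append(f"missing annotation: {key}")
    return problems


if __name__ == "__main__":
    rule = {
        "alert": "CheckoutDown",
        "labels": {"severity": "P0"},
        "annotations": {"summary": "Checkout is failing"},
    }
    print(rule_problems(rule))
```

A rule without `runbook_url` would page the on-call with nothing to open, which is exactly the gap this check is meant to catch.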

What this does NOT cover

This directory covers the incident side of operations. It does NOT cover:

- Day-to-day operational reviews (cost, capacity, scaling, on-call metrics): covered in chapter 15.11.
- Proactive practices (chaos game-days, DR drills): covered in `../runbooks/` and chapter 13.12.
- Change management (PR-to-prod lifecycle): covered in chapters 15.01-15.09.

The four phases in this directory are the reactive side of production operations; the proactive side is "do the work to make incidents rarer in the first place."

Maturity ladder

The four phases mature at different rates. A typical team progresses:

1. Level 1, first 30 days: Detect works (alerts route to PD); Triage is informal (Slack DM); Resolve is heroic (whoever's online); Learn is "we'll talk about it in the next standup."
2. Level 2, first 90 days: Detect is well-tuned (alert hygiene review ongoing); Triage has the severity matrix; Resolve has a runbook per alert; Learn ships the first 3-5 postmortems.
3. Level 3, first 12 months: Detect runs in the green (page rate steady below 2/shift); Triage is automated (incident channel auto-creates; IC claim prompt); Resolve has 80% action-item closure; Learn is a habit (postmortems published within 5 days at >90% rate).
4. Level 4, 2+ years: Detect has shadow-traffic + chaos coverage; Triage needs almost no human decisions (severity auto-assigned from the alert + service tier); Resolve is mostly auto-remediation for the common cases; Learn drives the platform roadmap (action items become the next quarter's features).
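The 12-month targets above (80% action-item closure, more than 90% of postmortems published within 5 days) are simple ratios, and a rough sketch of how they might be computed follows. The function names are hypothetical, and the sketch assumes the 5-day window is counted from the incident's resolution date, which the text does not specify.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def closure_rate(closed: int, total: int) -> float:
    """Fraction of action items that are closed (0.0 when there are none)."""
    return closed / total if total else 0.0


def on_time_publication_rate(
    postmortems: Iterable[Tuple[date, Optional[date]]],
    max_days: int = 5,
) -> float:
    """Fraction of postmortems published within `max_days` of resolution.

    Each entry is (resolved_on, published_on); published_on is None when
    the postmortem has not been published yet, which counts as late.
    """
    pairs = list(postmortems)
    if not pairs:
        return 0.0
    on_time = sum(
        1
        for resolved, published in pairs
        if published is not None and (published - resolved).days <= max_days
    )
    return on_time / len(pairs)


def meets_twelve_month_targets(closure: float, publication: float) -> bool:
    """80% closure or better, and strictly more than 90% published on time."""
    return closure >= 0.80 and publication > 0.90


if __name__ == "__main__":
    history = [
        (date(2026, 4, 15), date(2026, 4, 18)),  # 3 days: on time
        (date(2026, 4, 1), date(2026, 4, 10)),   # 9 days: late
        (date(2026, 3, 1), None),                # never published: late
    ]
    print(closure_rate(8, 10), on_time_publication_rate(history))
```

Counting unpublished postmortems as late is a deliberate choice in this sketch; leaving them out would let the rate look healthy while the backlog grows.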

The bookstore platform v2 ships level 2 and points at level 3. Level 4 is where Netflix/Google live; we name it honestly as the graduation goal.
