Incident severity matrix — Bookstore Platform v2¶
The contract between the platform team, the on-call rotation, and the business. Every alert that fires has a severity. Every severity has a response-time SLA + a customer-communication requirement + a postmortem rule. This matrix is the document the on-call reads first; everything else flows from here.
The matrix is the source of truth for Alertmanager routing
(incident/pagerduty-integration.yaml), PagerDuty escalation policies,
and the postmortem template's Severity field.
Severity ladder¶
(War room = synchronous video bridge — Zoom/Meet/Slack huddle — opened for P0 incidents.)
| Sev | Customer impact | Page? | Customer comm | War room | Postmortem | Examples |
|---|---|---|---|---|---|---|
| P0 | >=50 % of customers down OR data loss OR safety / regulatory | PagerDuty + phone call | 15 min on status page + email | within 30 min | mandatory; published within 5 business days | full-platform outage; checkout broken; tenant isolation breach; data corruption confirmed |
| P1 | single region down OR feature broken for ALL tenants OR tenant isolation suspected | PagerDuty page | 1 hour on status page | within 1 hour if not resolved | required within 5 business days | us-east down; search returning empty; recommendations down; webhook delivery >30 min lagged |
| P2 | single tenant affected OR workaround exists OR error rate elevated but bounded | PagerDuty low-urgency page (business hours) | on request | no | optional; team discretion | tenant X's search index lagging; tenant Y over budget warning; payments retries elevated |
| P3 | cosmetic OR internal-only OR maintenance reminder | Slack only | none | no | no | dashboard panel broken; backup retention warning; alert hygiene reminder |
>=50 % is the formal P0 threshold. In practice, any payments-cluster-wide
failure or any tenant-isolation breach is P0 regardless of the percentage,
because the regulatory exposure (PCI for payments; SOC 2 for isolation) is
non-negotiable.
Response-time SLA¶
| Sev | Acknowledge | Mitigate (customer-visible impact ends) | Postmortem published | All-clear declared by |
|---|---|---|---|---|
| P0 | 5 min | 30 min | 5 business days | Incident Commander |
| P1 | 15 min | 4 hours | 5 business days | On-call primary |
| P2 | 1 business day | next business day | n/a | On-call primary |
| P3 | next sprint planning | next sprint | n/a | Team owner |
The clock starts when PagerDuty pages. Acknowledge is the PD ack button —
not a Slack message; not a "saw it on my phone"; the literal API ack that
stops the escalation timer. Mitigate is when the customer-visible
impact is over; the root-cause fix can wait. Postmortem published means
posted in #bookstore-platform-postmortems, linked in the GitHub Wiki, and
action-items filed.
What "page" means¶
| Page channel | Severity routed | Quiet hours? | Escalation |
|---|---|---|---|
| PagerDuty high-urgency | P0 | never | primary -> secondary at 5 min -> platform lead at 15 min -> CTO at 30 min |
| PagerDuty page | P1 | never | primary -> secondary at 15 min -> platform lead at 1 hour |
| PagerDuty low-urgency | P2 | yes (suppress 22:00-06:00 local) | primary only; no escalation |
Slack #bookstore-alerts |
P3 | yes | none |
Quiet hours suppression applies only to P2 (the on-call's SLA is "next business day" anyway). P0 and P1 page at all hours; that is what on-call means.
Customer-communication requirements¶
Every P0 and P1 incident has a customer-communication artifact:
- Status page at
status.bookstore-platform.example.com— public, Statuspage / Statusgator / Cachet / Atlassian Statuspage style. Updated by the Incident Commander (P0) or on-call primary (P1). - Email to affected tenants — for P0 only; the IC drafts; legal + customer-success sign off; sent within 1 hour of the all-clear.
- Postmortem published — the public-facing version of the postmortem (sanitized of internal vendor names + dollar figures) goes on the status page within 7 business days of the incident.
The two customer-comm anti-patterns we explicitly forbid:
- The "silent fix" — the team rolls back, customer impact ends, and no status-page entry is ever created. Forbidden. Every P0/P1 must have a public-facing artifact; the audit trail is non-negotiable.
- The "all clear too soon" — a status-page entry that goes to "resolved" while customers are still seeing residual errors. The rule: status-page "resolved" requires a 15-minute clean-window of green metrics + customer verification (a tenant on Slack confirming).
Postmortem requirements (the 5-business-day rule)¶
The biggest postmortem anti-pattern is the "we'll write it next week" trap — next week never comes; the incident's details fade; the postmortem either never appears or appears two months later with half the timeline guessed from Slack search.
The rule: published in #bookstore-platform-postmortems within 5
business days of the all-clear, no exceptions. Concretely:
- Day 0 (incident day): the on-call drafts the timeline DURING the incident — every command run, every observation, with UTC timestamps.
- Day 1: draft Summary + Impact + Timeline; circulate to the IC and the on-call secondary.
- Day 2-3: root cause + contributing factors + action items.
- Day 4: review with the platform lead.
- Day 5: published.
If the deadline slips, the postmortem itself becomes a P3 alert
(postmortem_overdue) tracked by the platform lead. The 5-day rule is
strict because postmortems that miss the window almost never get
published. (We measured: at the 7-day mark, the publish rate dropped to
40 %; at the 14-day mark, 12 %. 5 days is the inflection point.)
48h vs. 5-day: Part 13 ch.12 sets a 48-hour deadline — that is the draft-by target (timeline + summary written while details are fresh). This section's 5-business-day deadline is the publish-by target (action items filed, owners assigned, platform lead signed off). Both deadlines apply; they operate at different granularities of the same discipline.
When a P-level changes mid-incident¶
Severities can escalate (P1 -> P0) or de-escalate (P0 -> P1) as new information arrives. The rules:
- Escalation (P1 -> P0): any responder can escalate; the page goes to the P0 escalation policy; war room opens; status-page entry upgraded. The on-call records the escalation in the timeline with the trigger (e.g. "16:42 UTC — we discovered the issue affects all tenants, not just acme-books; escalated to P0").
- De-escalation (P0 -> P1): requires the IC's explicit decision + agreement from the on-call primary. Recorded in the timeline. The customer-communication artifacts STAY at P0's bar even if the technical severity drops (the tenants who saw a status-page P0 banner need closure on that banner, not silent demotion).
Triggering events — which alert maps to which severity¶
The Alertmanager labels (in incident/pagerduty-integration.yaml) carry
the severity. The mapping from alert name to severity is reviewed
quarterly during alert-hygiene review.
| Alert | Default sev | Why |
|---|---|---|
BookstoreGatewayDown |
P0 | every customer affected; checkout path down |
BookstoreCheckoutErrorRateHigh |
P0 | payments-affecting; revenue + PCI exposure |
BookstoreTenantIsolationBreach |
P0 | regulatory; SOC 2 + tenant contract |
BookstoreRegionDown |
P1 | single region; multi-region active-active should absorb |
BookstoreSearchUnavailable |
P1 | feature broken cluster-wide; checkout still works |
BookstorePaymentsWebhookLag |
P1 | payments succeed; webhook delivery delayed (tenants notice) |
BookstoreCatalogP99Latency |
P1 | feature degradation; user-visible |
BookstoreDatabaseReplicationLag |
P1 | precursor to data-loss; CNPG sync replication issue |
BookstoreTenantBudgetExceeded |
P2 | single tenant; workaround = budget bump |
BookstoreNodeMemoryPressure |
P2 | Karpenter should self-heal; alert is the safety net |
BookstoreCertExpiringSoon |
P3 | cert-manager should auto-renew; reminder only |
BookstoreBackupRetentionDriftFromPolicy |
P3 | Velero retention mismatched with policy |
The full mapping lives in
../runbooks/ — each runbook header carries the alert
name + severity + the rationale.
On-call burnout protection¶
Pages-per-shift is itself a metric:
| Pages / shift | Status | Action |
|---|---|---|
| 0-2 | normal | none |
| 3-5 | elevated | mention at next handoff |
| 6-10 | high | mandatory alert-hygiene review in this sprint |
| 11-20 | critical | platform lead pauses non-essential alerts (Alertmanager silence with documented expiry); root-cause review |
| >20 | rotation broken | the rotation itself is a P1; platform team's next sprint is to fix the noise |
This metric ties to the on-call review cadence in chapter 15.11.
Review cadence¶
- Quarterly alert-hygiene review — every alert reviewed; noisy ones deleted; severities adjusted; thresholds re-tuned against actual page volume.
- Annual severity-matrix review — does the matrix still match the business? (After a major product release, the P0/P1 boundary often needs revising.)
Last reviewed: 2026-05-01. Next review: 2026-08-01. Owner: platform lead.