Incident severity matrix: Bookstore Platform v2

This matrix is the contract between the platform team, the on-call rotation, and the business. Every alert that fires has a severity, and every severity has a response-time SLA, a customer-communication requirement, and a postmortem rule. It is the document the on-call reads first; everything else flows from here.

The matrix is the source of truth for Alertmanager routing (incident/pagerduty-integration.yaml), PagerDuty escalation policies, and the postmortem template's Severity field.

Severity ladder

(War room = synchronous video bridge (Zoom, Meet, or Slack huddle), opened for every P0 and for any P1 not resolved within 1 hour.)

Sev	Customer impact	Page?	Customer comm	War room	Postmortem	Examples
P0	>=50 % of customers down OR data loss OR safety / regulatory	PagerDuty + phone call	status page within 15 min + email to affected tenants	within 30 min	mandatory; published within 5 business days	full-platform outage; checkout broken; tenant isolation breach; data corruption confirmed
P1	single region down OR feature broken for ALL tenants OR tenant isolation suspected	PagerDuty page	status page within 1 hour	within 1 hour if not resolved	required within 5 business days	us-east down; search returning empty; recommendations down; webhook delivery lagging >30 min
P2	single tenant affected OR workaround exists OR error rate elevated but bounded	PagerDuty low-urgency page (business hours)	on request	no	optional; team discretion	tenant X's search index lagging; tenant Y over-budget warning; payments retries elevated
P3	cosmetic OR internal-only OR maintenance reminder	Slack only	none	no	no	dashboard panel broken; backup retention warning; alert hygiene reminder

>=50 % is the formal P0 threshold. In practice, any payments-cluster-wide failure or any tenant-isolation breach is P0 regardless of the percentage, because the regulatory exposure (PCI for payments; SOC 2 for isolation) is non-negotiable.
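The ladder and its overrides can be read as a first-match decision rule. The following is a minimal sketch of that reading, not the platform's actual tooling: the `Impact` fields and the `classify` function are hypothetical names introduced for illustration, and the catch-all P2 branch is an assumption about how bounded, customer-visible problems are treated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Hypothetical summary of what responders currently know."""
    customers_affected_pct: float = 0.0   # 0-100, share of customers down
    data_loss: bool = False
    payments_cluster_wide: bool = False
    isolation_breach_confirmed: bool = False
    isolation_breach_suspected: bool = False
    region_down: bool = False
    feature_broken_all_tenants: bool = False
    internal_or_cosmetic: bool = False


def classify(impact: Impact) -> str:
    """Return "P0".."P3", checking the most severe conditions first."""
    # Regulatory overrides: P0 regardless of the affected percentage.
    if impact.payments_cluster_wide or impact.isolation_breach_confirmed:
        return "P0"
    # Formal P0 threshold from the ladder.
    if impact.customers_affected_pct >= 50 or impact.data_loss:
        return "P0"
    if (impact.region_down
            or impact.feature_broken_all_tenants
            or impact.isolation_breach_suspected):
        return "P1"
    if impact.internal_or_cosmetic:
        return "P3"
    # Assumption: anything else customer-visible but bounded is P2.
    return "P2"
```

For example, `classify(Impact(customers_affected_pct=3, payments_cluster_wide=True))` yields `"P0"` even though only 3 % of customers are affected, which is the override described above.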

Response-time SLA

Sev	Acknowledge	Mitigate (customer-visible impact ends)	Postmortem published	All-clear declared by
P0	5 min	30 min	5 business days	Incident Commander
P1	15 min	4 hours	5 business days	On-call primary
P2	1 business day	next business day	n/a	On-call primary
P3	next sprint planning	next sprint	n/a	Team owner

The clock starts when PagerDuty pages. Acknowledge is the PD ack button: not a Slack message, not a "saw it on my phone", but the literal API ack that stops the escalation timer. Mitigate is when the customer-visible impact is over; the root-cause fix can wait. Postmortem published means posted in #bookstore-platform-postmortems, linked in the GitHub Wiki, and action items filed.
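As a worked example of the clock rule, the sketch below derives the acknowledge and mitigate deadlines from the page time. It covers only P0 and P1, whose targets are wall-clock durations; P2 and P3 are stated in business days and sprints and are left out. The `SLA` table, `deadlines`, and `ack_breached` are illustrative names, not part of any existing tool.

```python
from datetime import datetime, timedelta, timezone

# Wall-clock targets copied from the response-time table (P0 and P1 only).
SLA = {
    "P0": {"ack": timedelta(minutes=5), "mitigate": timedelta(minutes=30)},
    "P1": {"ack": timedelta(minutes=15), "mitigate": timedelta(hours=4)},
}


def deadlines(severity: str, paged_at: datetime) -> dict:
    """Return ack and mitigate deadlines, measured from the page time."""
    return {name: paged_at + delta for name, delta in SLA[severity].items()}


def ack_breached(severity: str, paged_at: datetime, acked_at: datetime) -> bool:
    """True when the PagerDuty ack landed after the ack deadline."""
    return acked_at > deadlines(severity, paged_at)["ack"]


paged = datetime(2026, 5, 1, 16, 0, tzinfo=timezone.utc)
p0 = deadlines("P0", paged)
print(p0["ack"].isoformat(), p0["mitigate"].isoformat())
```

Keeping timestamps timezone-aware in UTC matches the timeline convention used elsewhere in this matrix.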

What "page" means

Page channel	Severity routed	Quiet hours?	Escalation
PagerDuty high-urgency	P0	never	primary -> secondary at 5 min -> platform lead at 15 min -> CTO at 30 min
PagerDuty page	P1	never	primary -> secondary at 15 min -> platform lead at 1 hour
PagerDuty low-urgency	P2	yes (suppress 22:00-06:00 local)	primary only; no escalation
Slack `#bookstore-alerts`	P3	yes	none

Among the PagerDuty channels, quiet-hours suppression applies only to P2 (its SLA is "next business day" anyway); P3 never pages and only posts to Slack. P0 and P1 page at all hours; that is what on-call means.
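The routing table can be sketched as two small lookups: one for quiet-hours suppression and one for who has been notified after a given number of unacknowledged minutes. This is an illustration under assumptions, not the real Alertmanager or PagerDuty configuration: the channel labels and function names are hypothetical, and only the P2 window (22:00-06:00 local) is modeled because no window is given for the Slack channel.

```python
from datetime import time
from typing import List, Optional

ROUTES = {
    "P0": "pagerduty-high-urgency",
    "P1": "pagerduty-page",
    "P2": "pagerduty-low-urgency",
    "P3": "slack-bookstore-alerts",
}

# (minutes without ack, who gets notified), per the escalation column.
ESCALATION = {
    "P0": [(0, "primary"), (5, "secondary"), (15, "platform lead"), (30, "CTO")],
    "P1": [(0, "primary"), (15, "secondary"), (60, "platform lead")],
    "P2": [(0, "primary")],
    "P3": [],
}

QUIET_START, QUIET_END = time(22, 0), time(6, 0)


def in_quiet_hours(local: time) -> bool:
    """The window wraps midnight: 22:00 <= t, or t < 06:00."""
    return local >= QUIET_START or local < QUIET_END


def route(severity: str, local: time) -> Optional[str]:
    """Return the channel label, or None when a P2 page is suppressed."""
    if severity == "P2" and in_quiet_hours(local):
        return None
    return ROUTES[severity]


def notified(severity: str, minutes_unacked: int) -> List[str]:
    """Everyone paged so far if nobody has acknowledged yet."""
    return [who for at, who in ESCALATION[severity] if minutes_unacked >= at]
```

With these tables, a P0 left unacknowledged for 16 minutes has reached the primary, the secondary, and the platform lead, while a P1 at 14 minutes has reached only the primary.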

Customer-communication requirements

Every P0 and P1 incident has a customer-communication artifact:

Status page at status.bookstore-platform.example.com: public, in the style of Atlassian Statuspage, StatusGator, or Cachet. Updated by the Incident Commander (P0) or on-call primary (P1).
Email to affected tenants — for P0 only; the IC drafts; legal + customer-success sign off; sent within 1 hour of the all-clear.
Public postmortem: the public-facing version of the postmortem (sanitized of internal vendor names + dollar figures) goes on the status page within 7 business days of the incident. This is separate from the internal postmortem, which is due within 5 business days.

The two customer-comm anti-patterns we explicitly forbid:

The "silent fix" — the team rolls back, customer impact ends, and no status-page entry is ever created. Forbidden. Every P0/P1 must have a public-facing artifact; the audit trail is non-negotiable.
The "all clear too soon" — a status-page entry that goes to "resolved" while customers are still seeing residual errors. The rule: status-page "resolved" requires a 15-minute clean-window of green metrics + customer verification (a tenant on Slack confirming).
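The "resolved" rule is a conjunction of two conditions, which a small gate makes explicit. This is a sketch under assumptions: `last_unhealthy_at` stands for the most recent time any tracked metric was not green, and `tenant_confirmed` for the manual tenant confirmation; both names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CLEAN_WINDOW = timedelta(minutes=15)


def can_mark_resolved(now: datetime,
                      last_unhealthy_at: datetime,
                      tenant_confirmed: bool) -> bool:
    """Allow "resolved" only after 15 clean minutes AND tenant confirmation."""
    clean_long_enough = now - last_unhealthy_at >= CLEAN_WINDOW
    return clean_long_enough and tenant_confirmed


now = datetime(2026, 5, 1, 17, 0, tzinfo=timezone.utc)
print(can_mark_resolved(now, now - timedelta(minutes=20), tenant_confirmed=False))
```

Neither condition can substitute for the other: green metrics without a tenant confirmation, or a confirmation inside the 15-minute window, both keep the entry open.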

Postmortem requirements (the 5-business-day rule)

The biggest postmortem anti-pattern is the "we'll write it next week" trap — next week never comes; the incident's details fade; the postmortem either never appears or appears two months later with half the timeline guessed from Slack search.

The rule: published in #bookstore-platform-postmortems within 5 business days of the all-clear, no exceptions. Concretely:

Day 0 (incident day): the on-call drafts the timeline DURING the incident — every command run, every observation, with UTC timestamps.
Day 1: draft Summary + Impact + Timeline; circulate to the IC and the on-call secondary.
Day 2-3: root cause + contributing factors + action items.
Day 4: review with the platform lead.
Day 5: published.

If the deadline slips, the postmortem itself becomes a P3 alert (postmortem_overdue) tracked by the platform lead. The 5-day rule is strict because postmortems that miss the window almost never get published. (We measured: at the 7-day mark, the publish rate dropped to 40 %; at the 14-day mark, 12 %. 5 days is the inflection point.)
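The deadline arithmetic is easy to get wrong around weekends, so here is a minimal sketch of it. It assumes business days are Monday through Friday and ignores public holidays, which a real implementation would need a calendar for; `add_business_days` and `postmortem_overdue` are illustrative names rather than existing tooling.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays (Mon-Fri). Holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def postmortem_overdue(all_clear: date, today: date, published: bool) -> bool:
    """True when the postmortem_overdue P3 alert should be raised."""
    return not published and today > add_business_days(all_clear, 5)


# An all-clear on Friday 2026-05-01 gives a publish deadline of Friday 2026-05-08.
print(add_business_days(date(2026, 5, 1), 5))
```

The check uses a strict `>` so that a postmortem published on the deadline day itself still counts as on time.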

48h vs. 5-day: Part 13 ch.12 sets a 48-hour deadline; that is the draft-by target (timeline + summary written while details are fresh). This section's 5-business-day deadline is the publish-by target (action items filed, owners assigned, platform lead signed off). Both deadlines apply: the draft is a checkpoint on the way to publication, not an alternative to it.

When a P-level changes mid-incident

Severities can escalate (P1 -> P0) or de-escalate (P0 -> P1) as new information arrives. The rules:

Escalation (P1 -> P0): any responder can escalate; the page goes to the P0 escalation policy; war room opens; status-page entry upgraded. The on-call records the escalation in the timeline with the trigger (e.g. "16:42 UTC — we discovered the issue affects all tenants, not just acme-books; escalated to P0").
De-escalation (P0 -> P1): requires the IC's explicit decision + agreement from the on-call primary. Recorded in the timeline. The customer-communication artifacts STAY at P0's bar even if the technical severity drops (the tenants who saw a status-page P0 banner need closure on that banner, not silent demotion).
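One way to picture the two rules is to track the technical severity and the customer-communication bar separately, with the communication bar only ever ratcheting upward. The sketch below does that; the `Incident` class and its fields are hypothetical, and applying the two-person approval to every de-escalation (not only P0 -> P1) is an assumption.

```python
from dataclasses import dataclass, field

RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}  # lower rank = more severe


@dataclass
class Incident:
    technical_sev: str
    comm_sev: str = ""
    timeline: list = field(default_factory=list)

    def __post_init__(self):
        # The communication bar starts at the initial severity.
        self.comm_sev = self.comm_sev or self.technical_sev

    def change_severity(self, new_sev: str, utc_time: str, reason: str,
                        ic_approved: bool = False,
                        primary_agreed: bool = False) -> None:
        deescalating = RANK[new_sev] > RANK[self.technical_sev]
        # Any responder may escalate; de-escalation needs IC + primary.
        if deescalating and not (ic_approved and primary_agreed):
            raise PermissionError("de-escalation needs IC decision and primary agreement")
        self.timeline.append((utc_time, self.technical_sev, new_sev, reason))
        self.technical_sev = new_sev
        # Customer-comm artifacts never drop below the highest bar reached.
        if RANK[new_sev] < RANK[self.comm_sev]:
            self.comm_sev = new_sev


incident = Incident("P1")
incident.change_severity("P0", "16:42", "affects all tenants, not just acme-books")
incident.change_severity("P1", "17:30", "impact re-scoped",
                         ic_approved=True, primary_agreed=True)
print(incident.technical_sev, incident.comm_sev)
```

After the de-escalation the technical severity is back at P1 while the communication bar remains at P0, so the status-page banner still gets a proper P0 closure.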

Triggering events: which alert maps to which severity

The Alertmanager labels (in incident/pagerduty-integration.yaml) carry the severity. The mapping from alert name to severity is re-examined during the quarterly alert-hygiene review.

Alert	Default sev	Why
`BookstoreGatewayDown`	P0	every customer affected; checkout path down
`BookstoreCheckoutErrorRateHigh`	P0	payments-affecting; revenue + PCI exposure
`BookstoreTenantIsolationBreach`	P0	regulatory; SOC 2 + tenant contract
`BookstoreRegionDown`	P1	single region; multi-region active-active should absorb
`BookstoreSearchUnavailable`	P1	feature broken cluster-wide; checkout still works
`BookstorePaymentsWebhookLag`	P1	payments succeed; webhook delivery delayed (tenants notice)
`BookstoreCatalogP99Latency`	P1	feature degradation; user-visible
`BookstoreDatabaseReplicationLag`	P1	precursor to data-loss; CNPG sync replication issue
`BookstoreTenantBudgetExceeded`	P2	single tenant; workaround = budget bump
`BookstoreNodeMemoryPressure`	P2	Karpenter should self-heal; alert is the safety net
`BookstoreCertExpiringSoon`	P3	cert-manager should auto-renew; reminder only
`BookstoreBackupRetentionDriftFromPolicy`	P3	Velero retention mismatched with policy

The full mapping lives in ../runbooks/ — each runbook header carries the alert name + severity + the rationale.
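For scripts that need the default severity of an alert (for example, a check that every runbook header agrees with this table), the table can be mirrored as a plain lookup. This is only a sketch: the authoritative mapping remains the Alertmanager labels and the runbook headers, and `severity_for` and `alerts_at` are hypothetical helper names.

```python
DEFAULT_SEVERITY = {
    "BookstoreGatewayDown": "P0",
    "BookstoreCheckoutErrorRateHigh": "P0",
    "BookstoreTenantIsolationBreach": "P0",
    "BookstoreRegionDown": "P1",
    "BookstoreSearchUnavailable": "P1",
    "BookstorePaymentsWebhookLag": "P1",
    "BookstoreCatalogP99Latency": "P1",
    "BookstoreDatabaseReplicationLag": "P1",
    "BookstoreTenantBudgetExceeded": "P2",
    "BookstoreNodeMemoryPressure": "P2",
    "BookstoreCertExpiringSoon": "P3",
    "BookstoreBackupRetentionDriftFromPolicy": "P3",
}


def severity_for(alert_name: str) -> str:
    """Look up the default severity; unmapped alerts fail loudly."""
    try:
        return DEFAULT_SEVERITY[alert_name]
    except KeyError:
        raise KeyError(f"no default severity mapped for {alert_name!r}") from None


def alerts_at(severity: str) -> list:
    """All alert names that default to the given severity, sorted."""
    return sorted(name for name, sev in DEFAULT_SEVERITY.items() if sev == severity)


print(alerts_at("P0"))
```

Raising on an unknown alert name, rather than defaulting to a low severity, keeps an unmapped alert from being silently under-routed.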

On-call burnout protection

Pages-per-shift is itself a metric:

Pages / shift	Status	Action
0-2	normal	none
3-5	elevated	mention at next handoff
6-10	high	mandatory alert-hygiene review in this sprint
11-20	critical	platform lead pauses non-essential alerts (Alertmanager silence with documented expiry); root-cause review
>20	rotation broken	the rotation itself is a P1; the platform team's next sprint is dedicated to fixing the noise

This metric ties to the on-call review cadence in chapter 15.11.
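The pages-per-shift bands translate directly into a threshold lookup, sketched below for use in a handoff report or dashboard. The function name and the idea of computing it automatically are assumptions; the bands themselves are copied from the table above.

```python
# (inclusive upper bound of pages per shift, status), per the table above.
THRESHOLDS = [
    (2, "normal"),
    (5, "elevated"),
    (10, "high"),
    (20, "critical"),
]


def shift_status(pages: int) -> str:
    """Map a shift's page count to its burnout status band."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    for upper, status in THRESHOLDS:
        if pages <= upper:
            return status
    return "rotation broken"


for count in (1, 4, 8, 15, 25):
    print(count, shift_status(count))
```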

Review cadence

Quarterly alert-hygiene review — every alert reviewed; noisy ones deleted; severities adjusted; thresholds re-tuned against actual page volume.
Annual severity-matrix review — does the matrix still match the business? (After a major product release, the P0/P1 boundary often needs revising.)

Last reviewed: 2026-05-01. Next review: 2026-08-01. Owner: platform lead.