On-Call Rotation — Bookstore Platform v2¶
The on-call policy. Rotations + escalation + severity definitions + the response-time SLA. This file is the contract between the platform team and the team-on-call; changes require a PR + sign-off from the platform lead.
Rotation pattern¶
- Primary + secondary, 1-week shifts, Monday 10:00 UTC handoff.
- 5-engineer rotation — on-call frequency = once every 5 weeks. Below 4 engineers = burnout risk; above 10 engineers = skills decay.
- Time-off-during-shift policy: notify the secondary, push the handoff if needed, no Slack expectations off-shift.
Severity definitions¶
| Sev | Customer impact | Page? | Customer comm | Postmortem | Examples |
|---|---|---|---|---|---|
| P0 | >=50 % of customers OR data loss OR safety/regulatory | yes | 15 min | mandatory | full outage; tenant isolation breach |
| P1 | single region OR feature broken for all tenants OR isolation issue | yes | 1 hour | within 48 h | us-east down; search broken; replication lag |
| P2 | single tenant OR workaround exists OR cosmetic | slack | on-request only | not required | tenant X over-budget; UI glitch |
Response-time SLA¶
| Sev | Acknowledge | Mitigate | Postmortem |
|---|---|---|---|
| P0 | 5 min | 30 min | 48 h |
| P1 | 15 min | 4 h | 48 h |
| P2 | 1 business day | when convenient | n/a |
The clock starts when PagerDuty pages; "acknowledge" is the PD ack button. "Mitigate" = the customer-visible impact is over; the root-cause fix can wait for the postmortem.
Escalation¶
- Primary pages.
- If primary does not ack in 15 min → Secondary pages.
- If secondary does not ack in 15 min → Platform Lead pages.
- If platform lead does not ack in 15 min → CTO pages.
PagerDuty schedule ID: PXXXXX-bookstore-platform-primary
PagerDuty escalation policy: PYYYYY-bookstore-platform-escalation
Handoff meeting¶
Every Monday at 10:00 UTC. Outgoing primary briefs incoming primary on:
- Open incidents (any unclosed in PagerDuty).
- Recent postmortems + action items.
- Known issues that may page during the shift.
- Scheduled changes during the week (deploys, DR drills, chaos game-days).
Handoff doc template:
handoff-template.md — not shipped here;
generated each week by the platform team.
Off-shift expectations¶
- No Slack response expected. The on-call's Slack is paged via PagerDuty integration; the rest is opt-in.
- No
@team-platformpings off-shift. The escalation policy exists; use it. - Holidays + PTO covered. The platform lead arranges coverage; the engineer on PTO is removed from the rotation for the week.
Postmortem culture¶
Blameless. The template is in postmortem-template.md. The discipline:
- Required within 48 h for any P0 or P1.
- Action items have an owner + a ticket + a due date.
- Published widely (Slack + the platform's GitHub Wiki + the quarterly all-hands review).
-
80 % action-item-completion rate is the team metric.
Page-volume alerting¶
If the on-call rotation receives > 10 pages per shift, the rotation itself becomes a P2 alert (the alerting is broken; the platform team's next sprint is to fix it). Mitigations:
- Quarterly alert hygiene review. Every alert reviewed; noisy ones deleted; burn-rate thresholds re-tuned.
- Load shedding. If volume > 20 pages / shift, the platform lead pauses non-essential alerts via Alertmanager silence (with a documented expiry).
Rotation roster¶
The active rotation is in PagerDuty (schedule URL). Mirrored here as static documentation for the audit trail; PagerDuty is the source of truth.
| Week of | Primary | Secondary |
|---|---|---|
| 2026-05-19 | alice@team-payments | bob@team-catalog |
| 2026-05-26 | carol@team-search | dave@team-platform |
| 2026-06-02 | eve@team-catalog | alice@team-payments |
| 2026-06-09 | bob@team-catalog | carol@team-search |
| 2026-06-16 | dave@team-platform | eve@team-catalog |
Follow-the-sun (future)¶
The current pattern works at < 50 engineers in one to two time zones. Above that, the platform graduates to follow-the-sun:
- us-east team: 14:00-22:00 UTC.
- eu-west team: 06:00-14:00 UTC.
- ap-southeast team: 22:00-06:00 UTC.
Each region maintains its own primary + secondary rotation; pages route by time of day. This requires three independently-funded regional teams; not v2's scope.
Owners¶
- Platform Lead: owns this document.
- PagerDuty admin: owns the schedule + escalation policy.
- The rotation: owns the pages.
Last reviewed: 2026-05-01. Next review: 2026-08-01.