Chaos Game-Day — Quarterly Playbook¶
The quarterly chaos game-day. Hypothesis-driven, blast-radius-bounded, postmortem-closed. Runs against the staging cluster (never production at v2's maturity level). Five experiments + observation windows + postmortems.
Schedule¶
- Frequency: quarterly (every ~12 weeks).
- Duration: half-day (4 hours), with 30-minute experiment slots + 10-minute observation windows.
- Cluster: staging (
kind-bookstore-platform-staging); NEVER production. - Time: during business hours (10:00-14:00 UTC); the team is awake; we WANT the runbook to be exercised by humans.
Roles¶
- Game-day lead. Owns the day; runs the script; calls aborts.
- Observer. Watches the SLO dashboards; logs reactions; OK-or- abort calls.
- Runbook driver. Pretends to be the on-call; uses ONLY the runbook (no out-of-band knowledge); shows what works + what is ambiguous.
- Recorder. Captures the timeline; writes the postmortem stub.
- Team observers. Anyone else — silent unless asked.
Safety boundary¶
- Staging only.
- One experiment at a time.
- Maximum blast radius:
- Pod-level:
mode: one(one Pod). - Network: 1 minute of injected delay/loss.
- I/O: 1 minute of throttle.
- Region-loss: simulated by cordoning a node group, not a real region.
- Abort condition: a Prometheus query that fails the experiment early (e.g. "5xx rate > 50 % for 30 s" → abort).
- Manual abort: the game-day lead can call abort at any time.
The five experiments¶
Each experiment is declared in chaos-experiments.yaml (a Chaos Mesh
Workflow). Each has:
- A hypothesis ("storefront stays up").
- A safety boundary ("one Pod; 60 s").
- An abort condition (a Prometheus query).
- An observation window (10 min after the fault clears).
- A postmortem requirement (even if the experiment passed).
Experiment 1 — pod-kill (payments-gateway)¶
- Hypothesis: When one
payments-gatewayPod is killed inbookstore-platform-acme-books, the storefront's/healthzstays at 200 OK, and the catalog's p99 latency stays under 100 ms. The PodDisruptionBudget keepsminAvailable: 2. - Inject: Chaos Mesh
PodChaoswithaction: pod-kill, mode: one, duration: 60s. - Abort: 5xx rate > 5 % for 30 s.
- Observed: storefront stays up; PDB preserves replica count.
- Why it matters: rolling restarts + node failures look like this.
Experiment 2 — network-delay (catalog → orders)¶
- Hypothesis: When 200ms latency is injected on the catalog→orders network path, the catalog's p99 stays under 500 ms (the timeout + retry logic handles it).
- Inject: Chaos Mesh
NetworkChaoswithaction: delay, latency: 200ms, duration: 5m. - Abort: p99 > 2 s for 1 min.
- Observed: retries fire; p99 stays bounded; no 5xx spike.
- Why it matters: cross-region calls + noisy neighbours look like this.
Experiment 3 — io-stress (CNPG node)¶
- Hypothesis: When I/O stress is applied to the CNPG primary's node, the database's connection pool stays under 80 % usage; the catalog and orders services see < 1 s p99.
- Inject: Chaos Mesh
StressChaoswithstressors: io: workers: 4, duration: 2m. - Abort: CNPG replication lag > 60 s.
- Observed: I/O degraded; replication slows but doesn't break; read replicas serve some traffic.
- Why it matters: disk saturation in cloud-managed disks is common; this is the resilience drill.
Experiment 4 — region-DNS-loss (simulated cross-region outage)¶
- Hypothesis: When cross-region DNS resolution fails (a Chaos
Mesh
DNSChaosreturns NXDOMAIN for*.bookstore-platform.example.com), the saga compensates and the DR runbook completes in < 30 minutes; eu-west becomes primary; checkout works against eu-west. - Inject (primary): Chaos Mesh
DNSChaoswithaction: error, mode: all, patterns: ["*.bookstore-platform.example.com"], scoped to pods labeledapp.kubernetes.io/part-of: bookstore-platform. - Inject (complementary): the game-day lead also runs
kubectl cordonon all us-east nodes + drain — disrupting what the cluster sees of its own nodes, while DNSChaos disrupts what the application sees of remote regions. The two are designed to be exercised together. - Abort: runbook driver cannot recover in 60 minutes; the lead calls abort + restores.
- Observed: the runbook driver follows
dr-drill-script.sh; RTO is measured; RPO is measured. - Why it matters: the actual disaster scenario.
Experiment 5 — payments-gateway-failure (Stripe 500)¶
- Hypothesis: When the Stripe webhook returns 500 to 30 % of
callbacks, the saga compensation in payments-worker fires; failed
orders are rolled back; no order is left in
payment_pendingfor more than 5 minutes. - Inject: Custom toxiproxy / mock-stripe pod returning HTTP 500 with probability 0.3.
- Abort: payments-worker queue depth > 1000 (a backlog explosion).
- Observed: the saga's compensation transactions appear in the outbox; the failed orders are correctly rolled back.
- Why it matters: the payment failure mode the v1 toy could not reproduce; v2 must survive it.
Runbook drill component¶
For experiments 1, 2, 3, and 5, the runbook driver follows the relevant runbook ONLY:
- Experiment 1 →
runbook-api-latency-p99.md(the PDB held; the alert may not fire; the runbook should still document the scenario). - Experiment 2 →
runbook-api-latency-p99.md(timeout + retry path). - Experiment 3 →
runbook-database-replication-lag.md. - Experiment 5 →
runbook-payments-failure-rate.md.
The driver's job is to identify ambiguity in the runbook. If a step is unclear or the wrong command, that is an action item.
Observation windows¶
After each experiment clears, wait 10 minutes before starting the next. The observation window catches:
- Delayed effects (a backlog draining; a slow-burning alert).
- The "everything looks fine but a critical metric is silently wrong" case.
- The on-call's recovery (was the alert silenced? did it re-fire?).
Postmortem¶
For each experiment:
- Hypothesis met? Yes / No.
- Resilience control held? Which one — PDB / retry / timeout / saga.
- Runbook gaps found? Each gap → action item.
- Surprise findings? Anything unexpected.
- Action items: each with owner + ticket + due date.
Template:
postmortem-template.md. One postmortem
file per experiment; file at
runbooks/chaos-postmortem-$(date +%F)-experiment-N.md.
Trend tracking¶
Each quarterly game-day produces a row in the team's resilience dashboard:
| Quarter | Pass count | Fail count | Action items | Action-item closure rate |
|---|---|---|---|---|
| Q1 2026 | 4/5 | 1/5 | 7 | 5/7 closed by Q2 |
| Q2 2026 | 5/5 | 0/5 | 3 | 3/3 closed by Q3 |
| Q3 2026 | TBD | TBD | TBD | TBD |
Target: > 80 % action-item closure rate by the next game-day.
Maturity path¶
The maturity ladder for chaos in production (v2's current rung is rung 2):
- rung 1: chaos in kind / dev.
- rung 2: chaos in staging during business hours. (v2 today)
- rung 3: chaos in staging off-hours.
- rung 4: chaos in production during business hours, one experiment per quarter, customer-impact-review-required.
- rung 5: chaos in production automated (the Netflix Chaos Monkey shape).
Each rung is a discrete promotion; the team must hit > 90 % action-item-closure on the previous rung first.
Last game-day¶
- Date: 2026-04-15.
- Result: 4/5 pass (experiment 4 timed out; runbook needs work).
- Postmortem:
chaos-postmortem-2026-04-15.md(not shipped in this template; the file exists for each real game-day). - Next: 2026-07-15.
Owners¶
- Platform Lead: owns this playbook + the schedule.
- Game-day lead: rotates each quarter.
- Action items: the assigned owner; tracked in the platform's GitHub project board.