Postmortem — Checkout 5xx Spike During Spring Sale (INC-2026-04-15-001)¶
Reader's guide: this is a fully worked example postmortem, written against the postmortem template, for a real- shape incident on the Bookstore Platform. It is teaching material: read it side-by-side with the template to see what "good" looks like. Names, tickets, dollar figures are illustrative.
Metadata¶
- Incident ID: INC-2026-04-15-001
- Title: Checkout 5xx spike during Spring Sale flash promotion
- Date / Time (UTC):
- First alert: 2026-04-15 14:03 UTC
- All-clear: 2026-04-15 14:42 UTC
- Duration of customer-visible impact: 39 minutes
- Severity: P0 (>=50 % of customers affected; checkout broken; revenue-bearing path)
- Affected tenants: all (
acme-books,globex-press,initech-reads,umbrella-publishing, plus 6 free-tier tenants) - Affected regions: us-east-1 (primary), us-west-2 (cascaded via shared ALB)
- Affected services: payments-gateway, checkout-orchestrator, storefront (degraded UX from upstream errors)
- Detection method: automated alert (
BookstoreCheckoutErrorRateHigh) - Incident Commander (IC): Carol Mendez @ platform-team
- On-call primary at time of incident: Bob Park @ team-payments
- On-call secondary at time of incident: Dave Lin @ team-platform
- Recorder (timeline scribe): Eve Nakamura @ team-payments
- Author of this postmortem: Carol Mendez
Summary¶
Between 14:03 and 14:42 UTC on 2026-04-15, the Bookstore Platform's
checkout path returned HTTP 502 for an estimated 78 % of requests in
us-east-1, cascading to us-west-2 via the shared payments-gateway
deployment. Root cause: the payments-gateway pods were OOMKilled by
the kernel after a flash-promotion traffic spike pushed cart payloads
past the gateway's in-memory parser buffer; the per-request size limit
was 64 MB on the gateway but 1 MB on the storefront, allowing
oversized carts to reach the gateway. Approximately 4,200 checkout
attempts failed; estimated $58,000 in deferred revenue and $12,000 in
SLA credits owed to enterprise tenants. We mitigated by scaling the
payments-gateway deployment from 4 to 12 replicas and reducing the
request-body limit to 1 MB at the gateway. Prevention: an admission-
controller invariant check across the storefront -> gateway
request-size contract (Action Item A1).
Customer impact¶
- Number of customers affected: ~4,200 unique customers across all 10 tenants (we measure unique session IDs that hit
POST /api/checkoutand received a 502 between 14:03 and 14:42 UTC). - Tenants affected (named):
acme-books(Enterprise tier; got proactive email at 14:35)globex-press(Enterprise tier; got proactive email at 14:35)initech-reads(Enterprise tier; got proactive email at 14:38)umbrella-publishing(Enterprise tier; got proactive email at 14:38)- 6 free-tier tenants (no proactive comm per their contract; visible on status page)
- Revenue impact:
- Deferred revenue: ~$58,000 (4,200 attempts × average cart $13.80)
- SLA credits owed: ~$12,000 (Enterprise tier 99.95% SLA breach; 39 min > 21-min monthly budget)
- Refund / reorder discounts offered: ~$3,400 (10 % "we're sorry" discount on first 250 customers who emailed support)
- Data impact: None. No orders entered an inconsistent state. The outbox-pattern (ch.13.06) means failed checkouts never write to the orders table; Stripe was not charged for any of the failed attempts (verified via Stripe dashboard reconciliation 2026-04-16). The saga-compensation flow was not invoked because no payment intent was ever created.
- SLO impact: Checkout's monthly error budget (allowed: 21 minutes of >0.1 % errors; 99.95 % SLO) was consumed entirely + overspent by 18 minutes. Error budget alert (
BookstoreCheckoutSLOBurnRate) fired at 14:08 and again at 14:21. Budget recovery: ~28 days. - Status-page entry: https://status.bookstore-platform.example.com/incidents/2026-04-15-checkout-degraded (link illustrative).
- Customer-comm artifacts: 4 enterprise emails sent (Carol drafted, customer-success reviewed, sent within 1 hour of all-clear); 1 public-facing status-page entry with 3 updates; 38 support tickets opened, all responded to within 2 hours.
Timeline (UTC)¶
| Time | Event | Source / link |
|---|---|---|
| 13:55 | Spring Sale promotion went live (planned event; runbook noted "expect 3x traffic for 2 hours") | marketing-calendar #1842 |
| 13:58 | Storefront traffic ramped from 800 RPS to 2,400 RPS in 3 minutes | Grafana ramp |
| 14:01 | First oversized cart submitted: customer with 1,847 items in cart → 47 MB JSON | Loki query |
| 14:02 | First OOMKill on payments-gateway pod payments-gateway-7d4f-q2p8 |
k8s event |
| 14:03 | BookstoreCheckoutErrorRateHigh alert fired (P0; error rate breached 5 % threshold) |
Alert rule |
| 14:03 | PagerDuty paged primary Bob Park (high-urgency) | PD incident Q7N8K2 |
| 14:04 | Bob acked the page from his phone | PD ack at 14:04:18 |
| 14:05 | Bob opened runbook-payments-failure-rate.md |
runbook link |
| 14:06 | Bob ran kubectl get pods -n bookstore-platform-payments — saw 4 pods, 3 of them with restartCount ≥ 2 in the last 5 min |
runbook Step 1 PASS (real alert, not flapping) |
| 14:07 | Bob created incident Slack channel #inc-2026-04-15-001 |
Slack channel |
| 14:08 | BookstoreCheckoutSLOBurnRate fired (burn rate >14.4× SLO; 1-hour budget consumed in 4 min) |
Alert |
| 14:09 | Bob paged the secondary, Dave Lin: "need a hand; checkout is OOMing" | PD escalation manual |
| 14:09 | Status-page entry created (manual; Bob): "Investigating checkout errors" | status-page v1 |
| 14:11 | Dave joined; began Step 2 (Diagnose) of the runbook — checked Grafana payments dashboard | Grafana |
| 14:11 | Dave observed: pod memory plot showed 4 GB peak (limit was 2 GB) before OOMKill | memory chart |
| 14:12 | Bob announced "I'm escalating to a war room; please respect the IC role" — IC role unclaimed at this point | Slack message |
| 14:13 | Carol Mendez (platform lead) joined the Slack channel after seeing the P0 in #platform-leadership | Slack |
| 14:14 | Carol took the IC role: "Carol IC. Bob, you're on diagnosis. Dave, on mitigation. Eve, recording. Status comms through me." | Slack message |
| 14:15 | Zoom war room opened: https://zoom.us/j/... (Incident.io auto-created) |
Zoom |
| 14:16 | Carol posted status-page update v2: "Checkout currently degraded; investigating" | status-page v2 |
| 14:18 | Dave proposed mitigation: scale payments-gateway from 4 → 12 replicas (more capacity to absorb OOM-driven restarts) |
Slack |
| 14:19 | Carol approved; Bob ran kubectl scale deployment payments-gateway --replicas=12 -n bookstore-platform-payments |
Slack + terminal |
| 14:20 | Karpenter provisioned 1 new node to fit the additional pods (~90 sec) | Karpenter event |
| 14:22 | 12 replicas all Running; error rate dropped from 78 % to 31 % |
Grafana |
| 14:24 | Bob noticed in the Loki log stream: "request body too large to parse: 47 MB"; recognized the oversized-cart pattern | Loki query |
| 14:26 | Bob proposed a second mitigation: reduce payments-gateway body-size limit from 64 MB to 1 MB (matches storefront) |
Slack |
| 14:27 | Carol authorized; Bob edited the ConfigMap, did a rolling restart | kubectl edit cm payments-gateway-config; kubectl rollout restart deployment payments-gateway |
| 14:30 | Restart complete; error rate began dropping below 5 % | Grafana |
| 14:31 | Carol posted status-page v3: "Mitigation deployed; monitoring" | status-page v3 |
| 14:33 | Error rate at 0.4 % (within SLO budget for steady-state); back to expected levels | Grafana |
| 14:35 | Carol began drafting customer email; customer-success reviewed | Email draft |
| 14:38 | 4 enterprise emails sent to acme-books, globex-press, initech-reads, umbrella-publishing | Email logs |
| 14:42 | 12-min clean window of green metrics complete; Carol declared all-clear | Slack message |
| 14:42 | Status-page resolved entry posted | status-page v4 |
| 14:55 | Incident Slack channel pinned for postmortem; Eve finalized timeline | Slack |
| 14:58 | Postmortem doc started (this document) — first draft outline within 16 minutes of all-clear | This file |
Root cause (5 Whys)¶
Problem: Checkout returned 5xx for 39 minutes during the Spring Sale promotion, affecting ~4,200 customers across all 10 tenants.
Why 1 — Why did checkout return 5xx?
The payments-gateway pods were OOMKilled by the kernel within 5
seconds of receiving certain requests; the kernel killed the process,
the kubelet restarted the pod, but during the 30-second restart window
the request load (~2,400 RPS) overwhelmed the remaining 3 pods, which
also began OOMKilling. The error rate spiked because requests routed
to pods that were dying or restarting received TCP RSTs (502s at the
ingress).
Why 2 — Why were the pods OOMKilled?
Each pod's memory usage spiked to ~4 GB (limit: 2 GB) when parsing
oversized JSON cart payloads. The payments-gateway parses the
entire cart payload in memory before validation; a 47 MB JSON payload
with deeply nested item structures expanded to ~3.2 GB in the
in-memory representation. Combined with the gateway's normal ~600 MB
working-set, this exceeded the 2 GB limit.
Why 3 — Why was a 47 MB cart payload reaching the gateway?
The storefront enforces a 1 MB request-body limit (cart size,
inclusive of headers), but the payments-gateway accepted bodies up
to 64 MB (the Go http.MaxBytesReader default). The customer's cart
was assembled via the API directly (bypassing the storefront UI;
probably a misconfigured tenant integration) and the cart payload
contained 1,847 line items with full metadata = 47 MB. The gateway
saw it, accepted it (under its 64 MB limit), and tried to parse it.
Why 4 — Why did the gateway have a higher limit than the storefront?
The 1 MB limit existed in the storefront from day 1. The
payments-gateway was rewritten in Q1 2026 (the Go service that
replaced the legacy Java implementation) and the new service used the
Go http.MaxBytesReader default (64 MB) instead of being configured
to match the storefront's limit. The rewrite PR was reviewed and
approved; the reviewers did not catch the limit drift because the
limit configuration was not in either service's contract definition.
Why 5 — Why didn't we catch the cross-service drift? There is no mechanism to enforce cross-service configuration invariants. The storefront's request-size limit and the payments-gateway's request-size limit are independent configurations maintained by independent teams. Code reviews focus on per-service changes; cross-service invariants (e.g. "request-size limit at the edge must be ≥ request-size limit at every downstream service") are invisible to the reviewer of a single PR. We have similar invariants that ARE enforced — e.g. the OpenAPI spec generation pipeline enforces that every API exposed by the gateway is documented — but we have no configuration-invariant checker.
Root cause statement:
The Bookstore Platform has no mechanism to enforce cross-service
configuration invariants. The payments-gateway Q1 rewrite drifted
from the storefront's 1 MB request-size limit to the Go default of 64
MB; the drift was invisible to reviewers because no test, lint, or
admission check exists for cross-service invariants. The Spring Sale
traffic spike + a single oversized cart was the trigger; the missing
invariant check was the cause.
Contributing factors¶
- Contributing factor 1 — pod memory limit too tight for the actual
workload. Even if the request-size limit had matched, the
payments-gateway's 2 GB memory limit was set 18 months ago for a smaller request profile. The Q1 rewrite increased the working-set size; nobody re-VPA'd the limit. A 4 GB limit would have absorbed the oversized request without OOMKilling (though the request would still have been malformed). - Contributing factor 2 — IC role unclaimed for 12 minutes (14:02 to 14:14). Bob attempted to debug AND coordinate AND communicate, all three poorly. The status-page entry was 13 minutes late (the P0 SLA is "within 15 minutes"; we cleared the bar by 2 minutes). The proper IC pattern is one of the responders explicitly takes the IC role within 5 minutes of the page.
- Contributing factor 3 — chaos game-day gap. The chaos workflow
has a
payments-failureexperiment (HTTP 500 from Stripe webhook) and apod-killexperiment (kills 1 pod), but no experiment for "pod OOMKilled under sustained load." The OOMKill blast radius pattern is meaningfully different from a pod-kill: in OOMKill, the TARGET pod is unhealthy AND the load that killed it is still arriving at the survivors.
Why we did not catch this earlier¶
- The alert that should have caught this: A
BookstorePodOOMKilledRecentalert would have fired on the FIRST OOMKill at 14:02 — a full minute before the customer-visible error rate breached the SLO. We do not have this alert. (Action item A2.) - The dashboard panel that would have shown this: The
bookstore-paymentsGrafana dashboard has a memory-usage panel, but no OOMKill-event panel. The on-call relies onkubectl get eventsduring triage rather than seeing OOMKill events in real time on the dashboard. (Action item A3.) - The chaos experiment that would have surfaced this: A
StressChaos experiment that consumes memory inside the
payments-gatewaypod (up to 95 % of the limit) under a load-test driver would have OOMKilled the pod in staging weeks ago. We do not have this experiment. (Action item A4.) - The pre-launch readiness check that should have run: A flash promotion of expected 3x traffic should have triggered a load-test rehearsal in staging; the standard promotion-readiness checklist was not run because Spring Sale was "just another promotion." (Action item A5.)
What went right¶
- The outbox pattern saved us from data corruption. No order made it into a half-committed state; Stripe was never charged for any failed checkout. The architectural decision in ch.13.06 paid off exactly as designed.
- The runbook
runbook-payments-failure-rate.mdStep 1 (Check) and Step 2 (Diagnose) led Bob to the OOMKill pattern within 9 minutes — fast enough that a competent on-call who had never seen this failure mode could still resolve it without expert help. - Karpenter provisioned the new node in 90 seconds to absorb the scaled-up replica count. The cluster autoscaling design (Part 10 ch.06 — node autoscaling) delivered on its promised behaviour.
- The two-mitigation approach worked. Mitigation 1 (scale-up) reduced the blast radius from 78 % to 31 % within 3 minutes; mitigation 2 (size limit reduction) closed the remaining 31 % within 5 minutes. Mitigation 1 alone wouldn't have been enough; the team correctly identified that more capacity wouldn't fix the underlying parser issue and added mitigation 2.
- Carol took the IC role explicitly at 14:14 and the response improved immediately — status-page updates, customer email, recorder role all flowed from a single coordinator.
- The status-page communication was prompt enough that no enterprise tenant escalated to support before Carol's proactive email reached them.
What went wrong¶
- No alert for OOMKill events. The most important monitoring gap. We learned about the OOMKill at 14:02 via the cascading 5xx alert at 14:03; with a direct OOMKill alert we would have learned at 14:02, 1 minute earlier.
- IC role unclaimed for 12 minutes. The runbook says "if you can't both diagnose AND communicate, escalate to claim an IC"; Bob did not claim, and no other responder did, until Carol joined.
- Status-page entry was 13 minutes late vs. the 15-min SLA target. Inside the SLA but barely; if a similar incident took 4 more minutes to set up, we would have breached our customer-communication contract.
- The 64 MB request-body limit lived for 4 months between the Q1 rewrite and this incident, invisible to every reviewer of every PR in that period. The PR reviewer pattern does not catch cross-service drift.
- No flash-promotion load test was run. The marketing calendar knew Spring Sale was coming; the platform team did not coordinate with marketing on a pre-promotion load-test rehearsal.
- Dashboard didn't show OOMKills. Memory-usage panel showed the
spike RETROACTIVELY but not the OOMKill event itself; Bob had to
kubectl get eventsto find them.
Action items¶
| ID | Description | Owner | Ticket | Due | Priority | Status |
|---|---|---|---|---|---|---|
| A1 | Write a CI lint that asserts storefront.maxRequestBytes <= payments-gateway.maxRequestBytes |
Bob Park | PLAT-1284 | 2026-04-29 | P1 | open |
| A2 | Add BookstorePodOOMKilledRecent PrometheusRule (alerts within 60 sec of any OOMKill in bookstore-platform-* namespaces) |
Dave Lin | PLAT-1285 | 2026-04-22 | P1 | open |
| A3 | Add an OOMKill-event panel to the bookstore-payments Grafana dashboard |
Eve Nakamura | PLAT-1286 | 2026-04-22 | P2 | open |
| A4 | Add a StressChaos memory experiment to the chaos workflow (95 % of memory limit; observe whether pod self-heals or OOMKills) |
Bob Park | PLAT-1287 | 2026-05-13 | P3 | open |
| A5 | Create a pre-promotion readiness checklist (load-test, capacity review, runbook walkthrough) and add it to the marketing-platform coordination doc | Carol Mendez | PLAT-1288 | 2026-05-06 | P2 | open |
| A6 | Re-VPA the payments-gateway memory limit based on the new working-set profile from the rewrite |
Bob Park | PLAT-1289 | 2026-04-29 | P2 | open |
| A7 | Add a "claim the IC role" Slack bot trigger on every P0 page (Incident.io action incident_commander_check) |
Dave Lin | PLAT-1290 | 2026-05-13 | P2 | open |
| A8 | Document the IC role + claim discipline more prominently in runbook-payments-failure-rate.md and link from every P0 runbook |
Carol Mendez | PLAT-1291 | 2026-04-22 | P3 | open |
Closure target: > 80 % closure by the quarterly review on 2026-07-15. The platform lead (Carol) reports closure rate at the monthly postmortem review (see ch.15.11 — postmortem review).
Lessons¶
- Cross-service configuration invariants need automated enforcement. This is not a code-review failure; it is a system design failure. Action item A1 ships the first such check; we will inventory the other cross-service invariants in PLAT-1292 (separate ticket) and ship checks for each.
- OOMKill is a distinct failure mode that deserves its own alert. We had alerts for "pod restarting too often" but not for "pod OOMKilled" — the two failure modes overlap but are not identical. OOMKill alerts give us 30-60 seconds of head start over the cascading 5xx alert.
- The IC role must be claimed within 5 minutes of a P0 page, by ANY responder. Bob did not claim; Carol joined and claimed. The next P0 should not depend on a senior person happening to be online; the on-call's runbook must call this out explicitly.
- Pre-promotion load tests are part of platform-team responsibility, even when marketing owns the promotion. The platform team is the one who pays the page; the platform team is the one who coordinates load-test rehearsals.
- The PDB / replicaCount / outbox pattern combo worked. The resilience controls did exactly what they were designed to do; the failure was a configuration drift the controls could not see.
Related artefacts¶
- Runbooks consulted:
runbook-payments-failure-rate.md - Dashboards consulted:
https://grafana.bookstore-platform.example.com/d/bookstore-payments(illustrative) - Alerts that fired:
BookstoreCheckoutErrorRateHigh(14:03 UTC),BookstoreCheckoutSLOBurnRate(14:08 UTC),BookstoreCheckoutSLOBurnRate(14:21 UTC re-fire) - PagerDuty incident:
Q7N8K2(https://your-org.pagerduty.com/incidents/Q7N8K2) - Incident Slack channel:
#inc-2026-04-15-001 - War-room Zoom recording: stored in Drive; access on request
- Commits / PRs related:
- Q1 rewrite that introduced the 64 MB default: PR #4128 (2026-01-23)
- Mitigation: ConfigMap edit recorded in
runbooks/runbook-payments-failure-rate.md - Fix forward: PR #5921 (2026-04-16; sets explicit 1 MB limit)
- Chaos experiments related: existing
payments-failure(HTTP 500); proposedpayments-oom-stress(Action A4)
Customer communication¶
- Status-page entries:
- v1 (14:09): "Investigating checkout errors"
- v2 (14:16): "Checkout currently degraded; investigating"
- v3 (14:31): "Mitigation deployed; monitoring"
- v4 (14:42): "Resolved"
- Email to affected tenants: drafted 14:35, final 14:36, sent 14:38 to enterprise tenants (acme-books, globex-press, initech-reads, umbrella-publishing) — recipient list: each tenant's incident-contact per CRM.
- Support tickets opened by customers: 38 total; 100 % responded to within 2 hours; 250 customers offered a 10 % discount on next purchase.
- Public-facing postmortem published? Yes — sanitized version published to status page on 2026-04-21 (6 business days, slightly past target; tracked as a process improvement under A5).
Publication checklist¶
- Posted in
#bookstore-platform-postmortemsSlack channel (2026-04-19). - Linked in the platform's GitHub Wiki /
docs/postmortems/INC-2026-04-15-001.md. - Filed in the platform's GitHub Project board (postmortem-tracking).
- Action items have owners + due dates + tracking tickets.
- Discussed at the platform all-hands on 2026-04-23.
- Public-facing version posted to status page on 2026-04-21.
- Customer-success notified; no tenant required additional direct comm beyond the initial enterprise email.
Sign-off¶
| Role | Name | Date |
|---|---|---|
| Incident Commander | Carol Mendez | 2026-04-19 |
| Platform Lead | Carol Mendez | 2026-04-19 |
| Eng Director | Frank Okolo | 2026-04-20 |
| CTO | Grace Iwata | 2026-04-21 |
Reader's takeaway¶
This postmortem demonstrates the discipline the template asks for:
- The 5 Whys went all the way to a system-level cause — the absence of cross-service invariant enforcement — rather than stopping at "Bob should have caught it in code review."
- Every action item has all four fields (owner, ticket, due date, priority). Items without all four don't ship; the platform lead pushes back during postmortem review.
- What-went-right is as long as what-went-wrong. The resilience patterns that worked deserve as much attention as the gaps; if you only document failures, the team forgets why the patterns matter.
- The timeline is dense, not narrative. UTC timestamps, specific commands, specific links. A reader new to this incident can reconstruct what happened in 10 minutes.
- The customer-impact section uses real numbers. $58,000 deferred
- $12,000 credits + $3,400 in discounts = $73,400 total impact. Numbers drive prioritization of the action items in the next sprint.
- The author is the IC, not the on-call who handled the alert. This is by design: the IC has the cross-functional view; the on-call has the technical view; the postmortem needs both, but the IC has the better starting position for the summary + customer-impact sections.