Runbook — hotfix workflow (emergency change procedure)¶
When to reach for this: a P0 incident is live; the normal-flow CI/CD pipeline (lint → test → scan → sign → push → Argo CD sync) is too slow to mitigate. You need to ship code/config in 5-15 minutes, not 45-60. This runbook is the emergency lane: bypass branch protection, use a CI fast-path, ship the fix, then clean up.
The non-negotiable: every hotfix is followed by a postmortem + a return-to-normal-flow PR + a CloudTrail review. Hotfixes are a tool, not a norm. The "hotfix became normal" anti-pattern is what this runbook is most concerned to prevent.
When NOT to use this runbook¶
- P1 / P2 incidents. The normal pipeline is fast enough; the cost of the hotfix (cleanup, audit, postmortem) exceeds the benefit.
- A bad image you want to roll BACK. Use
../rollback/code-rollback-argocd.mdor../rollback/code-rollback-rollouts.md; rollback is faster than hotfix and lower risk. - A bad config you want to revert. Use
../rollback/config-rollback-argocd-app.md. - The fix is unknown. Hotfix is for shipping a KNOWN fix fast. If you don't know the fix yet, you're still in the "diagnose" phase of the incident — open the relevant alert runbook, not this one.
When to use this runbook¶
- P0: production is down for ≥ 50% of customers; OR data loss; OR safety/regulatory.
- The fix is known and small (a one-line code change, a config toggle, a feature-flag flip).
- The normal CI pipeline cannot ship in time (e.g. the test suite takes 30 min; the customer-impact clock is 15).
Pre-flight (≤ 30 seconds)¶
Three quick decisions:
- Can a feature-flag flip mitigate? If yes — flip it FIRST
(
../feature-flags/README.md). 60 seconds to mitigate; no PR needed; no postmortem-debt accrued. Always try this first. - Can a rollback mitigate? If yes — rollback FIRST (see the "When NOT to use" section). 2-5 minutes to mitigate.
- Hotfix is the right tool when (1) and (2) don't apply. A forward-only fix is required (e.g. the bug is in a brand-new feature with no rollback target).
The hotfix lane¶
flowchart LR
inc["P0 incident
open"]
decision{"flag/rollback
first?"}
flag["FLIP FLAG
(60s)"]
rb["ROLLBACK
(2-5m)"]
hf["HOTFIX LANE
(below)"]
inc --> decision
decision -->|"flag covers it"| flag
decision -->|"rollback fixes it"| rb
decision -->|"forward-only fix needed"| hf
flag --> done["mitigated
(then postmortem)"]
rb --> done
hf --> done
Step 1 — Declare a hotfix (≤ 60 seconds)¶
# In #bookstore-platform-status:
# "DECLARING HOTFIX — INC-2026-05-20-001 P0
# Symptom: <one sentence>
# Approver: <name; an SRE Lead OR Eng Manager>
# Cleanup-owner: <name; usually the hotfix author>
# Expected merge ETA: <HH:MM UTC>"
The approver is named upfront. Without an approver, the bypass buttons in Step 3 / Step 4 should NOT be used. The cleanup-owner is responsible for the return-to-normal-flow PR within 24 hours and the postmortem within 48.
Step 2 — Write the fix (≤ 5 minutes)¶
# Branch off main / prod with the conventional hotfix prefix.
git checkout -b hotfix/INC-2026-05-20-001-catalog-nil-deref
# Apply the smallest possible change. ONE concern per hotfix.
# DO NOT bundle "while I'm in here" cleanups; they bypass review.
git diff
# Should be a small, surgical diff. If > 50 lines, you're doing too
# much in one hotfix.
git commit -m "hotfix(catalog): nil-deref on missing tenant header
INC-2026-05-20-001
approver: @alice
cleanup: @bob will open follow-up PR with tests within 24h."
The commit message MUST include the incident ID and the approver handle — the audit trail starts here.
Step 3 — Branch protection bypass¶
Branch protection rules normally require: - ≥ 1 reviewer approval - All required status checks (lint, test, scan, signature) passing - Up-to-date with base
For the hotfix lane, repo admins can bypass via GitHub's "Merge
without review" option. The Bookstore Platform repo's main branch
allows bypass for the @platform-admins team only; admin actions
post to #bookstore-platform-audit.
# Push the branch.
git push origin hotfix/INC-2026-05-20-001-catalog-nil-deref
# Create the PR.
gh pr create \
--title "HOTFIX: INC-2026-05-20-001 catalog nil-deref on missing tenant header" \
--body "P0 incident — bypassing normal review.
Approver: @alice
Cleanup-owner: @bob
Cleanup PR: <will follow within 24h>
Postmortem: <will follow within 48h>
The fix is single-line; the bug is documented inline.
"
Do not click "Merge" yet — the CI fast-path runs first (Step 4).
Step 4 — CI fast-path¶
The fast-path workflow (.github/workflows/hotfix-fastpath.yml; see
examples/bookstore-platform/ci/ for the canonical app pipeline)
runs ONLY:
- Lint (gofmt, golangci-lint) — < 2 min.
- Unit tests for the changed package only — < 2 min.
scan(Trivy / Snyk) — KEPT even in fast-path; never bypassed.- Build + cosign sign + push to ECR — < 4 min.
The fast-path SKIPS: - Full integration test suite (the slow piece; 15-30 min). - E2E browser tests. - Performance regression tests.
These re-run on the cleanup PR (Step 7).
Trigger:
# The fast-path workflow runs on PRs labeled `hotfix`.
gh pr edit <PR_NUMBER> --add-label hotfix
# This kicks off .github/workflows/hotfix-fastpath.yml; takes ~5 min.
gh pr checks <PR_NUMBER>
# Wait for: lint, unit-changed-package, scan, build-sign-push. All green.
If scan fails on a new CVE — stop. A hotfix that introduces a
CVE is not a hotfix. Flag flip / rollback instead.
Step 5 — Merge + sync¶
# Admin merge (bypasses required reviewers).
gh pr merge <PR_NUMBER> --admin --squash
# Audit log entry posts to #bookstore-platform-audit:
# "@alice merged PR #4321 (HOTFIX:...) via admin bypass"
# Argo CD sync — usually auto-sync picks it up in <60s, but for a
# hotfix, force the sync to skip the wait.
argocd app sync bookstore-catalog-us-east --prune
argocd app wait bookstore-catalog-us-east --health --sync
# Application healthy + synced -> mitigation in flight.
# (If catalog is deployed via Argo Rollouts canary, the Rollout will
# ramp 10% → 25% → 50% → 100% over 10-30 minutes. For P0 hotfixes,
# consider `kubectl argo rollouts set image` + `promote --skip-current-step`
# to skip the canary entirely. The trade-off: faster mitigation, no
# canary safety net.)
Step 6 — Verify¶
Same as any rollback: re-query the metric that triggered the incident; confirm recovery.
# Symptom metric, e.g. error rate:
kubectl -n prometheus-system exec -ti prometheus-kube-prometheus-stack-prometheus-0 -- \
promtool query instant http://localhost:9090 \
'sum(rate(http_requests_total{service="catalog",code=~"5.."}[5m])) / sum(rate(http_requests_total{service="catalog"}[5m]))'
# Expect: drops below 0.01 within 2-3 minutes.
If the symptom does NOT recover, the hotfix was wrong. Roll back the hotfix (it's just a commit; revert + sync). Do not chain another hotfix; treat the situation as a new incident.
Step 7 — Cleanup (within 24 hours)¶
The cleanup-owner (named in Step 1) opens a follow-up PR:
git checkout -b cleanup/INC-2026-05-20-001-catalog-tests
# Add the tests that would have caught the bug.
# Add the documentation.
# Add the integration test that runs in the slow pipeline.
# Address any "while I was there" issues SEPARATELY from the hotfix.
gh pr create \
--title "Cleanup: INC-2026-05-20-001 catalog hotfix - tests + docs" \
--body "Follows hotfix #4321 in INC-2026-05-20-001.
Includes:
- Regression test for the nil-deref (unit + integration).
- Inline docs explaining the missing-header path.
- A Kyverno policy to prevent the class of bugs (Tier 4 example only).
Reviewers: standard flow; no bypass.
"
This PR runs the FULL pipeline (no fast-path); reviewers approve; merge through the normal flow. The cleanup PR is mandatory.
Step 8 — Postmortem (within 48 hours)¶
Open ../runbooks/postmortem-template.md. Mandatory hotfix-specific
sections:
- Why was a hotfix needed? "Normal CI is too slow" is not a root cause; "the test suite is 30 min; we need it to be 10 min" is an action item.
- Was branch protection bypass justified? If 80% of postmortems answer "yes, but the hotfix could have been a rollback", → action item: train the team on the rollback playbook.
- Was the breakglass IAM role used? If so, see
cleanup-after- breakglass.mdfor the credential-rotation requirements. - What's the cleanup PR status? Linked + merged by the postmortem deadline (48h).
Step 9 — Audit log review (within 1 week)¶
The hotfix admin-merge + breakglass actions appear in:
- GitHub's audit log (@platform-admins actions).
- CloudTrail (every AWS API call made under the breakglass role).
- #bookstore-platform-audit (every admin action posts here via
GitHub App).
A platform-team member (NOT the hotfix author) reviews:
- Every
admin mergein the hotfix window — was the approver correct? - Every breakglass action — was the scope minimal?
- Any unrelated actions in the breakglass session — investigate.
Findings go into the postmortem.
The footguns¶
"I used breakglass and forgot to clean up"¶
The breakglass role's TTL (cleanup-after-breakglass.md) is 1 hour.
After expiry, the credentials are auto-rotated by the Vault auth
method. But: the artifacts the breakglass user created (RDS
snapshots, IAM users, S3 buckets) outlive the role. The cleanup
checklist in cleanup-after-breakglass.md is mandatory.
"The hotfix became normal"¶
If the same engineer ships > 1 hotfix per quarter, the team has a process bug, not a tool. Mitigations: - The platform-team's quarterly review counts hotfix PRs by author; > 4 per quarter is a code-yellow conversation. - A rate-limit on hotfixes: > 3 hotfixes in a 7-day window triggers a mandatory team review BEFORE the next one. - A CI investment: most "we hotfixed because CI is too slow" has the same fix — make CI faster (parallelise tests, cache deps, remove flaky ones). This is the unsexy long-term fix.
"I bypassed scan because it failed"¶
The fast-path NEVER skips scan. A hotfix that ships a known CVE
is a security event in addition to the original incident.
"I shipped multiple unrelated changes in one hotfix"¶
One concern per hotfix. Bundling "while I was there" changes bypasses review for those changes; that's a process bug masked by the incident.
Quick reference¶
# 1. Declare in Slack with approver.
# 2. Branch + commit + push.
# 3. Open PR; add `hotfix` label.
# 4. Wait for fast-path CI (~5 min); all green.
# 5. Admin-merge; force Argo CD sync.
# 6. Verify the metric recovers.
# 7. Open cleanup PR within 24h.
# 8. Postmortem within 48h.
# 9. Audit log review within 1 week.
Related runbooks¶
cleanup-after-breakglass.md— for any hotfix that USED the breakglass IAM role.breakglass-iam-policy.json— the IAM policy for the breakglass role.../rollback/code-rollback-argocd.md— the rollback path (try this first; it's faster than hotfix).../feature-flags/README.md— the kill-switch path (try this first; it's faster than rollback).../runbooks/postmortem-template.md— the postmortem template for the 48h writeup.
When this runbook last worked¶
| Date | Incident | Resolved by | Notes |
|---|---|---|---|
| 2026-04-02 | INC-2026-04-02-001 catalog nil-deref | Hotfix; 14 min mitigate-to-merge | flag flip would have worked too; hotfix author chose forward fix |
| 2026-01-29 | INC-2026-01-29-002 orders saga lock | Rollback; hotfix path NOT used | postmortem confirmed rollback was correct call |
Stale after 90 days without exercise. Quarterly chaos game-day SHOULD include a hotfix-lane drill (a fake P0 + the team runs the full sequence in dry-run mode).