Runbook — BookstoreCatalogP99Latency (P1)¶

Alert fired because catalog p99 latency exceeded 500 ms for 5 minutes in at least one tenant + region. This page is the on-call's working memory — execute the steps in order; do not skip.

Alert¶

Name: BookstoreCatalogP99Latency
Severity: page (P1)
Source: examples/bookstore-platform/observability/prometheus-rules.yaml

Query (PromQL):

bookstore:http_latency_p99:by_tenant_service{service="catalog"} > 0.5

Dashboard: https://grafana.bookstore-platform.example.com/d/bookstore-overview?var-service=catalog
Labels (in the page): tenant, region — tells you the blast radius.

Step 1 — Check (< 60s)¶

Confirm the alert is real and current, not a flapping or stale one.

# Re-query the metric directly (skips the recording rule layer).
kubectl --context kind-bookstore-platform-${REGION} \
  -n prometheus-system exec -ti prometheus-kube-prometheus-stack-prometheus-0 -- \
  promtool query instant http://localhost:9090 \
  "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service=\"catalog\",tenant=\"${TENANT}\",region=\"${REGION}\"}[5m])))"
# Expected output: a single float. If > 0.5, alert is real.

# Are the catalog pods Running?
kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} get pods -l app=catalog -o wide
# All Running 1/1 -> alert real; not deploy-in-progress.

Decision tree: - Latency < 500 ms now → flapping; check the alert's for: duration (5m) is sane; silence for 30 min; investigate next business day. - Latency > 500 ms + all Pods Running → continue to Step 2. - Latency > 500 ms + some Pods CrashLoopBackOff → jump to Step 3 (Mitigate), scale-up branch.

Step 2 — Diagnose (ordered)¶

Localise the cause: DB? CPU? GC? Network? Downstream?

2a. CPU / memory saturation¶

kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} top pods -l app=catalog
# NAME                       CPU(cores)   MEMORY(bytes)
# catalog-7c8d9-abc12        450m         320Mi
# catalog-7c8d9-def34        890m / 1000m 510Mi          <- near CPU limit

# Cross-reference against the resource limits.
kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} get pod -l app=catalog -o jsonpath='{.items[*].spec.containers[*].resources}'

If CPU usage > 80 % of limit → CPU starvation. Jump to Step 3 (scale-up).

2b. Database health¶

# Replication lag from the v2 CNPG cluster.
kubectl --context kind-bookstore-platform-${REGION} \
  -n cnpg-system exec bookstore-platform-cnpg-1 -- \
  psql -U postgres -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;"
# Lag > 30 MB -> replication problem; cross-ref runbook-database-replication-lag.md

# Connection pool exhaustion.
kubectl --context kind-bookstore-platform-${REGION} \
  -n cnpg-system exec bookstore-platform-cnpg-1 -- \
  psql -U postgres -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Many 'idle in transaction' -> connection leak in the catalog code.

# Slow queries.
kubectl --context kind-bookstore-platform-${REGION} \
  -n cnpg-system exec bookstore-platform-cnpg-1 -- \
  psql -U postgres -c "SELECT query, total_exec_time, calls FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"

2c. Downstream calls (search, recommendations)¶

# Look at the catalog spans in Tempo for the failing tenant.
# Grafana -> Explore -> Tempo:
#   { service.name = "catalog" && resource.tenant = "${TENANT}" && duration > 500ms }
# What span dominates? If it is a search-svc call -> jump to
# runbook-search-latency.md.

2d. GC / runtime pauses¶

# Pull catalog logs; look for Go GC pause logs.
kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} logs -l app=catalog --tail=200 | grep "GC paused"
# > 100ms -> heap pressure; the workload's memory limit needs review.

2e. Network — Istio sidecar issues¶

# Check sidecar restart count.
kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} get pods -l app=catalog \
  -o jsonpath='{.items[*].status.containerStatuses[?(@.name=="istio-proxy")].restartCount}'
# > 5 in the last hour -> sidecar instability; jump to runbook-istio-sidecar.md.

Step 3 — Mitigate¶

The action that stops the bleeding. Run ONE; do not chain mitigations without re-checking the metric.

3a. Scale up (CPU starvation path)¶

# Increase catalog replicas. Safety: this hits the tenant's
# ResourceQuota (Crossplane composition stamps 10 CPU / 20 Gi memory).
# Confirm headroom first.
kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} describe resourcequota
# Used  CPU: 5 / 10  Memory: 10Gi / 20Gi  -> room to scale
#
kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} scale deployment/catalog --replicas=6
# Watch:
kubectl --context kind-bookstore-platform-${REGION} \
  -n bookstore-platform-${TENANT} rollout status deployment/catalog
# Re-check latency in 5 min.

3b. Rollback (a recent deploy caused it)¶

# Check ArgoCD for a recent sync. If a deploy landed within the last
# 30 min and the alert fired AFTER, rollback.
argocd app history bookstore-catalog-${REGION}
# Pick the previous revision.
argocd app rollback bookstore-catalog-${REGION} <REVISION>
# Watch the sync + re-check latency.

3c. Failover the region (P0 escalation)¶

If the entire region is degraded AND the latency is causing customer checkout failures, escalate to P0 and execute the DR runbook:

# Page the platform team. Open the DR drill script (extended for
# real failover).
bash examples/bookstore-platform/runbooks/dr-drill-script.sh \
  --region ${REGION} \
  --target-region <NEXT-REGION> \
  --dry-run=false
# Customer comm required within 15 min.

Safety check (every mitigation)¶

After each action, re-query the metric. If latency drops below 500 ms, the mitigation worked; if not, return to Step 2 and pick the next branch.

# Re-query
kubectl --context kind-bookstore-platform-${REGION} \
  -n prometheus-system exec -ti prometheus-kube-prometheus-stack-prometheus-0 -- \
  promtool query instant http://localhost:9090 \
  "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service=\"catalog\",tenant=\"${TENANT}\",region=\"${REGION}\"}[5m])))"

Step 4 — Communicate¶

P1: customer comm via Slack #bookstore-platform-status within 1 hour. Subject: "Catalog latency elevated for tenant ${TENANT} in ${REGION} — investigating."
P0 escalation: also email tenant primary contact; status-page update.

Step 5 — Postmortem¶

Open the template: examples/bookstore-platform/runbooks/postmortem-template.md.
Due within 48 hours for a P1.
Action items required + tracked in the platform's GitHub Project board.

Common false positives¶

Cold start spike after deploy. A new replica's first 60s of requests can spike p99. Mitigation: for: 5m already covers; if flapping, increase to for: 10m.
One-off slow query from a debug endpoint. A diagnostic endpoint someone called manually. Mitigation: silence the alert for 30 min if confirmed; do not change the threshold.
Cluster autoscaling churn. New nodes coming online during a scale-up. The Pod startup time spikes latency briefly.

runbook-payments-failure-rate.md — when the catalog calls flow into payment failures.
runbook-database-replication-lag.md — when the slow query is at the DB layer.
dr-drill-script.sh — the regional failover script for the P0 escalation path.

When this runbook last worked¶

Date	Tenant	Region	Resolved by	Notes
2026-04-12	acme-books	us-east	Step 3a (scale to 6)	mid-month traffic spike
2026-03-04	foo-books	eu-west	Step 3b (rollback)	bad query in catalog-1.4.2

If this table is older than 90 days, the runbook is STALE; rehearse it at the next chaos game-day before trusting.