Runbook — BookstoreCNPGReplicationLag (P1)¶
CNPG replication lag exceeded 30 seconds for 5 minutes. The standby region's view of writes is stale; a regional failover NOW would lose data.
Alert¶
- Name:
BookstoreCNPGReplicationLag - Severity:
page(P1) - Source:
examples/bookstore-platform/observability/prometheus-rules.yaml - Query (PromQL):
cnpg_pg_replication_lag > 30 - Dashboard: https://grafana.bookstore-platform.example.com/d/cnpg-overview
Step 1 — Check (< 60s)¶
# Is the alert real?
kubectl --context kind-bookstore-platform-${REGION} \
-n cnpg-system exec bookstore-platform-cnpg-1 -- \
psql -U postgres -c "
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
extract(epoch from (now() - reply_time))::int AS lag_seconds
FROM pg_stat_replication;
"
# application_name lag_bytes lag_seconds
# eu-west-replica 128 MB 45 <- > 30s = alert real
# Are both CNPG clusters healthy?
kubectl --context kind-bookstore-platform-${REGION} \
-n cnpg-system get cluster
kubectl --context kind-bookstore-platform-eu-west \
-n cnpg-system get cluster
Decision tree: - Lag < 30 s now → flapping; silence + investigate. - Lag > 30 s, both clusters healthy → continue to Step 2. - Lag > 30 s, replica cluster degraded → Step 3 (Mitigate).
Step 2 — Diagnose (ordered)¶
2a. Network latency between regions¶
# Run a quick latency probe from the primary's pod to the replica's
# replication endpoint.
kubectl --context kind-bookstore-platform-${REGION} \
-n cnpg-system exec bookstore-platform-cnpg-1 -- \
curl -s -o /dev/null -w "%{time_total}s\n" \
https://cnpg-eu-west.bookstore-platform.example.com:5432
# > 200ms RTT cross-region = expected for transatlantic; > 1s = network
# issue; jump to 3a.
2b. Replica I/O saturation¶
kubectl --context kind-bookstore-platform-eu-west \
-n cnpg-system exec bookstore-platform-cnpg-1 -- \
iostat -x 1 5
# Watch %util on the data disk. > 90 % sustained = I/O bottleneck;
# the replica cannot keep up with WAL replay.
2c. Long-running query on the replica¶
kubectl --context kind-bookstore-platform-eu-west \
-n cnpg-system exec bookstore-platform-cnpg-1 -- \
psql -U postgres -c "
SELECT pid, state, query_start, query FROM pg_stat_activity
WHERE state='active' AND now() - query_start > interval '1 minute';
"
# A read query on the replica can block WAL replay if it locks rows
# WAL replay wants to update. The PG `hot_standby_feedback` setting
# determines who wins; v2 defaults to feedback=on (replica's query
# wins, primary's vacuum waits).
2d. Primary write spike¶
# Is the primary writing more than usual?
kubectl --context kind-bookstore-platform-${REGION} \
-n prometheus-system exec -ti prometheus-kube-prometheus-stack-prometheus-0 -- \
promtool query instant http://localhost:9090 \
"rate(cnpg_pg_stat_bgwriter_wal_bytes_total[5m])"
# A 10x spike = a batch job. Jump to 3c (throttle the batch).
2e. WAL retention¶
# Has the primary moved past the replica's slot?
kubectl --context kind-bookstore-platform-${REGION} \
-n cnpg-system exec bookstore-platform-cnpg-1 -- \
psql -U postgres -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn, active FROM pg_replication_slots;"
# active = false on a slot = the replica disconnected; reconnect
# needed. CNPG should handle this; if not, kick the replica's pod.
Step 3 — Mitigate¶
3a. Cross-region network issue¶
# Likely network-side; communicate to networking team; the platform
# response is to NOT failover (the replica is too far behind).
# Page the on-call networking SRE.
# Block any planned failovers until the alert clears.
3b. Restart the replica (replica-side stuck)¶
# Trigger a restart of the replica's primary pod (CNPG promotes/
# demotes as needed).
kubectl --context kind-bookstore-platform-eu-west \
-n cnpg-system delete pod bookstore-platform-cnpg-1
# CNPG re-creates; replication resumes.
3c. Throttle a primary write spike¶
# If 2d shows a batch job, throttle it. Common culprits:
# - search reindex job (jump to runbook-search-reindex.md)
# - Debezium snapshot (the outbox-table snapshot)
# - a bulk-import script
# Pause Debezium connector (if it's the cause):
kubectl --context kind-bookstore-platform-${REGION} \
-n debezium exec debezium-connect-0 -- \
curl -s -X PUT http://localhost:8083/connectors/bookstore-outbox/pause
# Resume once the lag drops.
3d. Failover to a different replica (if multi-replica)¶
If we have a third replica in a third region, fail over to that one (do not fail over to the lagging replica). DR script:
bash examples/bookstore-platform/runbooks/dr-drill-script.sh \
--region ${REGION} \
--target-region <HEALTHY-REPLICA-REGION>
Safety check¶
# Re-query the lag.
kubectl --context kind-bookstore-platform-${REGION} \
-n cnpg-system exec bookstore-platform-cnpg-1 -- \
psql -U postgres -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;"
# Falling = working; not falling = back to Step 2.
Step 4 — Communicate¶
- P1: announce in
#bookstore-platform-statuswithin 1 hour. Customers are not directly affected (writes still succeed on primary); the comm is to inform the team that a failover is not safe until this clears. - P0 escalation: if the lag prevents a planned failover during a separate incident.
Step 5 — Postmortem¶
- Required within 48 hours.
- Template:
examples/bookstore-platform/runbooks/postmortem-template.md. - Special section: was the RPO target breached? (For us, > 5 min lag = RPO breach.)
Common false positives¶
- Maintenance window. CNPG runs a scheduled
VACUUMon Sundays at 02:00 UTC; lag can spike to 60 s briefly. The alert'sfor: 5mfilters most of these; if not, expand tofor: 10m. - WAL archiving slow. A slow S3 endpoint can delay WAL archiving;
CNPG sometimes reports this as lag (it isn't, exactly). Mitigation:
ensure S3 endpoint resolves correctly; the chapter's
endpoint-url: s3.us-east-1.amazonaws.comis the regional endpoint (cross-region writes through a global endpoint add ~50ms).
Related runbooks¶
runbook-api-latency-p99.md— when the lag manifests as slow API queries.runbook-payments-failure-rate.md— when the outbox publisher stalls due to lag.dr-drill-script.sh— DO NOT run during this alert.
When this runbook last worked¶
| Date | Region | Lag peak | Cause | Resolved by |
|---|---|---|---|---|
| 2026-04-15 | eu-west | 4m | Sunday VACUUM | self-resolved in 12m |
| 2026-02-09 | ap-southeast | 8m | network blip | Step 3b (replica restart) |
| 2026-01-20 | eu-west | 25m | Debezium snapshot run | Step 3c (pause connector) |
If older than 90 days, STALE; rehearse before trusting.