03 — Troubleshooting playbook¶
A systematic method (observe → isolate → hypothesize → test → fix;
describe→ events → logs →kubectl debug) and a decision tree per symptom:Pending,ImagePullBackOff,CrashLoopBackOff,OOMKilled,Evicted,CreateContainerConfigError, readiness-fail / no Endpoints, DNS failures, NetworkPolicy block, PSA rejection. The centerpiece — debugging the Bookstore's distroless containers:kubectl debugephemeral containers and pod copies with--profile=restricted(PSA-restricted-compatible), explicitly contrasted with whykubectl exec catalog -- shcannot work (distroless: no shell). Applied by deliberately inducing four failures in the live Bookstore without editing any canonical manifest, diagnosing each with the method, and reverting each so the cluster returns to known-good.
Estimated time: ~30 min read · ~90 min hands-on
Prerequisites: Part 01 ch.02 — probes and lifecycle as the failure signal · Part 06 ch.01 — alerts are the trigger; this chapter is the response · Part 02 ch.06 — NetworkPolicy is one symptom in the decision tree
You'll know after this: • run the observe → isolate → hypothesize → test → fix loop on a live failure · • read events, logs and describe output to diagnose Pending / ImagePullBackOff / CrashLoopBackOff / OOMKilled · • use kubectl debug ephemeral containers and --profile=restricted against distroless images · • diagnose readiness-fail, NetworkPolicy block, PSA rejection and DNS failure separately · • induce four Bookstore failures, fix each with the method, and revert to known-good
Why this exists¶
Every prior part built the Bookstore correctly. Production is mostly the
opposite activity: something that was working isn't, an alert fired
(Part 06 ch.01), and
you have minutes to find out why across Pods, Services, config, network policy,
and admission. The reflex — "tail the logs", "kubectl exec in and poke" —
fails in exactly this codebase: catalog, orders, and payments-worker
are distroless gcr.io/distroless/static:nonroot images
(Part 00 ch.02) with no shell,
no ps, no curl — kubectl exec catalog -- sh returns
exec: "sh": executable file not found. The correct tool is kubectl debug,
and the bookstore namespace's PSA-restricted enforcement constrains that
too.
Two failure modes make this its own chapter:
- No method → flailing. Without a fixed sequence (events before logs,
describebefore guesses, isolate before fix) an incident becomes randomkubectlcommands. A decision tree per symptom turns "it's broken" into a short, deterministic path to the cause. - The wrong debug tool for the workload.
kubectl exec … shis muscle memory and it fails on the Bookstore's own services. The right answer —kubectl debugwith an ephemeral container sharing the target's process/network namespace, on the--profile=restrictedprofile so PSA admits it — is a distinct skill this chapter teaches against the real app.
This is the operator's playbook: a method, a per-symptom tree, the correct
distroless debugging, and a hands-on that breaks the live app safely and puts
it back. The references are Production Kubernetes (Observability) and Lukša's
debugging material; kubectl debug is cited from the official docs.
Mental model¶
Triage is a fixed pipeline; the symptom selects the branch; kubectl debug
is how you get a shell when the container has none.
- One method, always the same order. Observe (what is the symptom —
kubectl getphase/status) → isolate (which object/layer — Pod? Service? config? network? admission?) → hypothesize (what would produce this symptom) → test (one cheap check that confirms/refutes) → fix (smallest change) → verify (the symptom is gone, nothing else broke). Skipping "isolate" is how you spend an hour on the wrong layer. describe→ events → logs → debug, in that order.kubectl describe(and its Events) explains scheduling/lifecycle failures (Pending, ImagePull, config errors) before any log exists.kubectl logs --previousexplains in-container failures (CrashLoop, bad config the app rejected).kubectl debugis the last resort, for when you need a shell, tools, or the process/network view the container itself can't give you.- The symptom is the index into the tree.
Pending≠CrashLoopBackOff≠ImagePullBackOff— each has a small, specific set of causes and a deterministic check order. You don't "debug Kubernetes"; you follow the branch the Pod's status names. - Distroless changes the debug primitive, not the method.
cataloghas no shell, so step "get a shell to look around" becomeskubectl debug -it <POD> --image=<TOOLING> --target=<CTR>— an ephemeral container injected into the running Pod, sharing its process and network namespaces, with tools the distroless image lacks. The method is unchanged; only the tool that realises "look inside" differs. - PSA-
restrictedconstrains the debugger too. An ephemeral container joins the target Pod's namespace, so inbookstoreit must itself satisfyrestricted(Part 05 ch.02).kubectl debug --profile=restricted(GA 1.30) shapes the debug/ephemeral container (runAsNonRoot, drop ALL, seccomp RuntimeDefault) so admission accepts it — using a bare--image=busyboxwithout that profile is rejected by PSA.
The trap to internalise: kubectl exec catalog -- sh is the wrong reflex in
this codebase and it will waste your first two minutes. The Bookstore's Go
services are distroless on purpose (tiny attack surface, fewer CVEs — Part 00
ch.02); the cost is paid back
here, in knowing the debug primitive that actually works on them.
Diagrams¶
Diagram A — the master troubleshooting decision tree (Mermaid)¶
Symptom → check → cause → fix. Every branch is exercised in the Hands-on.
flowchart TD
start["Alert / 'it's broken'
kubectl get pods -n bookstore"]
start --> phase{"Pod status?"}
phase -->|Pending| pend["kubectl describe pod → Events"]
pend --> pendc{"Why unschedulable?"}
pendc -->|Insufficient cpu/mem| f1["raise limits / scale node /
check ResourceQuota (Part 04/05)"]
pendc -->|no node matches
affinity/taint| f2["fix nodeSelector/affinity/
toleration (Part 04)"]
pendc -->|PVC unbound /
WaitForFirstConsumer| f3["check PVC/StorageClass
(Part 03 ch.04)"]
pendc -->|"violates PodSecurity restricted"| f4["fix securityContext
(Part 05 ch.02)"]
phase -->|ImagePullBackOff| ipb["describe → Events: image ref"]
ipb --> ipbc{"Cause?"}
ipbc -->|typo / wrong tag| f5["fix image: ; kubectl set image"]
ipbc -->|"not kind-loaded"| f6["kind load docker-image"]
ipbc -->|private registry| f7["imagePullSecrets / IRSA"]
phase -->|CrashLoopBackOff| clb["kubectl logs --previous"]
clb --> clbc{"Why exiting?"}
clbc -->|bad/missing config| f8["fix ConfigMap/Secret/env"]
clbc -->|dependency down| f9["check the dep (DB/redis/mq)"]
clbc -->|OOM exit 137| f10["→ OOMKilled branch"]
phase -->|"Running, not Ready"| nr["describe → readiness probe
+ kubectl get endpoints"]
nr --> nrc{"Cause?"}
nrc -->|probe path/port wrong| f11["fix readinessProbe"]
nrc -->|"Endpoints empty:
selector ≠ labels"| f12["align Service selector /
pod labels (Part 02 ch.02)"]
nrc -->|NetworkPolicy blocks probe/dep| f13["add the allow
(Part 02 ch.06)"]
phase -->|"OOMKilled / Evicted /
CreateContainerConfigError"| other["describe → Reason"]
other --> otherc{"Which?"}
otherc -->|OOMKilled 137| f14["limit < working set:
raise mem limit (Part 01 ch.03)"]
otherc -->|Evicted| f15["node pressure: capacity /
requests (Part 06 ch.06)"]
otherc -->|"CreateContainerConfigError"| f16["missing ConfigMap/Secret KEY"]
note["Need a shell/tools/process view?
distroless catalog/orders → NOT exec sh
kubectl debug --profile=restricted"]
clb -.-> note
nr -.-> note
Diagram B — per-symptom quick table (ASCII)¶
SYMPTOM → FIRST CHECK → LIKELY CAUSE → FIX (xref = where it's taught) ──────
STATUS / Reason FIRST COMMAND LIKELY CAUSE → FIX
───────────────────────────────────────────────────────────────────────────
Pending describe pod → Events no fit: resources/
quota/affinity/taint/
PVC/PSA → fix that
(Part 04/05/03)
ImagePullBackOff / describe → Events (image) bad tag | not
ErrImagePull kind-loaded | private
→ fix ref / kind load
CrashLoopBackOff logs --previous bad config | dep down
| OOM → fix cause
OOMKilled (exit 137) describe → Last State limit < working set
→ raise mem limit
(Part 01 ch.03)
Evicted describe node → Conditions node mem/disk
pressure → capacity/
requests (P06 ch.06)
CreateContainerConfigError describe → Events missing ConfigMap/
Secret KEY → add key
Running, 0/1 READY describe → Readiness + probe wrong | no
get endpoints <SVC> Endpoints (selector
≠ labels) → align
(Part 02 ch.02)
No Endpoints get endpoints; get pods Service selector ≠
--show-labels pod labels → fix
DNS failure (name lookup) debug → nslookup; check CoreDNS down |
ndots/resolv.conf egress-DNS denied
(Part 02 ch.03/06)
Conn refused/timeout to dep debug → nc; get netpol NetworkPolicy missing
an allow (BOTH ends)
(Part 02 ch.06)
"violates PodSecurity the apply/admission error securityContext not
restricted" itself restricted → fix
(Part 05 ch.02)
NEED A SHELL on catalog/orders/payments-worker (DISTROLESS — no shell)?
✗ kubectl exec catalog -- sh → "sh: not found" (WRONG reflex)
✓ kubectl debug -it <POD> --image=nicolaka/netshoot \
--target=<CTR> --profile=restricted (ephemeral container)
✓ kubectl debug <POD> --copy-to=<DBG> \
--set-image=<CTR>=busybox --profile=restricted (debug a COPY)
✓ kubectl debug node/<NODE> -it --profile=sysadmin (node-level)
Hands-on with the Bookstore¶
Assumed working directory: the guide repo root (full-guide/). This
chapter adds no manifests. It induces failures in the live Bookstore
using kubectl set image / kubectl patch / kubectl set env / a throwaway
copied manifest — never by editing raw-manifests/* (those stay the
known-good source of truth) — diagnoses each with the method, and reverts
each so the cluster is returned to known-good.
Discipline note (read this first). Every induced failure below is applied with an imperative command or a throwaway copy, and every one shows its exact revert. No file under
examples/bookstore/is modified. After the chapter the Bookstore is byte-for-byte the canonical app — verified by a finalkubectl rollout statuson every workload.
0. Prerequisites — fresh cluster + images + the known-good app¶
kind delete cluster --name bookstore 2>/dev/null || true
kind create cluster --name bookstore
cd examples/bookstore/app
for s in catalog orders payments-worker storefront; do docker build -t bookstore/$s:dev ./$s; done
cd ../../..
for s in catalog orders payments-worker storefront; do kind load docker-image bookstore/$s:dev --name bookstore; done
kubectl apply -f examples/bookstore/raw-manifests/00-namespace.yaml
kubectl apply -f examples/bookstore/raw-manifests/05-serviceaccounts-rbac.yaml
kubectl apply -f examples/bookstore/raw-manifests/15-catalog-config.yaml
kubectl apply -f examples/bookstore/raw-manifests/16-db-credentials.yaml
kubectl apply -f examples/bookstore/raw-manifests/35-priorityclasses.yaml
kubectl apply -f examples/bookstore/raw-manifests/12-redis.yaml
kubectl apply -f examples/bookstore/raw-manifests/13-rabbitmq.yaml
kubectl apply -f examples/bookstore/raw-manifests/20-postgres-statefulset.yaml
kubectl apply -f examples/bookstore/raw-manifests/40-services.yaml
kubectl apply -f examples/bookstore/raw-manifests/10-catalog-deploy.yaml
kubectl apply -f examples/bookstore/raw-manifests/11-storefront-deploy.yaml
kubectl apply -f examples/bookstore/raw-manifests/14-orders-deploy.yaml
kubectl apply -f examples/bookstore/raw-manifests/19-payments-worker-deploy.yaml
kubectl apply -f examples/bookstore/raw-manifests/21-db-migrate-job.yaml
# the migration Job must COMPLETE (creates the `books` schema) before
# catalog/orders can become Ready — wait for it BEFORE the deploy wait:
kubectl wait --for=condition=complete job/db-migrate -n bookstore --timeout=120s
kubectl wait --for=condition=available deploy --all -n bookstore --timeout=180s
kubectl get pods -n bookstore # ALL Running/Ready — the baseline
Self-bootstrapping note. After any
kind delete && kind createre-kind loadthe four images and re-run this chain — the failures below assume this exact known-good baseline so each revert provably returns to it.
1. The distroless reality check (do this once — it reframes everything)¶
Before inducing anything, prove why the rest of the chapter uses kubectl
debug:
kubectl exec -n bookstore deploy/catalog -- sh
# error: ... exec: "sh": executable file not found in $PATH
# catalog is distroless (gcr.io/distroless/static:nonroot, Part 00 ch.02):
# NO shell, NO coreutils, NO package manager. `exec ... sh` CANNOT work here.
kubectl exec -n bookstore statefulset/postgres -- sh -c 'echo postgres HAS a shell'
# postgres HAS a shell ← the official postgres image is NOT distroless; exec
# sh works there (and for redis/rabbitmq). The Bookstore is mixed: know which.
# (postgres is a StatefulSet, not a Deployment — use `statefulset/postgres`
# or the pod `postgres-0`; `deploy/postgres` would error: not found.)
The correct primitive for the distroless services — an ephemeral container,
PSA-restricted-shaped:
kubectl debug -it -n bookstore deploy/catalog \
--image=nicolaka/netshoot --target=catalog --profile=restricted -- /bin/bash
# • --target=catalog → share catalog's PROCESS namespace (see its PID, /proc,
# its env via /proc/1/environ) and NETWORK namespace (curl ITS localhost).
# • --profile=restricted (GA 1.30) → the ephemeral container is shaped
# runAsNonRoot + drop ALL + seccomp RuntimeDefault, so the bookstore ns's
# PSA `restricted` ADMITS it. Without the profile, PSA rejects it:
# Error ... violates PodSecurity "restricted:latest"
# Inside: `curl -s localhost:8080/healthz`, `nslookup postgres`,
# `nc -vz postgres 5432`, `ps aux` (sees catalog's PID 1). Exit; the
# ephemeral container is gone, the Pod is otherwise untouched.
Why not just
kubectl debug … --image=busybox? busybox does not run as non-root by default, so PSArestrictedrejects it inbookstore. Either use--profile=restricted(shapes it to comply — preferred) or debug a copy in thedefaultnamespace (no PSA):kubectl debug -n bookstore catalog-xxxx --copy-to=dbg --namespace=default --set-image=catalog=busyboxThe profile is the clean answer; the copy-to-
defaultis the fallback when you need a tool image that can't be made restricted-compliant.
2. Induced failure A — ImagePullBackOff (and its revert)¶
# INDUCE (imperative — canonical 10- untouched): point catalog at a bad tag.
kubectl set image deploy/catalog catalog=bookstore/catalog:nonexistent -n bookstore
# DIAGNOSE (method: observe → describe → Events; NO logs needed — never ran):
kubectl get pods -n bookstore -l app=catalog # STATUS ImagePullBackOff
kubectl describe pod -n bookstore -l app=catalog | sed -n '/Events:/,$p'
# Failed to pull image "bookstore/catalog:nonexistent": ... not found
# → cause: image ref the node has no image for (here: never built/kind-loaded;
# in real life a typo or missing imagePullSecret). describe+Events, not logs,
# because the container never started — there is no log to read.
# REVERT (back to the canonical image/tag → known-good):
kubectl set image deploy/catalog catalog=bookstore/catalog:dev -n bookstore
kubectl rollout status deploy/catalog -n bookstore # Available again
3. Induced failure B — CrashLoopBackOff via broken config (and its revert)¶
# INDUCE (imperative env override; canonical 10-/15-/16- untouched): point
# catalog's DB_DSN at a host that does not resolve so it fails to start cleanly.
kubectl set env deploy/catalog -n bookstore \
DB_DSN="host=nope.invalid port=5432 user=bookstore password=devpassword dbname=bookstore sslmode=disable"
# DIAGNOSE (method: status → PREVIOUS logs, because it crashed):
kubectl get pods -n bookstore -l app=catalog # CrashLoopBackOff / Error
kubectl logs -n bookstore deploy/catalog --previous | tail -20
# dial error / cannot connect to host=nope.invalid → cause: bad config the
# app rejected at startup. `--previous` = the LAST crashed container's logs
# (the current one may be in backoff with no logs yet). For a distroless app
# with no startup log, `kubectl debug --target=catalog` to read
# /proc/1/environ confirms the bad env without a shell in the image.
# REVERT — remove the override so the manifest's real DB_DSN is restored:
kubectl set env deploy/catalog -n bookstore DB_DSN-
# trailing `-` DELETES the imperative env var → the Deployment's own DB_DSN
# (built from db-credentials, 16-) is back, byte-identical to canonical.
kubectl rollout status deploy/catalog -n bookstore
4. Induced failure C — OOMKilled via a throwaway low-limit copy (and its revert)¶
Never edit the canonical limit — apply a named copy and delete it:
# INDUCE: a THROWAWAY Deployment that is the catalog with an absurd 8Mi memory
# limit (a copy — raw-manifests/10- is NOT touched). The Go binary's working
# set exceeds 8Mi → the kernel OOM-kills it; the container restarts → loop.
kubectl create deployment catalog-oom -n bookstore --image=bookstore/catalog:dev --dry-run=client -o yaml \
| kubectl patch --local -f - -o yaml --type=json -p='[
{"op":"add","path":"/spec/template/spec/securityContext","value":{"runAsNonRoot":true,"runAsUser":65532,"seccompProfile":{"type":"RuntimeDefault"}}},
{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value":{"allowPrivilegeEscalation":false,"runAsNonRoot":true,"runAsUser":65532,"capabilities":{"drop":["ALL"]},"seccompProfile":{"type":"RuntimeDefault"}}},
{"op":"add","path":"/spec/template/spec/containers/0/resources","value":{"requests":{"cpu":"50m","memory":"8Mi"},"limits":{"cpu":"100m","memory":"8Mi"}}}
]' | kubectl apply -f -
# ^ the copy is restricted-compliant (bookstore enforces PSA restricted even
# for a throwaway). Only the memory LIMIT is pathological.
# DIAGNOSE (method: status → describe → Last State Reason):
kubectl get pods -n bookstore -l app=catalog-oom # CrashLoopBackOff
kubectl describe pod -n bookstore -l app=catalog-oom | grep -A3 'Last State'
# Last State: Terminated Reason: OOMKilled Exit Code: 137
# → cause: limit (8Mi) < container working set. exit 137 = 128+SIGKILL(9) =
# the OOM killer. Fix in reality = raise the memory limit to the real need
# (Part 01 ch.03 — the canonical catalog limit is 128Mi, which is fine).
# REVERT — delete the throwaway entirely (canonical catalog never involved):
kubectl delete deployment catalog-oom -n bookstore
5. Induced failure D — readiness fails / no Endpoints via a bad probe (and its revert)¶
# INDUCE (JSON-patch the LIVE Deployment's readiness probe to a wrong path;
# canonical 10- untouched — this patch is reverted in the same step):
kubectl patch deploy/catalog -n bookstore --type=json -p='[
{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/wrong"}]'
# DIAGNOSE (method: Running-but-not-Ready → describe probe → Endpoints):
kubectl get pods -n bookstore -l app=catalog # 0/1 READY (Running)
kubectl describe pod -n bookstore -l app=catalog | grep -A2 Readiness
# Readiness probe failed: HTTP 404 (path /wrong doesn't exist)
kubectl get endpoints catalog -n bookstore
# ENDPOINTS <none> → KEY INSIGHT: a not-Ready Pod is REMOVED from the
# Service Endpoints (Part 02 ch.02), so `catalog` Service has no backends and
# storefront's calls fail — the symptom often shows up one hop away. (The
# "no Endpoints" cause can also be Service.selector ≠ pod labels — check
# `kubectl get pods --show-labels` vs the Service selector; here it's the
# probe.)
# REVERT — restore the canonical readiness path (/readyz):
kubectl patch deploy/catalog -n bookstore --type=json -p='[
{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/readyz"}]'
kubectl rollout status deploy/catalog -n bookstore # 1/1 READY; Endpoints back
Bonus branch — NetworkPolicy block (xref Part 02 ch.06). With
60-networkpolicy.yamlapplied, omitting the DNS-egress allow (or a both-ends app allow) presents as "name resolution failed" / "connection timed out" with healthy Pods and correct Endpoints — the tell that it's the network layer, not the app. Diagnose from an ephemeral debug container:kubectl debug -it -n bookstore deploy/catalog --image=nicolaka/netshoot --target=catalog --profile=restricted -- nslookup postgres(Fails → DNS egress denied.) vs
nc -vz postgres 5432(fails but DNS ok → the postgres ingress/own egress allow is missing — NetworkPolicy needs the allow on both ends). Revert = re-apply the full60-networkpolicy.yaml(it is whitelist-correct as shipped). On kind's default CNI NetworkPolicy is a silent no-op (ch.06) — this branch is fully visible only on a policy-enforcing CNI.
6. Verify the cluster is back to known-good (the discipline check)¶
kubectl rollout status deploy/catalog deploy/storefront deploy/orders deploy/payments-worker -n bookstore
kubectl rollout status statefulset/postgres -n bookstore
kubectl get pods -n bookstore # ALL Running/Ready, == step 0 baseline
kubectl get deploy catalog -n bookstore -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
# bookstore/catalog:dev ← reverted; NO canonical manifest was ever edited
git -C "$(pwd)" status --porcelain examples/bookstore/ 2>/dev/null || \
echo "examples/bookstore/ unchanged (induced failures were imperative/throwaway only)"
kind delete cluster --name bookstore
How it works under the hood¶
- Why
describe/Events before logs. Pod lifecycle failures (Pending, ImagePull, CreateContainerConfigError) happen before the container's process ever runs, so there is no log — the explanation is in the kubelet/scheduler Events on the Pod object (the API server records them;describesurfaces them). Logs only exist once the container started; for CrashLoop you want--previousbecause the current container is in backoff (often no fresh log) while the crashed one's stderr holds the cause. - What each status actually means.
Pending= the scheduler found no node (resources/affinity/taint/unbound-PVC) or admission rejected it (PSA) — it never bound to a node.ImagePullBackOff= the kubelet can't pull the image (bad ref / auth / not present) and is backing off.CrashLoopBackOff= the container starts then exits, repeatedly, with exponential backoff.OOMKilled(exit 137 = 128+SIGKILL) = the cgroup memory limit was exceeded and the kernel OOM-killer reaped it (Part 01 ch.03).Evicted= the kubelet reclaimed the Pod under node pressure (mem/disk).Create ContainerConfigError= a referenced ConfigMap/Secret key is missing so the container spec can't be materialised. - No Endpoints is a labels/readiness fact, not a mystery. A Service's
Endpoints/EndpointSlice is the set of Pods that match its selector and
are Ready. So "Service works but returns nothing" is exactly one of: (a)
Service.spec.selector≠ Pod labels (the selector indexes nothing — Part 02 ch.02), or (b) the Pods are not Ready (failing readiness → pulled from Endpoints — Part 01 ch.02).kubectl get endpoints <SVC>+kubectl get pods --show-labelsdisambiguates in two commands. kubectl debugephemeral containers — the real mechanism. An ephemeral container is added to a running Pod via the Pod'sephemeralcontainerssubresource — no restart, no spec mutation of the workload. With--targetit joins the target container's PID namespace (you see its process tree,/proc/1/environ, open FDs) and shares the Pod's network namespace (its localhost, its DNS resolv.conf, its NetworkPolicy view) — which is precisely why it works for a distroless container that has none of those tools itself: you bring the tools in a sidecar that can see the target's namespaces.--copy-toinstead creates a copy of the Pod you can mutate freely (swap the image, add a command) without touching the live one — the right tool when you need to change the entrypoint, not just observe.- Why PSA gates the debugger, and what the profiles do. Admission
(Part 05 ch.02) evaluates the whole
Pod including ephemeral containers; in
bookstore(enforce: restricted) an ephemeral container that runs as root / keeps capabilities / lacks seccomp is rejected just like any container.kubectl debug --profile(GA 1.30) sets the security shape:restricted→ runAsNonRoot + drop ALL + seccomp RuntimeDefault (admits in arestrictedns — use this for bookstore);general→ minimal (fine in non-PSA namespaces);sysadmin/netadmin→ privileged/NET_ADMIN for node debugging (kubectl debug node/<N>), which is a node, not the restricted app ns. kubectl debug node/<NODE>is a different thing. It schedules a debug Pod onto the node with the host filesystem at/hostand (with--profile=sysadmin) host namespaces — for node-level problems (kubelet, containerd, disk pressure, CNI). On managed clusters node access is the constraint: the node Pod still works, but SSH/host-level fixes may be provider-mediated (you debug, the provider remediates the node).
Production notes¶
In production: the alert must link to the runbook branch. An alert that just says "catalog down" restarts the flailing this chapter eliminates. Wire each alert (Part 06 ch.01) to the specific decision-tree branch and the relevant
kubectl describe/logs --previous/kubectl debugcommands (and, for data loss, toDR-RUNBOOK.md). The method is only fast if it's attached to the page, not in someone's head.In production:
kubectl debugis a privilege — RBAC + PSA gate it deliberately. Creating ephemeral containers needspods/ephemeralcontainers(andpods/execfor exec) — many orgs scope this to on-call/SRE only (Part 05 ch.01), because an ephemeral container with the right image is a powerful in-cluster foothold. In arestrictednamespace insist on--profile=restricted; for tooling that genuinely needs more, debug a copy in a non-PSA namespace rather than weakening the app namespace. Audit ephemeral containers (they appear in the audit log andkubectl get pod -o yamlunderephemeralContainers) — a debug session in prod is a security event, not a free action.In production: distroless means
kubectl debugis the only in-Pod option — design for it. You cannotexec shintocatalog/orders/payments-workerby design (smaller attack surface — Part 00 ch.02). Standardise a vetted, non-root debug image (e.g. a pinnednicolaka/netshoot) allowed by policy, and make sure process-namespace sharing is acceptable in your threat model (the debug container can read the target's memory/env via/proc). Prefer fixing forward via the declarative source (Git — re-sync, Part 07 ch.04) over livekubectl edit; a live edit is exactly the drift GitOps self-heals away.In production: node-level and managed-cluster debugging is constrained.
kubectl debug node/<N> --profile=sysadminis your node window when SSH is closed, but on EKS/GKE/AKS the node is the provider's: you can observe (host/host, processes, kubelet logs) yet remediation (replace the node, drain & recycle) is often the right move over in-place host fixes. Pair node debugging with the cordon/drain discipline from ch.01 — a sick node is usually replaced, not nursed.
Quick Reference¶
# THE METHOD: observe → describe(+Events) → logs(--previous) → debug → fix → verify
kubectl get pods -n bookstore # 1. observe (status)
kubectl describe pod -n bookstore <POD> # 2. Events (pre-run fails)
kubectl logs -n bookstore <POD> --previous # 3. crashed-container log
kubectl get endpoints <SVC> -n bookstore # no-Endpoints check
kubectl get pods -n bookstore --show-labels # selector vs labels
# DISTROLESS debug (catalog/orders/payments-worker — NEVER `exec ... sh`):
kubectl debug -it -n bookstore deploy/catalog \
--image=nicolaka/netshoot --target=catalog --profile=restricted -- /bin/bash
kubectl debug -n bookstore <POD> --copy-to=dbg \
--set-image=catalog=busybox --profile=restricted # debug a COPY
kubectl debug node/<NODE> -it --profile=sysadmin # node-level (/host)
# REVERT primitives used above (no canonical manifest is ever edited):
kubectl set image deploy/<D> <C>=<GOOD-IMAGE> -n bookstore
kubectl set env deploy/<D> -n bookstore VAR- # delete imperative env
kubectl patch deploy/<D> -n bookstore --type=json -p='[…restore…]'
kubectl delete deployment <THROWAWAY> -n bookstore
Minimal debug-invocation skeleton (the shapes that PSA-restricted accepts):
# ephemeral container into a distroless pod, restricted-compliant:
kubectl debug -it <POD> -n bookstore \
--image=<non-root tool image> --target=<CONTAINER> --profile=restricted
# a freely-mutable COPY (change entrypoint/image without touching the live pod):
kubectl debug <POD> -n bookstore \
--copy-to=<DBG> --set-image=<CONTAINER>=<IMAGE> --profile=restricted
Checklist:
- Followed the method (observe → isolate → hypothesize → test → fix →
verify); used
describe/Events for pre-run failures,logs --previousfor crashes — not random commands - Used the per-symptom branch (Pending/ImagePull/CrashLoop/OOM/Evicted/ ConfigError/no-Endpoints/DNS/NetPol/PSA) for a deterministic path
- For distroless
catalog/orders/payments-worker:kubectl debugwith--profile=restricted(or a copy indefault) — neverexec sh; verified the shell-less reality once - Any pod landing in
bookstore(debug/ephemeral/throwaway/restore) is PSA-restricted-compliant (profile or restricted overrides) - Induced failures were imperative / throwaway copies with an explicit
revert each; no
raw-manifests/*edited; cluster verified back to known-good - Fix forward via the declarative source (GitOps re-sync) over live
kubectl edit;kubectl debugRBAC scoped + ephemeral containers audited
Test your understanding¶
Try each before opening the answer drawer. The act of trying is the exercise; the answer is the check.
-
A pod shows
Pendingfor ten minutes. Without doing anything else, list the four most likely categories of root cause and the onekubectlcommand that distinguishes between them.
Show answer
The four categories: (1) **No node fits** (insufficient CPU/memory, no node matches affinity/tolerations), (2) **PVC unbound** (the requested StorageClass has no provisioner / volume), (3) **Admission rejection** (PSA, ResourceQuota, webhook), (4) **Image still pulling** (which would be `ContainerCreating`, not Pending — but worth checking). One command: `kubectl describe pod` — the `Events:` section names exactly which of these is blocking, with the specific reason. **Always `describe` first**; logs don't exist for a pod that hasn't started. -
You run
kubectl exec -it catalog-xyz -- shto debug; it returnsexec: "sh": executable file not found in $PATH. What changed about the Bookstore that made this fail, and what's the correct command?
Show answer
catalog/orders/payments-worker run on **distroless** images (`gcr.io/distroless/static:nonroot`) — by design they have no shell, no `ps`, no `curl`, nothing but the Go binary. That's a security feature ([Part 05 ch.03](../05-security/03-supply-chain.md)), not a bug. The correct command is `kubectl debug -it -n bookstore deploy/catalog --image=nicolaka/netshoot --target=catalog --profile=restricted -- /bin/bash` — an **ephemeral container** sharing the pod's process/network namespaces, with debug tools the distroless image lacks, and `--profile=restricted` so PSA admits it. -
A teammate runs
kubectl edit deployment catalogat 2am to mitigate an outage by raising CPU limit. Why is this almost always the wrong action, even when it "fixes" the symptom?
Show answer
Three reasons: (1) **Drift** — the change isn't in Git, so the next Argo CD reconcile (or any future apply from Helm/Kustomize) erases it; with `selfHeal: true` the mitigation is gone within 3 minutes. (2) **No audit trail** — nobody else knows it happened; the postmortem will be confused. (3) **Inhibits root cause** — the symptom is silenced without finding the bug. The right path is to *both* triage (perhaps a temporary HPA bump or replica scale via the proper API) **and** open a PR with the durable fix; let GitOps deploy it. Fix forward via the declarative source. -
Hands-on extension — induce a fail and revert. Run
kubectl set image deploy/catalog catalog=bookstore/catalog:nonexistent -n bookstore. Wait. Observekubectl get podsandkubectl describe pod. Then revert. What does each phase look like?
What you should see
The new ReplicaSet creates pods that go to `ImagePullBackOff` / `ErrImagePull`. `kubectl describe pod` shows: `Failed to pull image "bookstore/catalog:nonexistent": ... not found`. The rolling deployment paces itself — old replicas stay healthy (rolling update keeps available count), so the *service* keeps serving from old pods, but no new pods come up. Revert: `kubectl set image deploy/catalog catalog=bookstore/catalog:dev -n bookstore`. The new ReplicaSet pulls successfully, scales up, the broken one scales down. Cluster is back to known-good with no manifest edited. -
The method is "observe → isolate → hypothesize → test → fix → verify". Which step do most engineers skip under incident pressure, and what's the consequence?
Show answer
**Isolate** — they jump from "observe a 500" straight to "hypothesize the database is down" (or whatever they fixed *last* time). The consequence is spending 30 minutes chasing the wrong layer while the real failure is, say, a NetworkPolicy block or a PSA rejection on a sidecar. The decision tree per symptom exists exactly because isolation should be cheap and deterministic: `kubectl get endpoints svc/catalog` (Service or pods?) takes 5 seconds and rules out half the tree. The method is the discipline that prevents the most expensive failure mode of debugging: working on the wrong thing.
Further reading¶
- Rosso et al., Production Kubernetes, ch.9 — Observability (turning signals into a diagnosis: the operational debugging posture, alert→runbook linkage, and how production teams isolate failures across layers).
- Lukša, Kubernetes in Action 2e — the debugging material across the Pod-lifecycle and Services chapters (ch.5/6 and ch.11): reading Pod status/Events, probes and Endpoints, and why a not-Ready Pod leaves a Service.
- Official: debug running Pods (ephemeral containers &
kubectl debug) https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/, the debug-Pods task https://kubernetes.io/docs/tasks/debug/debug-application/, and troubleshooting Services https://kubernetes.io/docs/tasks/debug/debug-application/debug-service/.