04 — Autoscaling¶
HPA v2 in depth (metric types Resource/ContainerResource/Pods/Object/External, the ceil(current·curMetric/targetMetric) algorithm, stabilization/behavior/tolerance), VPA (recommender/updater/admission; modes; the HPA conflict), Cluster Autoscaler vs Karpenter (node scaling), and KEDA (event-driven, wraps an HPA, scale-to-zero), applied by adding the payments-worker Deployment, an HPA on catalog, and a KEDA ScaledObject that scales payments-worker on RabbitMQ queue depth.
Estimated time: ~30 min read · ~90 min hands-on
Prerequisites: Part 01 ch.03 (requests/limits feed HPA & VPA) · Part 06 ch.01 (metrics-server and custom metrics are the HPA's input) · Part 04 ch.01 (node scaling changes which nodes the scheduler can place Pods on)
You'll know after this:
- derive HPA replica counts from ceil(current·curMetric/targetMetric) and tune behavior/stabilization
- choose between HPA, VPA, Cluster Autoscaler, Karpenter and KEDA for a workload
- avoid the HPA + VPA conflict and use VPA in recommender mode safely
- write a KEDA ScaledObject for RabbitMQ queue depth with scale-to-zero
- add payments-worker, an HPA on catalog and queue-driven scaling to the Bookstore
Why this exists¶
The Bookstore has fixed replica counts: catalog 3, orders 2, payments-worker soon. A fixed count is wrong both ways. Under a traffic spike (a sale), 3 catalog Pods saturate and latency/errors climb — the ch.01 RED signals go red. At 3am with no traffic, those same Pods sit idle burning cluster capacity you pay for (ch.06). And the queue consumer is worse: when no orders flow, a running payments-worker does nothing but cost money; when a flood of orders hits, one consumer cannot drain the queue and payments lag.
Autoscaling makes capacity track demand: more Pods when busy, fewer (or zero) when idle, and more nodes when the Pods don't fit. This is the Elastic Scale pattern. It has three independent layers that must be understood separately: scale a workload out and in (HPA / KEDA), size its Pods up and down (VPA), and scale the cluster's nodes (Cluster Autoscaler / Karpenter). They compose; they do not substitute.
Mental model¶
Four mechanisms, four jobs:
- HorizontalPodAutoscaler (HPA): more/fewer replicas of a workload based on a metric. A control loop: observe a metric, compute a desired replica count, patch .spec.replicas. The default for stateless, request-serving services (catalog). GA at autoscaling/v2.
- VerticalPodAutoscaler (VPA): bigger/smaller requests for a workload's containers based on observed usage. Right-sizes each Pod rather than adding Pods. An add-on (not built-in). Its recommendation mode is the sizing tool for ch.06.
- KEDA: event-driven horizontal scaling, including to zero, on external signals (queue length, stream lag, cron, …). KEDA does not replace the HPA; it creates and drives one, adding scale-from/to-zero and dozens of scalers. The right tool for the payments-worker (scale on RabbitMQ queue depth).
- Cluster Autoscaler / Karpenter: more/fewer nodes. When HPA/KEDA create Pods that don't fit, the node autoscaler adds nodes; when nodes are underused it removes them. Cluster Autoscaler resizes predefined node groups; Karpenter provisions right-sized nodes just-in-time from a flexible set of instance types.
The HPA algorithm — memorise it, it explains every behaviour:
desiredReplicas = ceil[ currentReplicas × (currentMetricValue / targetMetricValue) ]
evaluated per metric; with multiple metrics the HPA computes a desired count for each and uses the maximum (never the sum — it scales for the most-stressed dimension). A tolerance (default 10%) suppresses scaling when the ratio is within ±10% of 1.0 (no thrashing around the target). KEDA's managed HPA runs this same formula on the queue-length metric.
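A minimal Go sketch of the formula above. It assumes the default 10% tolerance and leaves out the rest of the real controller's work (clamping to minReplicas/maxReplicas, handling unready Pods, applying behavior). The function names are illustrative, not the controller's own:

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is the HPA's default dead band around the target (±10%).
const tolerance = 0.1

// desiredForMetric applies ceil(current × cur/target) for one metric and
// returns current unchanged when the ratio is within tolerance of 1.0.
func desiredForMetric(current int, cur, target float64) int {
	ratio := cur / target
	if math.Abs(ratio-1.0) <= tolerance {
		return current // inside the dead band: no change
	}
	return int(math.Ceil(float64(current) * ratio))
}

// desiredReplicas evaluates every metric and keeps the largest proposal,
// so the most-stressed dimension wins (never the sum).
func desiredReplicas(current int, cur, target []float64) int {
	best := 0
	for i := range cur {
		if d := desiredForMetric(current, cur[i], target[i]); d > best {
			best = d
		}
	}
	return best
}

func main() {
	// catalog at 4 replicas: CPU 80% vs a 70% target, req/s 40 vs a 50 target.
	fmt.Println(desiredReplicas(4, []float64{80, 40}, []float64{70, 50}))
}
```

At 4 replicas, CPU proposes ceil(4 × 80/70) = 5 and req/s proposes ceil(4 × 40/50) = 4. The HPA takes the max, so it moves to 5.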
Diagrams¶
The HPA control loop (Mermaid)¶
flowchart TD
ms["metrics source
metrics-server (Resource)
/ custom / external adapter"]
hpa["HPA controller
(every ~15s)"]
calc["desired = ceil(current x
curMetric / targetMetric)
per metric -> take MAX"]
tol{"within tolerance
(±10%)?"}
beh["apply behavior:
stabilization window
+ scale policies"]
scale["patch Deployment
.spec.replicas"]
rs["ReplicaSet adds/removes
Pods -> scheduler places"]
ms --> hpa --> calc --> tol
tol -- yes --> hpa
tol -- no --> beh --> scale --> rs
rs -. "new metric reading" .-> ms
KEDA: RabbitMQ queue depth → ScaledObject → managed HPA → payments-worker (Mermaid)¶
flowchart LR
ord["orders Pods
publish to queue"]
mq["RabbitMQ
'orders' queue (depth N)"]
keda["KEDA operator
+ metrics adapter"]
so["ScaledObject
(83-): trigger rabbitmq
value=20, min=0 max=10"]
hpa["managed HPA
(KEDA creates it)"]
pw["payments-worker
Deployment (19-)"]
ord --> mq
so --> keda
keda -->|reads queue depth N| mq
keda -->|feeds external metric| hpa
hpa -->|"ceil(N / 20)"| pw
keda -->|"N not above activationValue for cooldown -> scale to 0;
N above activationValue -> scale 0->1"| pw
pw -->|consume, drain| mq
Which autoscaler for what (ASCII)¶
LAYER        SCALES              TRIGGER                   BOOKSTORE USE
──────────────────────────────────────────────────────────────────────────────────────
HPA          replicas            CPU / mem / custom /      catalog (CPU + req/s)
                                 external metric           orders (CPU), optional
VPA          requests            observed usage history    sizing tool, ch.06
             (recommend | auto)                            (recommend mode on catalog)
KEDA         replicas            events: queue len,        payments-worker
             (incl. 0)           stream lag, cron, ...     (RabbitMQ queue depth)
Cluster      nodes               unschedulable Pods /      conceptual (cloud);
Autoscaler                       underused nodes           kind has fixed nodes
Karpenter    nodes               unschedulable Pods ->     conceptual (AWS); JIT,
             just-in-time        right-sized node          instance-type flexible
RULE: HPA and VPA must NOT both act on the SAME resource metric (they fight).
HPA scales out; VPA(recommend) sizes; CA/Karpenter add nodes for them.
Hands-on with the Bookstore¶
Assumed working directory: the guide repo root (full-guide/).
We will: (1) add the payments-worker Deployment (its first manifest); (2) add an HPA on catalog; (3) install KEDA and add a ScaledObject scaling payments-worker on RabbitMQ queue depth; (4) load-test both with a restricted-compliant public-image generator.
0. Prerequisites (self-bootstrapping)¶
Bring up the cluster + Bookstore as in ch.01 step 0, including metrics-server (the HPA's CPU metric needs it). Also apply the async backend (RabbitMQ):
kubectl apply -f examples/bookstore/raw-manifests/13-rabbitmq.yaml
kubectl rollout status deployment/rabbitmq -n bookstore
# (metrics-server install: ch.01 step 1 — required for the HPA CPU metric.)
1. Add the payments-worker Deployment¶
The worker has had source (app/payments-worker/main.go) since Part 00 but no manifest, and KEDA needs a real Deployment to scale. New files:
- A dedicated payments-worker-sa added to examples/bookstore/raw-manifests/05-serviceaccounts-rbac.yaml (automountServiceAccountToken: false, no Role/RoleBinding: the worker never calls the API; same posture as orders-sa).
- examples/bookstore/raw-manifests/19-payments-worker-deploy.yaml: the Deployment. Shape (full restricted securityContext like catalog, AMQP_URL → rabbitmq Service, scheduling layer consistent with siblings):
# (excerpt, condensed for the chapter; the authoritative full spec, incl.
# probe tuning fields, is examples/bookstore/raw-manifests/19-payments-worker-deploy.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-worker
  namespace: bookstore
  labels: { app: payments-worker, component: worker, app.kubernetes.io/part-of: bookstore }
spec:
  replicas: 1                          # resting state; KEDA (83-) OWNS the count
  selector: { matchLabels: { app: payments-worker } }
  template:
    metadata:
      labels: { app: payments-worker, component: worker }
    spec:
      serviceAccountName: payments-worker-sa
      automountServiceAccountToken: false
      securityContext:                 # PSA restricted (pod), same as catalog 10-
        runAsNonRoot: true
        runAsUser: 65532
        runAsGroup: 65532
        seccompProfile: { type: RuntimeDefault }
      priorityClassName: bookstore-critical   # checkout path: above batch
      topologySpreadConstraints:       # spread KEDA-scaled replicas
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway   # availability > strict spread
          labelSelector: { matchLabels: { app: payments-worker } }
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector: { matchLabels: { app: payments-worker } }
      containers:
        - name: payments-worker
          image: bookstore/payments-worker:dev
          imagePullPolicy: IfNotPresent
          securityContext:             # PSA restricted (container)
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
          ports: [{ name: metrics, containerPort: 8080 }]   # /healthz + /metrics
          env:
            - { name: PORT, value: "8080" }
            - { name: LOG_LEVEL, value: "info" }
            - name: AMQP_URL                                 # the async edge
              value: "amqp://guest:guest@rabbitmq.bookstore.svc.cluster.local:5672/"
          # Only /healthz exists (a consumer has no /readyz); ALL THREE probes
          # use it (full failureThreshold/timeout tuning is in the saved 19-).
          startupProbe: { httpGet: { path: /healthz, port: metrics }, periodSeconds: 5, failureThreshold: 30 }
          livenessProbe: { httpGet: { path: /healthz, port: metrics }, periodSeconds: 10 }
          readinessProbe: { httpGet: { path: /healthz, port: metrics }, periodSeconds: 5 }
          lifecycle: { preStop: { sleep: { seconds: 5 } } }   # native; distroless-safe
          resources:
            requests: { cpu: 50m, memory: 64Mi }
            limits: { cpu: 250m, memory: 128Mi }
          volumeMounts: [{ name: tmp, mountPath: /tmp }]      # RO-rootfs scratch
      volumes: [{ name: tmp, emptyDir: { sizeLimit: 32Mi } }]
      terminationGracePeriodSeconds: 30
Decisions (documented in the manifest header):
- No DB_DSN: the worker never touches Postgres (contrast catalog/orders).
- Only /healthz: a queue consumer has no inbound traffic and no /readyz (app/payments-worker/main.go). All three probes hit /healthz, which answers iff the process and its consume loop are alive.
- topologySpread ScheduleAnyway (not DoNotSchedule like catalog/storefront): a KEDA-scaled replica must never be left Pending for the sake of spread; draining the queue wins.
- replicas: 1 is just the resting state, because KEDA owns the count. Do not also hand-write an HPA for it: two controllers would fight over .spec.replicas.
- A headless metrics-only Service was added to 40-services.yaml (clusterIP: None) so the ch.01 ServiceMonitor can scrape payments_processed_total. The worker takes no request traffic, so there is nothing to load-balance.
60-networkpolicy.yaml was extended (both ends, additive): rule 6's rabbitmq ingress now also admits app: payments-worker, and a new rule 9 grants payments-worker egress to rabbitmq:5672 (DNS egress is the pre-existing rule 2, which already selects every Pod). All 8 prior policies are otherwise unchanged.
kubectl apply -f examples/bookstore/raw-manifests/05-serviceaccounts-rbac.yaml
kubectl apply -f examples/bookstore/raw-manifests/19-payments-worker-deploy.yaml
kubectl apply -f examples/bookstore/raw-manifests/40-services.yaml
kubectl apply -f examples/bookstore/raw-manifests/60-networkpolicy.yaml # if a policy CNI runs
kubectl rollout status deployment/payments-worker -n bookstore
# The bookstore ns is enforce:restricted — this Pod is admitted with ZERO
# PodSecurity warnings (proven by a server-side dry-run; same method Part 05
# ch.02 used for catalog/orders).
2. An HPA on catalog (CPU + a custom req/s metric)¶
New file examples/bookstore/raw-manifests/82-hpa-catalog.yaml: a built-in autoscaling/v2 object (no CRD; dry-runs cleanly):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: catalog, namespace: bookstore }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: catalog }
  minReplicas: 2                       # user-facing: never below 2 (stay HA)
  maxReplicas: 6                       # bounded by the namespace ResourceQuota (00-)
  metrics:
    - type: Resource                   # needs metrics-server (ch.01)
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
    - type: Pods                       # ILLUSTRATIVE: needs the Prometheus Adapter
      pods:
        metric: { name: http_requests_per_second }
        target: { type: AverageValue, averageValue: "50" }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - { type: Percent, value: 100, periodSeconds: 60 }   # at most 2x …
        - { type: Pods, value: 2, periodSeconds: 60 }        # … or +2 Pods
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300  # scale IN slowly (anti-thrash)
      policies: [{ type: Percent, value: 50, periodSeconds: 60 }]
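How 82-'s behavior block bounds a single step can be sketched in Go. This is a simplified model: it assumes no scaling has happened yet in the current period and ignores the stabilization windows. Policy, scaleUpLimit, scaleDownLimit and bound are hypothetical names, and rounding the scale-down percentage down is this sketch's choice:

```go
package main

import (
	"fmt"
	"math"
)

// Policy mirrors one entry of behavior.scaleUp/scaleDown.policies.
type Policy struct {
	Type  string // "Percent" or "Pods"
	Value int
}

// scaleUpLimit: with selectPolicy Max, the most permissive (largest) proposal wins.
func scaleUpLimit(current int, policies []Policy) int {
	limit := current
	for _, p := range policies {
		proposed := current
		switch p.Type {
		case "Pods":
			proposed = current + p.Value
		case "Percent":
			proposed = int(math.Ceil(float64(current) * (1 + float64(p.Value)/100)))
		}
		if proposed > limit {
			limit = proposed
		}
	}
	return limit
}

// scaleDownLimit: with selectPolicy Max (the default), the policy allowing
// the biggest drop wins, i.e. the lowest floor.
func scaleDownLimit(current int, policies []Policy) int {
	limit := current
	for _, p := range policies {
		proposed := current
		switch p.Type {
		case "Pods":
			proposed = current - p.Value
		case "Percent":
			proposed = int(float64(current) * (1 - float64(p.Value)/100)) // rounded down
		}
		if proposed < limit {
			limit = proposed
		}
	}
	return limit
}

// bound clamps the algorithm's proposal into what behavior allows this period.
func bound(current, desired int, up, down []Policy) int {
	if desired > current {
		if l := scaleUpLimit(current, up); desired > l {
			return l
		}
		return desired
	}
	if desired < current {
		if l := scaleDownLimit(current, down); desired < l {
			return l
		}
		return desired
	}
	return current
}

func main() {
	up := []Policy{{"Percent", 100}, {"Pods", 2}} // 82-'s scaleUp policies
	down := []Policy{{"Percent", 50}}             // 82-'s scaleDown policy
	fmt.Println(bound(2, 10, up, down)) // both policies allow 4
	fmt.Println(bound(3, 10, up, down)) // 2x (6) beats +2 (5)
	fmt.Println(bound(6, 2, up, down))  // at most -50% per period: 3
}
```

Pairing a Percent policy with a Pods policy under selectPolicy: Max lets the step grow with the fleet while still giving small fleets a useful minimum jump.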
Honest about the second metric (as ch.03 was about app instrumentation). The CPU metric works with just metrics-server. The Pods custom metric (http_requests_per_second) requires the Prometheus Adapter to publish it through custom.metrics.k8s.io from the catalog metrics ch.01 already exports. Without that adapter the HPA reports <UNKNOWN> for it. It can still scale up on CPU, but it skips any scale-down while a metric is unreadable. The manifest header says so; the metric is included to show a real multi-metric HPA and the RED "Rate" signal as a scaling input.
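The rule for an unreadable metric, as the Kubernetes HPA algorithm documentation describes it, can be sketched in Go. The names proposal and combine are hypothetical. Each proposal is assumed to have already been computed with the per-metric formula:

```go
package main

import "fmt"

// proposal is one metric's desired replica count; ok=false models a metric
// the HPA cannot read (shown as <UNKNOWN>).
type proposal struct {
	desired int
	ok      bool
}

// combine takes the max over readable metrics, but holds the current count
// instead of scaling down while any metric is unreadable.
func combine(current int, ps []proposal) int {
	best, missing := -1, false
	for _, p := range ps {
		if !p.ok {
			missing = true
			continue
		}
		if p.desired > best {
			best = p.desired
		}
	}
	if best < 0 {
		return current // nothing readable: no change
	}
	if missing && best < current {
		return current // scale-down skipped while a metric is unknown
	}
	return best
}

func main() {
	reqsUnknown := proposal{ok: false} // no Prometheus Adapter installed
	fmt.Println(combine(3, []proposal{{desired: 5, ok: true}, reqsUnknown})) // scales up to 5
	fmt.Println(combine(3, []proposal{{desired: 2, ok: true}, reqsUnknown})) // holds at 3
}
```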
kubectl apply -f examples/bookstore/raw-manifests/10-catalog-deploy.yaml # the target
kubectl apply -f examples/bookstore/raw-manifests/82-hpa-catalog.yaml
kubectl get hpa catalog -n bookstore -w # TARGETS shows current/target
kubectl describe hpa catalog -n bookstore # per-metric values + events
3. KEDA: scale payments-worker on RabbitMQ queue depth¶
Install KEDA via its Helm chart into a dedicated keda namespace. That namespace is not PSA-restricted, which is fine, exactly like the observability stacks in ch.01. Use Helm rather than a raw releases/latest/download/<PINNED-FILE>.yaml URL: that URL 404s the moment a newer KEDA ships, because latest/ resolves to the newest tag while the filename is version-pinned. Helm is also the install method this part uses for the other add-ons (kube-prometheus-stack, Loki, Alloy, Tempo and the OTel Collector):
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda --create-namespace --wait
kubectl -n keda rollout status deployment/keda-operator
New file examples/bookstore/raw-manifests/83-keda-scaledobject.yaml (a Secret + TriggerAuthentication + ScaledObject). Core object:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: payments-worker, namespace: bookstore }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: payments-worker }
  pollingInterval: 15                  # check queue depth every 15s
  cooldownPeriod: 60                   # 60s idle before scaling to 0
  minReplicaCount: 0                   # SCALE TO ZERO when the queue is empty
  maxReplicaCount: 10                  # bounded by the namespace ResourceQuota (00-)
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp                 # read queue length over AMQP (5672)
        queueName: orders              # the queue orders publishes / worker consumes
        mode: QueueLength
        value: "20"                    # target msgs/replica -> desired=ceil(len/20)
        activationValue: "1"           # 0 -> 1 only once MORE than 1 message waits ("0" wakes on the first)
      authenticationRef: { name: rabbitmq-trigger-auth }
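The two trigger comments can be checked with a small Go sketch of the combined KEDA + managed-HPA result for a single queue-length reading. It assumes KEDA's activation rule (the scaler is active only while the metric is above activationValue) and ignores pollingInterval and cooldownPeriod timing. desiredWorkers is a hypothetical name; the constants are copied from the excerpt above:

```go
package main

import (
	"fmt"
	"math"
)

// Values copied from the ScaledObject excerpt (83-).
const (
	targetPerReplica = 20.0 // trigger metadata.value
	activationValue  = 1.0  // trigger metadata.activationValue
	minReplicaCount  = 0
	maxReplicaCount  = 10
)

// desiredWorkers returns the steady-state replica count for one reading.
// KEDA decides active vs inactive; once active, the managed HPA's
// AverageValue math reduces to ceil(queueLength / value), kept in [1, max].
func desiredWorkers(queueLength float64) int {
	if queueLength <= activationValue {
		return minReplicaCount // inactive: KEDA scales to 0 after cooldownPeriod
	}
	d := int(math.Ceil(queueLength / targetPerReplica))
	if d < 1 {
		d = 1
	}
	if d > maxReplicaCount {
		d = maxReplicaCount
	}
	return d
}

func main() {
	for _, n := range []float64{0, 1, 2, 45, 1000} {
		fmt.Printf("queue=%v -> replicas=%d\n", n, desiredWorkers(n))
	}
}
```

One waiting message stays at 0 with activationValue "1". A backlog of 45 needs 3 workers, and anything above 200 messages is capped at maxReplicaCount.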
KEDA wraps an HPA; see it for yourself. 83- does not contain an HPA, yet:
kubectl apply -f examples/bookstore/raw-manifests/83-keda-scaledobject.yaml
kubectl get scaledobject,hpa -n bookstore
# NAME ...
# scaledobject.keda.sh/payments-worker
# NAME ...
# horizontalpodautoscaler.../keda-hpa-payments-worker   <-- KEDA made this
KEDA's operator created a managed HPA whose external metric is the live "orders" queue length. That HPA runs the same formula as the catalog HPA. Because the target is an average value per replica, replicas × (len/replicas)/value reduces to ceil(len/value). The piece a plain HPA cannot do, 0↔1, is done by KEDA itself. It patches the Deployment to 0 after cooldownPeriod of queue ≤ activationValue, and back to 1 on the first poll that finds the queue above activationValue.
CRD-backed: expected dry-run note. ScaledObject/TriggerAuthentication are KEDA CRDs (keda.sh/v1alpha1). Without KEDA installed, kubectl apply --dry-run=client -f 83-... prints no matches for kind "ScaledObject" in version "keda.sh/v1alpha1". That is expected, not a defect, exactly like the Gateway API (Part 02 ch.05) and Kyverno (Part 05 ch.03) objects, and like 80-/81- from ch.01. The 19- Deployment and the 82- HPA are built-ins and do dry-run cleanly.
4. Load-test both (restricted-compliant generator)¶
Drive catalog with HTTP, and drive orders to fill the queue, using public-image load generators. PSA restricted is satisfied via --overrides. Both images (rakyll/hey and curlimages/curl) run fine as non-root, and both are pinned to a tag, never :latest, per this guide's image policy.
First, the catalog HPA (works as-is):
# Hammer catalog -> CPU/req-rate rises -> the catalog HPA scales out (watch §2).
kubectl run hey -n bookstore --rm -it --restart=Never \
--image=ghcr.io/rakyll/hey:0.1.4 \
--overrides='{"spec":{"securityContext":{"runAsNonRoot":true,"runAsUser":65532,"seccompProfile":{"type":"RuntimeDefault"}},"containers":[{"name":"hey","image":"ghcr.io/rakyll/hey:0.1.4","args":["-z","2m","-c","50","http://catalog.bookstore.svc.cluster.local/books"],"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"readOnlyRootFilesystem":true}}]}}'
Then KEDA, but enable order publishing FIRST (read this before flooding). The canonical 14-orders-deploy.yaml leaves AMQP_URL unset (Part-02 networking scaffolding), so orders silently drops events and nothing is ever enqueued. The flood below would then do nothing to the queue, and KEDA would never scale. Point orders at the same rabbitmq Service the worker uses. This is a demo-only env override on the live Deployment, not an edit to 14-; revert it afterwards:
# DEMO-ONLY: make orders actually publish to the 'orders' queue.
kubectl set env deployment/orders -n bookstore \
AMQP_URL="amqp://guest:guest@rabbitmq.bookstore.svc.cluster.local:5672/"
kubectl rollout status deployment/orders -n bookstore
# now: flood orders -> the 'orders' queue grows -> KEDA scales payments-worker
# from 0/1 up toward maxReplicaCount; watch it drain then scale back to 0.
kubectl run flood -n bookstore --rm -it --restart=Never \
--image=curlimages/curl:8.10.1 \
--overrides='{"spec":{"securityContext":{"runAsNonRoot":true,"runAsUser":65534,"seccompProfile":{"type":"RuntimeDefault"}},"containers":[{"name":"flood","image":"curlimages/curl:8.10.1","command":["sh","-c","for i in $(seq 1 500); do curl -s -o /dev/null -XPOST http://orders.bookstore.svc.cluster.local/orders -d \"{\\\"book_id\\\":1,\\\"qty\\\":1}\"; done; echo done"],"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"readOnlyRootFilesystem":true}}]}}'
watch kubectl get hpa,scaledobject,deploy/payments-worker,deploy/catalog -n bookstore
# Revert the demo-only change (restore the canonical AMQP_URL-unset behaviour):
kubectl set env deployment/orders -n bookstore AMQP_URL-
kubectl rollout status deployment/orders -n bookstore
(14- is intentionally left unchanged on disk. AMQP_URL arrives there permanently in Part 07, when the broker gets real credentials; this chapter only flips it at runtime to demonstrate the scaler, then reverts.)
How it works under the hood¶
- The HPA loop. The horizontal-pod-autoscaler controller (in kube-controller-manager) reconciles each HPA every --horizontal-pod-autoscaler-sync-period (default 15s). It reads each metric (Resource/ContainerResource from metrics.k8s.io; Pods/Object from custom.metrics.k8s.io; External from external.metrics.k8s.io) and computes ceil(currentReplicas × curMetric/targetMetric) per metric, then takes the max. It applies tolerance (skip if within ±10% of target) and behavior (the scale-down stabilization window uses the highest desired count over the trailing window; scale policies bound the step). Finally it patches spec.replicas through the target's scale subresource, and the Deployment's ReplicaSet does the rest. It needs a metrics API server registered, or the metric reads as <UNKNOWN>.
- Metric types. Resource = utilization vs the container request, so request size changes the effective target (which interacts with VPA). ContainerResource (GA in 1.30) targets one named container's resource, so a sidecar cannot skew the Pod average. Pods = a per-Pod average of a custom metric. Object = a single metric describing one object. External = a metric with no Kubernetes object behind it (queue length); this is the type KEDA feeds.
- VPA internals. Three components: the recommender consumes usage history (from metrics-server/Prometheus) and computes target/lowerBound/upperBound requests; the updater evicts Pods whose requests are out of bounds; the admission webhook rewrites requests on (re)creation. Modes: Off (recommend only; the sizing tool, ch.06), Initial (set on create only), Auto/Recreate (evict to resize; in-place resize is maturing, but VPA classically restarts the Pod).
- Why HPA + VPA on the same metric conflict. HPA scales replicas on, say, CPU utilization, while VPA in Auto rewrites the CPU request. But utilization is usage ÷ request, so VPA changing the request moves the very denominator the HPA divides by, and the two chase each other (oscillation). Supported combinations: HPA on CPU/mem with VPA in Off (recommend), or HPA on a custom/external metric (e.g. req/s, queue depth) with VPA on CPU/mem (different signals). The Multidimensional Pod Autoscaler (GKE) and in-place resize aim to make this safer, but the conflict rule still holds by default; hence the explicit warning in 82-'s header.
- KEDA = activation + a managed HPA. The KEDA operator watches ScaledObjects and does two things for each. (a) It creates an HPA with an External metric served by KEDA's own metrics adapter (the live trigger value, e.g. queue length), so 1→N scaling is ordinary HPA math. (b) It handles activation: once the trigger has been inactive (≤ activationValue) for cooldownPeriod, KEDA scales the Deployment to minReplicaCount (which can be 0), below the managed HPA's floor of 1, and the first poll above activationValue reactivates it. An HPA alone cannot reach or leave 0, because its minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled. That 0↔1 edge is exactly KEDA's added value, and it is why a queue consumer (idle most of the night) is KEDA's canonical use case.
- Node autoscaling closes the loop. HPA/KEDA create Pods; if no node has room, they stay Pending. Cluster Autoscaler watches for unschedulable Pods and grows a node group. It also scales down nodes whose Pods can be safely moved, respecting PDBs (ch.05). Karpenter instead provisions a right-sized node for the pending Pods' exact requests/constraints from a broad instance-type set (faster, less waste, no predefined groups). Both react to the scheduler's verdict (Part 04 ch.01): they are the node tier under the Pod tier.
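The denominator problem in the HPA + VPA conflict can be made concrete with a Go sketch. The numbers (4 replicas, 140m of CPU used per Pod, requests of 200m and then 150m) are hypothetical. hpaDesired reuses the single-metric formula with the default 10% tolerance:

```go
package main

import (
	"fmt"
	"math"
)

// utilization is what a Resource/Utilization metric measures:
// usage as a percentage of the container's request.
func utilization(usageMilli, requestMilli float64) float64 {
	return 100 * usageMilli / requestMilli
}

// hpaDesired applies ceil(current × util/target) with the 10% tolerance.
func hpaDesired(current int, utilPct, targetPct float64) int {
	ratio := utilPct / targetPct
	if math.Abs(ratio-1) <= 0.1 {
		return current
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	const replicas, usage, target = 4, 140.0, 70.0
	before := hpaDesired(replicas, utilization(usage, 200), target) // request 200m: 70%
	after := hpaDesired(replicas, utilization(usage, 150), target)  // VPA cut it to 150m: ~93%
	fmt.Println(before, after)
}
```

The load never changed, yet the HPA's answer jumps from 4 to 6 purely because VPA moved the request. With more Pods, per-Pod usage falls, VPA lowers the request again, and the loop continues.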
Production notes¶
In production: HPA for stateless request services, KEDA for event/queue consumers. Sizing a queue worker by CPU is a lagging, wrong proxy — scale it by the backlog (queue length / lag). The Bookstore does exactly this: catalog→HPA(CPU+req/s), payments-worker→KEDA (RabbitMQ depth) with scale-to-zero so an idle consumer costs nothing (ch.06).
In production: never run HPA and VPA on the same resource metric. Use VPA in recommendation mode to set good requests (ch.06) and HPA to scale out on CPU/custom/external — different jobs, different signals. Letting both rewrite/divide CPU produces replica + request oscillation.
In production: tune behavior deliberately. The most common HPA failure is scaling in too aggressively: a brief dip drops replicas, the next spike finds too few, and latency craters. Scale up fast and scale down slow (a long scaleDown.stabilizationWindowSeconds, as in 82-). Always set minReplicas ≥ 2 for anything user-facing: HA, not just load.
In production: pick the node autoscaler for the platform. EKS: Cluster Autoscaler on managed node groups, or Karpenter (AWS-native, JIT right-sized nodes, spot-friendly, ch.06). GKE: cluster autoscaler / node auto-provisioning (and the MPA for the HPA+VPA case). AKS: cluster autoscaler. Pod autoscaling (HPA/KEDA/VPA) is portable; the node layer is provider-specific. Size maxReplicas against the namespace ResourceQuota (Part 01 ch.03 / ch.06) so scaling can't exhaust the cluster or the bill.
In production: the custom/external metrics path is real infrastructure. An HPA on req/s or queue depth needs the Prometheus Adapter or KEDA registered as a metrics API server, monitored like any other dependency. If the metrics API is down, an HPA driven only by that metric freezes at its last replica count (it does not scale to zero or to max).
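Sizing maxReplicas against the quota is arithmetic over each resource the quota caps. Below is a Go sketch. It assumes a quota on requests.cpu and requests.memory and uses hypothetical numbers, not the Bookstore's actual 00- values. It also ignores rolling-update surge, which needs extra headroom on top:

```go
package main

import "fmt"

// resources holds request totals in millicores and MiB.
type resources struct {
	cpuMilli int
	memMi    int
}

// fits returns how many per-Pod requests fit in the remaining headroom.
func fits(quota, used, perPod int) int {
	headroom := quota - used
	if headroom <= 0 || perPod <= 0 {
		return 0
	}
	return headroom / perPod
}

// maxReplicasWithinQuota returns the tightest dimension: quota admission
// rejects a Pod as soon as any single capped total would be exceeded.
func maxReplicasWithinQuota(quota, usedByOthers, perPod resources) int {
	byCPU := fits(quota.cpuMilli, usedByOthers.cpuMilli, perPod.cpuMilli)
	byMem := fits(quota.memMi, usedByOthers.memMi, perPod.memMi)
	if byMem < byCPU {
		return byMem
	}
	return byCPU
}

func main() {
	quota := resources{cpuMilli: 4000, memMi: 8192}  // hypothetical quota
	others := resources{cpuMilli: 2500, memMi: 6144} // everything else in the namespace
	catalogPod := resources{cpuMilli: 250, memMi: 256}
	fmt.Println(maxReplicasWithinQuota(quota, others, catalogPod)) // CPU-bound: 6
}
```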
Quick Reference¶
kubectl get hpa -n <NS> -w # TARGETS = cur/target per metric
kubectl describe hpa <H> -n <NS> # events: why it (didn't) scale
kubectl get scaledobject -n <NS> # KEDA objects
kubectl get hpa -n <NS> | grep keda-hpa- # the HPA KEDA created
kubectl top pods -n <NS> # the metric source (metrics-server)
kubectl get deploy <D> -n <NS> -o jsonpath='{.spec.replicas}' # current count
Minimal HPA + KEDA ScaledObject skeleton:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: <H>, namespace: <NS> }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: <D> }
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: <SO>, namespace: <NS> }
spec:
  scaleTargetRef: { name: <WORKER> }   # NOT the HPA's <D>: KEDA makes its own HPA (kind defaults to Deployment)
  minReplicaCount: 0                   # KEDA-only: scale to zero
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata: { protocol: amqp, queueName: <q>, mode: QueueLength, value: "20" }
      authenticationRef: { name: <TRIGGERAUTH> }
Checklist:
- HPA desired = ceil(cur × curMetric/target); multi-metric → max
- minReplicas ≥ 2 for user-facing; behavior scales down slowly
- HPA and VPA never on the same resource metric
- Queue/event consumers scaled by backlog via KEDA (scale-to-zero)
- maxReplicas bounded by the namespace ResourceQuota
- metrics-server (Resource) / adapter / KEDA metrics API installed & monitored
- A node autoscaler (Cluster Autoscaler / Karpenter) for the Pod growth
Test your understanding¶
Try each before opening the answer drawer. The act of trying is the exercise; the answer is the check.
- Catalog is running 4 replicas, and CPU utilization averages 80% across them. The HPA target is 70%. Without doing any math, will it scale up or down, and to what? Now do the math.
Show answer
Scale **up**. `desired = ceil(4 × 80/70) = ceil(4.57) = 5`. The 10% tolerance check: is `|80/70 - 1| = 0.143` > 0.10? Yes, so scaling proceeds. Memorising `ceil(currentReplicas × currentMetric / targetMetric)` lets you predict every HPA decision. If HPA describe events don't match this formula, a different metric is driving it (multi-metric → max wins), or stabilization or a scale policy is holding the change back.
- An engineer adds VPA in Auto mode and HPA on CPU to the same Deployment. A week later replica counts oscillate wildly and Pods are constantly evicted. What's happening?
Show answer
The HPA-VPA conflict on the **same** resource metric (CPU). VPA right-sizes the requests. That changes the *measured* utilization (usage ÷ request) the HPA compares against its target, which changes the HPA's desired replicas. More or fewer replicas change the per-replica load, which changes VPA's recommendation. They fight. Solution: either run VPA in **`Off` (recommender)** mode and let humans approve resizes, *or* run HPA on a *different* metric (RPS, queue depth) so VPA owns CPU sizing and HPA owns scale-out. See §How it works under the hood, "Why HPA + VPA on the same metric conflict".
Payments-worker reads from a RabbitMQ queue. At 3am the queue is empty. Why is HPA on CPU the wrong scaler, and what does KEDA do differently?
Show answer
An idle consumer at 3am uses ~0 CPU regardless of queue depth. HPA on CPU sees "low utilization" and can scale down to `minReplicas: 1`, but it cannot scale **to zero**: an HPA's minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled. KEDA exposes the queue depth as an external metric *and* manages the HPA, handling the scale-from-zero / scale-to-zero transitions itself (which an HPA alone cannot do). KEDA is the right tool for queue/event-driven work; HPA on CPU watches the wrong signal for it.
- Hands-on extension: make KEDA scale to zero. Apply a ScaledObject with minReplicaCount: 0 for payments-worker on RabbitMQ. Drain the queue. Watch kubectl get deploy payments-worker -w. Now publish a few messages (more than the trigger's activationValue). Time how long it takes for the first replica to be Ready.
What you should see
After cooldownPeriod of idle (60s in the chapter's 83-; KEDA's default is 300s), replicas drop to 0. KEDA itself scales the Deployment to zero, because the managed HPA cannot go below 1. When the new messages arrive, KEDA detects the queue depth on its next poll and scales to 1. You then watch a Pod transition Pending → ContainerCreating → Running → Ready, typically in 10-30s. That delay is the **cold-start cost** of scale-to-zero; size your `pollingInterval` (default 30s; 15s in 83-) and `cooldownPeriod` against your SLO. For latency-sensitive paths, `minReplicaCount: 1` is the right trade.
HPA at maxReplicas=50, but the cluster only has nodes for 30 catalog pods. What happens when traffic spikes, and what other component must be in play?
Show answer
HPA scales the *replica count* up to 50, but the scheduler can only place 30 — the other 20 stay `Pending` with `FailedScheduling: 0/X nodes available`. That's where the **node autoscaler** (Cluster Autoscaler or Karpenter) comes in: it watches Pending Pods and provisions more nodes. HPA scales Pods; Karpenter/CA scales nodes; you need both. Without a node autoscaler the HPA's `maxReplicas` is whatever the cluster's current capacity allows — and the cluster never grows on its own.
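A rough Go estimate of how many extra nodes the 20 Pending Pods in the last answer would need. It assumes identical Pods and a single node shape, with hypothetical numbers for requests and node allocatable. Real node autoscalers go further: Cluster Autoscaler simulates scheduling, Karpenter bin-packs across instance types, and both account for DaemonSets and scheduling constraints:

```go
package main

import "fmt"

// nodesNeeded estimates how many identical nodes hold `pending` identical
// Pods, limited by whichever resource (CPU or memory) runs out first.
// Returns -1 if one Pod is larger than a whole node.
func nodesNeeded(pending, podCPUMilli, podMemMi, nodeCPUMilli, nodeMemMi int) int {
	if podCPUMilli <= 0 || podMemMi <= 0 {
		return 0
	}
	perNode := nodeCPUMilli / podCPUMilli
	if byMem := nodeMemMi / podMemMi; byMem < perNode {
		perNode = byMem
	}
	if perNode == 0 {
		return -1
	}
	return (pending + perNode - 1) / perNode // integer ceil
}

func main() {
	// Hypothetical: 250m/256Mi per catalog Pod, 3800m/14000Mi allocatable per node.
	fmt.Println(nodesNeeded(20, 250, 256, 3800, 14000))
}
```

With these numbers, CPU limits each node to 15 Pods, so 20 Pending Pods need 2 more nodes.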
Further reading¶
- Ibryam & Huß, Kubernetes Patterns 2e, Elastic Scale (ch.29) (HPA/VPA/cluster autoscaling as one elasticity pattern; horizontal vs vertical trade-offs).
- Rosso et al., Production Kubernetes, ch.13, Autoscaling (HPA/VPA/cluster autoscaler in a production platform, metrics pipelines, pitfalls).
- Official: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ (and the HPA algorithm: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details), and KEDA: https://keda.sh/docs/latest/concepts/scaling-deployments/.