03 — API Priority and Fairness¶
Why one bad client or a hot-looping controller can starve the apiserver for everyone (and how the apiserver protects its own availability): API Priority & Fairness. FlowSchema (match by user/SA/group + verb/resource/namespace, matchingPrecedence, distinguisherMethod) → PriorityLevelConfiguration (Limited vs Exempt, nominalConcurrencyShares, queuing: queues/handSize/queueLengthLimit, borrowingLimitPercent/lendablePercent); the built-in flow schemas (system, leader-election, workload, catch-all) and where a tenant's bulk LIST/WATCH lands; the APF metrics (apiserver_flowcontrol_request_concurrency_limit, _rejected_requests_total, _current_inqueue_requests, _dispatched_requests_total) + dashboards; the interaction with client-side rate limiting (--kube-api-qps) and the request-priority headers; tuning and a noisy-neighbor scenario. All of it is applied by classifying the ch.02 BookstoreTenant operator's apiserver traffic into a bounded custom PriorityLevel + FlowSchema so a buggy reconcile can't take down the control plane.
Estimated time: ~45 min read · ~60 min hands-on
Prerequisites: Part 00 ch.04 — apiserver request path APF protects · Part 11 ch.02 — operator whose traffic this chapter classifies
You'll know after this: explain how FlowSchema + PriorityLevelConfiguration shape apiserver concurrency · classify a tenant's noisy LIST/WATCH into a bounded PriorityLevel · read APF metrics (rejected / inqueue / dispatched) and connect them to client behavior · tune --kube-api-qps / --kube-api-burst against APF for a controller · diagnose a noisy-neighbor scenario starving the catch-all level
Why this exists¶
ch.02 shipped a controller. Its production note
warned: "bounded reconcile … never an unbounded hot loop — see ch.03". This
is that chapter. Part 00
ch.04 established that the
API server is the only door — every kubectl, kubelet, and controller goes
through it, and it is horizontally scalable but finite. Part 08
ch.03 and the Part 06
observability chapters
treated "the apiserver is slow / timing out" as a symptom. This chapter is the
mechanism that prevents it and the control surface that diagnoses it.
The failure mode is concrete and common:
- The hot-loop controller. A reconcile bug (no rate limiter, an error that re-enqueues instantly, a watch that re-LISTs every event) makes one controller issue thousands of LIST pods per second. Without APF those requests share the apiserver's inflight budget with everything else (kubectl, the scheduler, kubelets renewing leases) and the whole control plane goes sluggish or starts timing out. A bug in your operator (ch.02) becomes a cluster-wide outage.
- The noisy tenant. A tenant's CI sprays LIST secrets --all-namespaces or a misconfigured client sets --kube-api-qps=1000. One identity consumes the shared budget; well-behaved controllers starve.
- The thundering herd after a blip. A network hiccup ends; 500 controllers re-LIST simultaneously. Without fair queueing the apiserver is buried by its own clients reconnecting.
Pre-1.20 the apiserver had only two crude global knobs
(--max-requests-inflight, --max-mutating-requests-inflight) — one slow/
abusive client could exhaust them and there was no isolation. API
Priority & Fairness (APF) replaced that: it classifies every request into a
priority level with its own concurrency share and fair queue, so a
flood in one level cannot starve another. It is on by default and GA since
v1.29 (flowcontrol.apiserver.k8s.io/v1). Operating a cluster — especially
one running custom controllers — without understanding APF means you cannot
explain a 429, cannot protect the control plane from your own operator, and
cannot tune for a multi-tenant or controller-heavy workload. The reference is
Production Kubernetes (control-plane availability / multitenancy) and the
official APF documentation.
Mental model¶
Every request is sorted (by who + what) into exactly one queue with a bounded concurrency share. A flood only fills its own queue and burns its own share; other queues — and the rest of the apiserver — are unaffected. APF is QoS for the API server, the way requests/limits are QoS for nodes.
- Two objects, both built-in (no CRD). A FlowSchema is the classifier: rules matching the request's subject (user / group / ServiceAccount) and resource/verb/namespace. Among the FlowSchemas that match, the one with the highest precedence (the lowest matchingPrecedence number) wins and names a PriorityLevelConfiguration, which is the budget + queue. They are flowcontrol.apiserver.k8s.io/v1 core kinds, so they dry-run/apply clean on any v1.30+ cluster with nothing installed (the inverse of the CRD-backed files, stated explicitly).
- PriorityLevel: Limited vs Exempt. Exempt = never queued, never throttled (reserved for must-never-be-blocked traffic like system:masters and health probes; abuse it and you've recreated the old no-isolation problem). Limited = a bounded slice of the apiserver's total concurrency, set by nominalConcurrencyShares (a weight, not an absolute: each level's concurrency = total × its shares / Σ shares). Even leader-election is a Limited level with its own share, not Exempt. Limited levels can lend idle concurrency (lendablePercent) and borrow spare capacity (borrowingLimitPercent) so the floor is guaranteed but idle capacity isn't wasted.
- Fair queueing inside a level: shuffle-sharding. A Limited level with limitResponse.type: Queue has `queues` queues; each request is hashed (by the FlowSchema's distinguisherMethod, ByUser or ByNamespace) onto handSize of them and enqueued on the shortest. This shuffle-sharding makes it statistically very unlikely that one bad flow (one user, one namespace) collides with a good one on all its queues, so a noisy tenant is isolated within the level, not just across levels. Past queueLengthLimit, excess requests get 429 Too Many Requests (with a Retry-After); limitResponse.type: Reject skips queueing and 429s immediately.
- The built-in flow schemas are the defaults you live with. Every cluster ships FlowSchemas/levels: exempt (system:masters), system (system:nodes, i.e. kubelets), leader-election, workload-leader-election, kube-controller-manager/kube-scheduler, workload-high, workload-low, global-default, and the final catch-all (precedence 10000, the lowest-priority bucket for anything unmatched). A custom controller's ServiceAccount, unclassified, matches the built-in service-accounts FlowSchema and falls into workload-low, sharing that level with every other unclassified ServiceAccount. Giving it its own bounded level is the protection this chapter builds.
- Server-side APF vs client-side rate limiting are complementary. --kube-api-qps/--kube-api-burst (and a controller's client-go rate limiter) throttle a client before it sends: politeness, easily misconfigured or bypassed. APF throttles at the apiserver: enforcement the client cannot evade. You want both: a sane client limiter and an APF level so a client that ignores its limiter (or a buggy one) still can't starve the control plane.
The trap to keep in view: APF protects the apiserver's availability, not your controller's correctness. If your reconcile hot-loops, APF stops it from taking down the cluster — but your controller is now being throttled/429'd and is not converging. APF buys the blast-radius containment that turns "fix it now, cluster is down" into "fix the bug, the rest of the cluster is fine". The real fix is still a rate-limited, level-triggered, idempotent loop (ch.02); APF is the safety net, not the cure.
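The classification rule from the mental model (lowest matchingPrecedence number wins, ties go to the name that sorts first) can be sketched as a small shell filter. This is only an illustration: `pick_flowschema` is a hypothetical helper, and it assumes its input already contains just the FlowSchemas whose rules match the request, one per line as "precedence name". The real matching against subjects and resource rules happens inside the apiserver.

```shell
# pick_flowschema: read the FlowSchemas that ALREADY match a request from
# stdin, one per line as "<matchingPrecedence> <name>", and print the winner.
# Lowest precedence number wins; on a tie the name that sorts first wins.
pick_flowschema() {
  LC_ALL=C sort -k1,1n -k2,2 | head -n 1 | awk '{ print $2 }'
}

# Example: the operator's ServiceAccount matches three schemas at once.
# printf '%s\n' '9000 service-accounts' '500 bookstore-operator' '10000 catch-all' \
#   | pick_flowschema
```

Fed the three candidates in the comment, the filter prints bookstore-operator: the custom schema at precedence 500 is evaluated ahead of service-accounts (9000) and catch-all (10000).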
Diagrams¶
Diagram A — request → FlowSchema match → PriorityLevel queue → handler or 429 (Mermaid)¶
flowchart TB
req["API request
(subject = user/group/SA,
verb + resource + namespace)"]
subgraph fs["FlowSchema matching (by matchingPrecedence; lowest number wins)"]
f1["exempt (P 1)
system:masters"]
f2["leader-election / system (P ~100-200)
kubelets, lease renew"]
fb["bookstore-operator (P 500)
SA bookstore-operator-controller-manager
distinguisher: ByNamespace"]
f9["service-accounts (P 9000) -> workload-low
global-default (P 9900)"]
fc["catch-all (P 10000)
everything unmatched"]
end
subgraph pl["PriorityLevelConfiguration (budget + fair queue)"]
pe["Exempt
never queued, never throttled"]
pb["bookstore-operator (Limited)
nominalConcurrencyShares: 10
queues 16, handSize 4, qLenLimit 25
lendable 50% / borrow 20%"]
pw["workload-low / global-default (Limited)
shared small share"]
ps["system / leader-election (Limited)
own protected share"]
end
handler["apiserver handler
-> authZ/admission/etcd
(Part 00 ch.04 pipeline)"]
r429["429 Too Many Requests
(+ Retry-After): flow over its
queue limit; OTHER levels unaffected"]
req --> fs
f1 --> pe
f2 --> ps
ps --> handler
fb --> pb
f9 --> pw
fc --> pw
pe --> handler
pb -->|"within share / queue"| handler
pb -->|"queue full for THIS flow"| r429
pw -->|"within share"| handler
pw -->|"share exhausted"| r429
Diagram B — the default flow schemas and where the operator lands (ASCII)¶
APF: DEFAULT FLOW SCHEMAS (by precedence) + WHERE THE OPERATOR GOES ───────
precedence FlowSchema -> PriorityLevel who
---------- ------------------------- ----------------- ----------------
1 exempt -> exempt (Exempt) system:masters
~100 system-nodes (system) -> system (Limited*) kubelets
~200 *-leader-election -> leader-election ctrl-mgr/sched lease
~800 kube-system SAs -> workload-high core system SAs
╔════════════════════════════════════════════════════════════════════════╗
║ 500 bookstore-operator ───────-> bookstore-operator ◄═ THIS FILE ║
║ (SA bookstore-operator-controller-manager; (Limited, ║
║ verbs get/list/watch/...; bookstoretenants + bounded, ║
║ ns/cm/svc/deploy) shares 10, ByNamespace) own queue)║
╚════════════════════════════════════════════════════════════════════════╝
9000 service-accounts -> workload-low ◄ where the
9900 global-default -> global-default operator's SA
10000 catch-all -> catch-all (lowest) WOULD land if
NOT classified
(values from `kubectl get flowschema` in step 0 — service-accounts is P9000
-> workload-low, global-default P9900, catch-all P10000)
Unclassified, the BookstoreTenant operator's SA matches the built-in
`service-accounts` FlowSchema (P9000) -> the shared **workload-low** level,
competing with every other SA. A hot-loop bug there throttles that shared
bucket -> collateral damage to OTHER controllers in it. Pinning it to its
OWN precedence-500 Limited level (shares 10, queued, ByNamespace) — far
ahead of P9000 — means a buggy reconcile drains ONLY its own queue & share:
it gets 429'd, the rest of the control plane (kubelets, scheduler, other
controllers, kubectl) is untouched. APF = QoS for the apiserver
(cf. requests/limits = QoS for nodes, Part 01 ch.03).
*system level is Limited but very high-share; conceptually protected.
Hands-on with the Bookstore¶
Assumed working directory: the guide repo root (full-guide/). This
chapter adds the new
examples/bookstore/operator/config/apf/bookstore-operator-apf.yaml
(the operator tree shared with ch.01/ch.02).
It does not modify any canonical Bookstore manifest, Helm chart, Kustomize
overlay, or other examples/bookstore/** file — purely additive.
We will: (0) see the built-in APF state on any cluster; (1) read the operator's custom FlowSchema + PriorityLevel and dry-run it (built-in → clean); (2) apply it and prove the operator's traffic now lands in its level; (3) read the APF metrics; (4) a noisy-neighbor scenario showing isolation.
0. The APF state every cluster already has (no setup)¶
APF is on by default. Inspect the built-in schemas/levels — these exist on a vanilla kind cluster with nothing installed:
kind create cluster --name bookstore
kubectl get flowschema # ~13 built-ins, sorted by matchingPrecedence
# NAME PRIORITYLEVEL MATCHINGPRECEDENCE ...
# exempt exempt 1
# probes exempt 2
# system-leader-election leader-election 100
# workload-leader-election leader-election 200
# ...
# service-accounts workload-low 9000
# global-default global-default 9900
# catch-all catch-all 10000
kubectl get prioritylevelconfiguration
# NAME TYPE NOMINALCONCURRENCYSHARES QUEUES ...
# exempt Exempt
# leader-election Limited 10 16
# workload-high Limited 40 128
# workload-low Limited 100 128
# global-default Limited 20 128
# catch-all Limited 5 -
Note workload-low/global-default (shares 100/20) are where an
unclassified controller's ServiceAccount lands — sharing that bucket with
every other unclassified client (Diagram B).
1. The operator's custom FlowSchema + PriorityLevel (built-in → dry-run clean)¶
config/apf/bookstore-operator-apf.yaml
gives the ch.02 operator's ServiceAccount its
own bounded level. Because FlowSchema/PriorityLevelConfiguration are
built-in flowcontrol.apiserver.k8s.io/v1 kinds, this dry-runs clean
with nothing installed — the explicit inverse of the CRD/webhook-intrinsic
files (ch.01/ch.02);
the file header states this:
kubectl apply --dry-run=client -f examples/bookstore/operator/config/apf/bookstore-operator-apf.yaml
# prioritylevelconfiguration.flowcontrol.apiserver.k8s.io/bookstore-operator created (dry run)
# flowschema.flowcontrol.apiserver.k8s.io/bookstore-operator created (dry run)
# (NO "no matches for kind" — built-in; no operator/CRD needed for THIS file)
The two objects and why each field is exactly this:
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration # the budget + queue
metadata: { name: bookstore-operator }
spec:
  type: Limited # NOT Exempt: a controller must be bounded
  limited:
    nominalConcurrencyShares: 10 # small WEIGHT: never crowd out workload ctrls
    lendablePercent: 50 # lend up to half its idle share to others
    borrowingLimitPercent: 20 # may borrow up to 20% of nominal when others idle
    limitResponse:
      type: Queue # queue the burst rather than instant-429
      queuing: { queues: 16, handSize: 4, queueLengthLimit: 25 } # shuffle-shard
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema # the classifier
metadata: { name: bookstore-operator }
spec:
  matchingPrecedence: 500 # BEFORE workload-low/global-default/catch-all
  priorityLevelConfiguration: { name: bookstore-operator }
  distinguisherMethod: { type: ByNamespace } # fair across tenant namespaces
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount: { name: bookstore-operator-controller-manager, namespace: bookstore-operator-system }
      resourceRules:
        - verbs: ["get","list","watch","create","update","patch","delete"]
          apiGroups: ["bookstore.example.com","","apps"]
          resources: ["bookstoretenants","bookstoretenants/status","namespaces","configmaps","services","deployments"]
          namespaces: ["*"] # every request that names a namespace
          # Without clusterScope only namespaced requests match; cluster-wide
          # LIST/WATCH and the cluster-scoped namespaces resource need it.
          clusterScope: true
- type: Limited with a small nominalConcurrencyShares. A controller must be bounded (never Exempt), and a teaching operator should get a small weight so it can never crowd out the real workload controllers even at full tilt. Concurrency is total × 10 / Σ all shares.
- Queue with queues/handSize/queueLengthLimit. Burst tolerance with shuffle-sharding (isolates one tenant's churn from another inside this level); past the queue limit it's a clean 429 + Retry-After (which client-go honours and backs off on) rather than dropped work.
- lendablePercent/borrowingLimitPercent. Guarantee the operator's floor but don't waste idle capacity: it lends when quiet, borrows a little when others are quiet.
- matchingPrecedence: 500. Below system/leader-election (~100–200) so it can never outrank the control plane's own traffic, but far above the service-accounts FlowSchema (P9000 → workload-low), global-default (P9900) and catch-all (P10000), so the operator's requests are classified here instead of falling into the shared workload-low bucket.
- distinguisherMethod: ByNamespace. The operator manages many tenant namespaces; sharding by namespace keeps one tenant's reconcile churn from starving another within the operator's own level.
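To make "total × 10 / Σ all shares" concrete, here is a worked calculation as a shell function. It is a sketch under stated assumptions, not the apiserver's code: `apf_seats` is a hypothetical helper, the rounding (ceil for the nominal limit, round for the lendable and borrowable parts) follows my reading of the PriorityLevelConfiguration API reference, and the example inputs assume a server concurrency of 600 (the default --max-requests-inflight 400 plus --max-mutating-requests-inflight 200) and a share sum of 255 (an assumed 245 from the suggested built-in levels plus this level's 10). Check both numbers on your own cluster.

```shell
# apf_seats SERVER_CL SHARES SUM_SHARES LENDABLE_PCT BORROW_PCT
# Prints the level's nominal seats plus the lower and upper bounds that
# lending and borrowing allow.
apf_seats() {
  awk -v total="$1" -v shares="$2" -v sum="$3" -v lend="$4" -v borrow="$5" 'BEGIN {
    exact = total * shares / sum
    nominal = int(exact); if (nominal < exact) nominal++   # ceil
    lendable = int(nominal * lend / 100 + 0.5)             # round to nearest
    borrowable = int(nominal * borrow / 100 + 0.5)         # round to nearest
    printf "nominal=%d min=%d max=%d\n", nominal, nominal - lendable, nominal + borrowable
  }'
}

# Example with the assumed inputs described above:
# apf_seats 600 10 255 50 20
```

With those inputs the operator's level gets 24 nominal seats (600 × 10 / 255 ≈ 23.5, rounded up), can be squeezed to 12 while it lends, and can grow to 29 while it borrows. Adding another level raises the share sum and shrinks every existing level's nominal value.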
2. Apply it and prove the operator's traffic lands in its level¶
# Deploy the operator (ch.02) so its ServiceAccount exists and is making
# apiserver calls, then apply the APF objects:
kubectl apply -k examples/bookstore/operator/config/default # (ch.02; needs the image loaded + cert-manager per ch.01)
kubectl apply -f examples/bookstore/operator/config/apf/bookstore-operator-apf.yaml
kubectl get flowschema bookstore-operator
# NAME PRIORITYLEVEL MATCHINGPRECEDENCE ...
# bookstore-operator bookstore-operator 500
Prove classification without guessing: ask the apiserver which FlowSchema + PriorityLevel a request made as the operator's ServiceAccount matched, via the APF response headers (kubectl prints response headers only at -v8 and above):
kubectl get --raw='/api/v1/namespaces/default/configmaps' \
--as=system:serviceaccount:bookstore-operator-system:bookstore-operator-controller-manager \
-v8 2>&1 | grep -i 'X-Kubernetes-PF'
# X-Kubernetes-PF-FlowSchema-UID: ... (bookstore-operator)
# X-Kubernetes-PF-PriorityLevel-UID: ... (bookstore-operator)
# -> the operator's requests are now classified into ITS bounded level, NOT
# the shared workload-low/global-default bucket.
(Every apiserver response carries X-Kubernetes-PF-FlowSchema-UID /
X-Kubernetes-PF-PriorityLevel-UID — the authoritative "which level did this
request use", far better than inferring.)
3. The APF metrics — the dashboard you watch in production¶
The apiserver exposes APF metrics at /metrics (the same endpoint Part 06
ch.01 scrapes with
Prometheus). The four to know:
kubectl get --raw=/metrics | grep -E '^apiserver_flowcontrol_(request_concurrency_limit|rejected_requests_total|current_inqueue_requests|dispatched_requests_total)' | grep bookstore-operator
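# apiserver_flowcontrol_request_concurrency_limit{priority_level="bookstore-operator"} <N>
# -> the level's NOMINAL seats, derived from its shares. It does not move with
# lending/borrowing; the live limit is apiserver_flowcontrol_current_limit_seats.
# (If this series is missing on your version, look for the equivalent
# apiserver_flowcontrol_nominal_limit_seats.)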
# apiserver_flowcontrol_dispatched_requests_total{priority_level="bookstore-operator"} <C>
# -> requests admitted (served) from this level. Throughput.
# apiserver_flowcontrol_current_inqueue_requests{priority_level="bookstore-operator"} <Q>
# -> requests WAITING in this level's queues right now. Sustained > 0 = the
# level is saturated (the operator is being throttled — investigate the loop).
# apiserver_flowcontrol_rejected_requests_total{priority_level="bookstore-operator",reason="queue-full"} <R>
# -> requests 429'd. Any nonzero, growing value = a hot loop or an
# undersized level. THIS is the alert.
Production rule of thumb (alerts, Part 06
ch.01): alert on
rate(apiserver_flowcontrol_rejected_requests_total[5m]) > 0 for any level you
own, and on apiserver_flowcontrol_current_inqueue_requests staying high — the
first means work is being dropped, the second means a level is saturated.
apiserver_flowcontrol_request_wait_duration_seconds (a histogram) shows how
long requests queue. Grafana's "API server (APF)" dashboards (kube-prometheus)
chart exactly these.
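Before the dashboards exist, the same "rejections per level" view can be pulled from the raw /metrics text. The function below is a sketch: `apf_rejected_by_level` is a hypothetical name, and it assumes the series carries a priority_level label alongside others such as flow_schema and reason, which it sums over.

```shell
# apf_rejected_by_level: read Prometheus text exposition on stdin and print
# "<priority_level> <total>" for apiserver_flowcontrol_rejected_requests_total,
# summed over every other label.
apf_rejected_by_level() {
  awk '
    /^apiserver_flowcontrol_rejected_requests_total[{]/ {
      if (match($0, /priority_level="[^"]*"/)) {
        # skip the 16-character prefix priority_level=" and the closing quote
        level = substr($0, RSTART + 16, RLENGTH - 17)
        total[level] += $NF      # the sample value is the last field
      }
    }
    END { for (l in total) printf "%s %d\n", l, total[l] }
  ' | LC_ALL=C sort
}

# Usage against a live cluster:
# kubectl get --raw=/metrics | apf_rejected_by_level
```

These are cumulative counters, so a single reading only says whether a level has ever rejected anything; the alert condition is about the rate of change between readings.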
4. Noisy-neighbor scenario — isolation, demonstrated¶
The point of APF is containment. Simulate a hostile bulk client and show it
cannot starve the operator's level (or the control plane). The flood runs
as a different identity — a plain ServiceAccount in default, which the
built-in service-accounts FlowSchema (P9000) classifies into the shared
workload-low level (NOT catch-all — catch-all is only for requests no
FlowSchema matches; an authenticated SA always matches service-accounts
first). Watch the operator's own level stay healthy while workload-low
absorbs and throttles the flood:
# A) A throwaway "noisy" identity hammering LISTs (a stand-in for a buggy
# controller / abusive CI). It is NOT the operator SA, so it does NOT use
# the bookstore-operator level — as a generic SA it matches the built-in
# `service-accounts` FlowSchema (P9000) -> the shared `workload-low` level.
kubectl create serviceaccount noisy -n default
kubectl create clusterrolebinding noisy-view --clusterrole=view --serviceaccount=default:noisy
TOKEN=$(kubectl create token noisy -n default)
APISERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
for i in $(seq 1 2000); do
curl -sk -o /dev/null -H "Authorization: Bearer $TOKEN" \
"$APISERVER/api/v1/configmaps?limit=500" &
done; wait
# B) MEANWHILE the operator's level is isolated — its rejected counter stays
# flat while the noisy traffic gets throttled in workload-low (its bucket):
kubectl get --raw=/metrics | grep 'apiserver_flowcontrol_rejected_requests_total' \
| grep -E 'priority_level="(bookstore-operator|workload-low|global-default|catch-all)"'
# ...{priority_level="workload-low"...} <RISING> ← the flood is contained HERE
# (where the noisy SA landed)
# ...{priority_level="bookstore-operator"...} 0 ← operator UNAFFECTED
kubectl get bookstoretenant -A # still reconciling fine; control plane responsive
The flood is absorbed and 429'd in its own level; the operator's
precedence-500 level — and the kubelets/scheduler/kubectl in theirs — keep
their guaranteed concurrency. That isolation is the entire value proposition.
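"Flat" versus "rising" is easier to show with two snapshots than by eyeballing one. A minimal sketch, assuming you saved /metrics to a file before and after the flood; `apf_grew` is a hypothetical helper that prints only the rejected-request series whose counter increased. It assumes label values contain no spaces, so the series is the first field and the value the second.

```shell
# apf_grew BEFORE_FILE AFTER_FILE: compare two saved /metrics snapshots and
# print each rejected-requests series that increased, with the increase.
apf_grew() {
  awk '
    !/^apiserver_flowcontrol_rejected_requests_total[{]/ { next }
    NR == FNR { before[$1] = $2; next }   # first file: remember every series
    ($2 - before[$1]) > 0 { printf "%s +%d\n", $1, $2 - before[$1] }
  ' "$1" "$2"
}

# Usage against a live cluster:
# kubectl get --raw=/metrics > before.txt
# ... run the flood from step A ...
# kubectl get --raw=/metrics > after.txt
# apf_grew before.txt after.txt
```

If isolation holds, any lines printed should carry priority_level="workload-low" and none should carry priority_level="bookstore-operator".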
Clean up:
kubectl delete clusterrolebinding noisy-view; kubectl delete serviceaccount noisy -n default
kubectl delete -f examples/bookstore/operator/config/apf/bookstore-operator-apf.yaml --ignore-not-found
kubectl delete -k examples/bookstore/operator/config/default --ignore-not-found
kind delete cluster --name bookstore
How it works under the hood¶
- Classification then fair dispatch, entirely inside the apiserver. On each request the apiserver evaluates FlowSchemas in matchingPrecedence order (lowest number first) and the first whose subjects + resourceRules/nonResourceRules match wins (ties broken by name), naming a PriorityLevelConfiguration. This happens very early, before the authZ/admission/etcd pipeline of Part 00 ch.04, so a request that will be queued never even reaches admission until it's dispatched. Each apiserver computes seats independently; with multiple apiservers each enforces its own share of the (per-apiserver) concurrency.
- nominalConcurrencyShares is a weight; the limit is derived and elastic. A Limited level's nominal concurrency = serverConcurrency × shares / Σ(shares of all Limited levels). lendablePercent lets an idle level donate part of that to busy levels; borrowingLimitPercent caps how far a busy level may exceed its nominal by borrowing. So a level has a guaranteed floor (its fair share when everyone is busy) but unused capacity is not wasted. The design goal is isolation without hard partitioning.
- Shuffle-sharding is why one bad flow doesn't sink good ones. Within a queued Limited level, the FlowSchema's distinguisherMethod (ByUser/ByNamespace, or none = one flow) yields a flow hash; the request is assigned to handSize queues chosen by that hash and enqueued on the shortest. Two distinct flows share all handSize queues only with low probability, so a flooding flow's requests pile into its queues and hit queueLengthLimit (→ 429) while a quiet flow's requests sail through other queues. It is the same statistical isolation idea as consistent hashing, applied to request fairness. handSize too large weakens isolation; too small underuses queues. The defaults are tuned; change them deliberately.
- Exempt is a loaded gun. Exempt requests are neither queued nor limited. The level is used for traffic that must never be blocked even under overload (system:masters, health probes). Leader election is deliberately not Exempt: it has its own small Limited level (shares 10 in step 0). Putting a busy controller in Exempt removes its bound and reintroduces the pre-1.20 "one client can starve the apiserver" failure, which is precisely why this chapter's operator level is Limited with a small share, never Exempt.
- APF vs the old inflight flags vs client-side limiting. APF replaced --max-requests-inflight/--max-mutating-requests-inflight as the unit of isolation (those still cap total server concurrency that APF then subdivides). --kube-api-qps/--kube-api-burst and a controller's client-go rate limiter act before the request leaves the client: cooperative and easily wrong (too high, or reset on restart). APF acts at the server and cannot be bypassed. They compose: the client limiter smooths normal load and is good citizenship; APF is the enforced backstop for when a client is buggy or hostile. The request-priority/fairness response headers (X-Kubernetes-PF-*) let a client observe which level/flow it used, which is invaluable for debugging "why am I being 429'd".
- What APF does and does not fix. It guarantees the apiserver stays available and fair under a flood: kubelets keep heartbeating, the scheduler keeps scheduling, kubectl keeps working, because each is in a protected level. It does not make a hot-looping controller correct: that controller is now 429'd and not converging, with _rejected_requests_total climbing in its level. APF converts "everything is down, fix it now" into "this one controller is throttled, its metrics scream, the rest is fine". That is containment; you still fix the loop (ch.02: rate-limited work-queue, idempotent, bounded RequeueAfter).
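The "low probability" in the shuffle-sharding bullet can be put into numbers. This is a back-of-the-envelope sketch, not the apiserver's dealer: it assumes every flow's hand is a uniformly random set of handSize distinct queues, which only approximates the hash-based assignment. Under that assumption two flows share all of their queues with probability 1 / C(queues, handSize); `hands` is a hypothetical helper that computes the binomial coefficient.

```shell
# hands QUEUES HANDSIZE: print C(QUEUES, HANDSIZE), the number of distinct
# hands. With uniformly dealt hands, two flows land on exactly the same
# queues with probability 1 / hands.
hands() {
  awk -v n="$1" -v k="$2" 'BEGIN {
    c = 1
    # multiplicative formula; c is an exact integer after every step
    for (i = 1; i <= k; i++) c = c * (n - k + i) / i
    printf "%d\n", c
  }'
}

# Example for this chapter's level (queues 16, handSize 4):
# hands 16 4
```

For queues 16 and handSize 4 there are 1820 possible hands, so under the uniform assumption a noisy flow and a quiet flow fully overlap about once in 1820 pairings. Partial overlap is far more common, which is why each request is enqueued on the shortest queue in its hand.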
Production notes¶
In production: give every first-party controller (and noisy tenant) its own bounded FlowSchema + PriorityLevel. An unclassified controller competes in workload-low/global-default and a bug there is collateral damage to every other controller in that bucket. A dedicated Limited level (precedence below system, above the shared buckets; modest nominalConcurrencyShares; Queue with shuffle-sharding) means a buggy reconcile is contained to itself: it gets 429'd, the control plane and other controllers are untouched. This is the ch.02 "bounded reconcile" promise, completed.

In production: alert on the APF metrics, don't discover 429s from users. Scrape apiserver_flowcontrol_* (Part 06 ch.01) and alert on rate(apiserver_flowcontrol_rejected_requests_total[5m]) > 0 per owned level, sustained apiserver_flowcontrol_current_inqueue_requests, and rising apiserver_flowcontrol_request_wait_duration_seconds. A rejected counter climbing in a controller's level is the earliest signal of a hot loop, well before the controller's own lag is visible. Use the kube-prometheus "API server" dashboards.

In production: never park busy traffic in Exempt, and tune shares/queues deliberately. Exempt removes the bound; it is for system:masters and probes only (leader election already has its own Limited level). An Exempt controller can starve the apiserver exactly as in pre-1.20. Size nominalConcurrencyShares to relative importance (and re-check after adding controllers: shares are relative, so adding a level dilutes everyone), keep handSize near the default (too large weakens isolation), and prefer Queue over Reject for controllers (they back off on 429; instant reject just busy-loops a naive client).

In production: set client-side limits too. APF is the backstop, not the only control. Configure controllers' client-go QPS/burst (and --kube-api-qps for kubelets/components) to sane values so normal load is smooth and polite; rely on APF to contain the misconfigured or buggy client that ignores its own limiter. Both layers, not one. (A controller that sets client-go QPS absurdly high is a classic self-inflicted apiserver-load incident that APF will then 429; fix the client and keep the APF level.)

In production (managed: EKS/GKE/AKS): the provider runs and may tune or restrict APF on the managed apiserver. You typically still create your own FlowSchema/PriorityLevelConfiguration (they're regular API objects), but the provider's defaults and the apiserver's total concurrency are theirs. Your operators' traffic counts against a control plane you don't size, so a dedicated bounded level for each controller is more important on managed clusters, and the APF metrics are often surfaced in the provider's control-plane monitoring. Use them.
Quick Reference¶
# Inspect APF (works on any cluster — APF is on by default, GA v1.29)
kubectl get flowschema --sort-by=.spec.matchingPrecedence
kubectl get prioritylevelconfiguration
kubectl get flowschema <FS> -o yaml # rules / subjects / distinguisher
kubectl get --raw=/debug/api_priority_and_fairness/dump_priority_levels # live seats/queues
kubectl get --raw=/debug/api_priority_and_fairness/dump_requests # in-flight by flow
# Which level did a request use? (authoritative — response headers)
kubectl get --raw=/api/v1/namespaces/default/configmaps \
--as=system:serviceaccount:<NS>:<SA> -v8 2>&1 | grep -i 'X-Kubernetes-PF' # response headers need -v8
# The four metrics to alert on (scrape via Prometheus — Part 06 ch.01)
kubectl get --raw=/metrics | grep -E '^apiserver_flowcontrol_(request_concurrency_limit|dispatched_requests_total|current_inqueue_requests|rejected_requests_total)'
# Apply the operator's bounded level (built-in objects — dry-run/apply CLEAN)
kubectl apply -f examples/bookstore/operator/config/apf/bookstore-operator-apf.yaml
Minimal FlowSchema + PriorityLevel skeleton (full file in
examples/bookstore/operator/config/apf/):
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata: { name: my-controller }
spec:
  type: Limited # NEVER Exempt for a controller
  limited:
    nominalConcurrencyShares: 10 # relative weight (re-check when adding levels)
    lendablePercent: 50
    borrowingLimitPercent: 20
    limitResponse:
      type: Queue # Queue (back-pressure) > Reject for controllers
      queuing: { queues: 16, handSize: 4, queueLengthLimit: 25 }
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata: { name: my-controller }
spec:
  matchingPrecedence: 500 # < system (~100-200), > workload-low/catch-all
  priorityLevelConfiguration: { name: my-controller }
  distinguisherMethod: { type: ByNamespace } # or ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount: { name: <SA>, namespace: <NS> }
      resourceRules:
        - verbs: ["get","list","watch","create","update","patch","delete"]
          apiGroups: ["<GROUP>",""]
          resources: ["<CRD-PLURAL>","configmaps"]
          namespaces: ["*"] # every request that names a namespace
          clusterScope: true # also cluster-wide LIST/WATCH and cluster-scoped resources
Checklist:
- Understand APF = classify (FlowSchema, by subject+resource, matchingPrecedence) → bound + fair-queue (PriorityLevel, Limited vs Exempt, shares, shuffle-sharding): QoS for the apiserver
- Every first-party controller has its own bounded Limited level (precedence below system, above shared buckets), not the default bucket
- Never Exempt for busy traffic; Queue (not Reject) for controllers; nominalConcurrencyShares sized and re-checked when adding levels (shares are relative)
- APF metrics scraped + alerted: rejected_requests_total rate > 0, sustained current_inqueue_requests, rising request_wait_duration
- Client-side limits (--kube-api-qps, client-go rate limiter) set and APF as the non-bypassable backstop: both layers
- FlowSchema/PriorityLevelConfiguration are built-in (flowcontrol.apiserver.k8s.io/v1): dry-run/apply clean, no CRD (the inverse of ch.01/ch.02)
- APF contains the blast radius; the real fix for a hot loop is still a rate-limited, idempotent, level-triggered controller (ch.02)
Test your understanding¶
Try each before opening the answer drawer. The act of trying is the exercise; the answer is the check.
- What does APF buy you that --max-requests-inflight does not?
Show answer
`--max-requests-inflight` is a single global counter: one noisy client can consume the whole budget and starve everyone else. APF partitions the inflight budget into priority levels with fair queuing within each level, classified by FlowSchema. A bulk LIST from one tenant lands in the workload-low queue; leader-election traffic lands in the leader-election level; a misbehaving controller lands in its own level. Each level has guaranteed concurrency shares and bounded queue lengths, so one flooded level can't starve another. APF is about *isolation*, not just throttling.
- Your custom operator is hot-looping; apiserver_flowcontrol_rejected_requests_total is rising. Walk through the diagnosis steps.
Show answer
First, `kubectl get --raw /metrics | grep flowcontrol_rejected_requests` and group by `priority_level` and `flow_schema`: which FlowSchema is matching the offending requests, and which level is rejecting? Second, look at `apiserver_flowcontrol_current_inqueue_requests` for that level: queues are growing because the level is at its concurrency limit. Third, identify the client by `User-Agent` (audit log) or the `/debug/api_priority_and_fairness/dump_requests` debug endpoint. Fourth, decide: classify the offender into its own FlowSchema with a tighter level (containment), and ALSO fix the controller (rate-limit, level-trigger, observedGeneration guard; see [ch.02](02-operator-development.md)). APF buys you breathing room, not a fix.
- A teammate sets --kube-api-qps=1000 on their controller "to make it fast." What does APF do to this controller?
Show answer
Client-side `--kube-api-qps` lets the controller fire 1000 requests/second toward the apiserver. APF classifies each request to a FlowSchema → PriorityLevel; once that level's concurrency is full, additional requests go to the queue (bounded by `queueLengthLimit`), and when the queue overflows, requests are rejected with `429 Too Many Requests`. The controller's client-go rate limiter then backs off. Outcome: the controller doesn't crash the apiserver, but it does waste its own CPU spinning and burns its retry budget. The right pattern is sane client-side limits (`--kube-api-qps=20-50` for most controllers) and APF as the non-bypassable backstop.
- Hands-on: deploy a tight loop that runs kubectl get pods --all-namespaces -w in 50 parallel processes. Watch apiserver_flowcontrol_current_inqueue_requests and apiserver_request_duration_seconds. What FlowSchema absorbs the load?
What you should see
The requests land in `workload-low` or `global-default` by default (because they are authenticated user requests not matched by a more specific schema). The queue depth grows, request latency climbs, and once the level's queue is full, you see 429 responses. Meanwhile system traffic (kubelets in `system`, lease renewals in `leader-election`) is unaffected: each has a dedicated FlowSchema/PriorityLevel that keeps the control plane operating. This is the isolation property in action.
Further reading¶
- Rosso et al., Production Kubernetes, ch.1 ("A Path to Production") & ch.12 ("Multitenancy"): protecting control-plane availability and isolating tenants/clients at the API server — the operational frame for why APF exists and how it fits a multi-tenant, controller-heavy platform.
- Ibryam & Huß, Kubernetes Patterns 2e — Elastic Scale (ch.29) & Predictable Demands (ch.2): bounding and declaring resource demands as a pattern — APF is that discipline applied to the API server's own concurrency (the node-QoS analogy of Part 01 ch.03).
- Official: API Priority and Fairness, https://kubernetes.io/docs/concepts/cluster-administration/flow-control/ (FlowSchema/PriorityLevelConfiguration, the built-in schemas, shuffle-sharding, and the apiserver_flowcontrol_* metrics + debug endpoints).