06 - Model serving and inference
The serving problem (low-latency request/response over HTTP/gRPC, autoscaling on request load rather than CPU, scale-to-zero, model versioning, A/B + canary at the model level); KServe (InferenceService + ServingRuntime: serverless via Knative or RawDeployment, predictor/transformer/explainer, storageUri, built-in runtimes for sklearn / pytorch / tensorflow / triton / huggingface) installed via pinned Helm in its own namespace; Seldon Core (graph-of-components serving) and NVIDIA Triton (multi-framework, multi-model, ensemble) at the right depth, with a "when which" matrix; autoscaling for serving: Knative request-driven scale-to-zero vs HPA on CPU vs KEDA on custom RPS/latency (both deepening Part 06 ch.04); GPU serving (Triton / TGI / vLLM on GPU; MIG for multi-tenant inference, which ties to ch.02); model canary/A-B/shadow at the InferenceService level vs Part 07 ch.05 Argo Rollouts (contrast: model canary is request-driven via KServe, not replica-driven); ties to Part 07 ch.04 GitOps so model rollouts are PRs. Applied by installing KServe (Knative Serving + cert-manager) into its own namespace and serving the trained recommendations model both as a KServe InferenceService AND as a plain Deployment + Service, in examples/bookstore/ml/serve/, with the catalog/storefront integration described.
Estimated time: ~60 min read · ~120 min hands-on
Prerequisites: Part 12 ch.04 — model artifact this chapter serves · Part 06 ch.04 — autoscaling primitives Knative/HPA/KEDA build on · Part 07 ch.05 — replica-driven canary contrasted with request-driven KServe canary
You'll know after this: • author a KServe InferenceService with a built-in ServingRuntime + storageUri · • compare KServe / Seldon / Triton for a given multi-framework workload · • configure scale-to-zero serving via Knative + a fall-back HPA/KEDA path · • design model A/B + canary at the InferenceService level · • serve a GPU model with Triton/TGI/vLLM and reason about MIG for multi-tenant inference
Why this exists
This is the serve side of the dev → train → serve loop: notebooks
(./05-notebooks-and-interactive.md)
explore, training (./04-distributed-training.md)
produces model.joblib, and this chapter exposes it.
ch.04 ended with a model.joblib on a PVC.
That is not a product. Serving is the bridge from "we trained a thing"
to "the Bookstore's catalog and storefront make a request and get
recommendations back fast enough that the page does not stutter". The
serving workload has three properties no other Bookstore workload combines:
- Latency-sensitive — the recommendation lands inside a user-facing request budget (milliseconds). Part 06 ch.04 gave us HPA on CPU, KEDA on event backlog — neither is the right target metric for inference. The right target is requests-in-flight / p95 latency.
- Load-spiky — a recommendation endpoint follows the user-traffic curve, not a steady RPS. Idle for hours, then a spike. Scale-to-zero is the cost lever — no replicas while no requests are coming in.
- Versioned-not-mutated — "Roll the model back to yesterday's v6" should be a PR, not a hot-edit. Models are artifacts with versions; the serving layer is a router over those versions; canary / A/B / shadow is traffic splitting at the model boundary, not a replica-rollout (which is what Part 07 ch.05 Argo Rollouts does for app code).
The Kubernetes-native answer to all three is KServe (InferenceService +
ServingRuntime): a CRD that is a model endpoint, autoscaled by
requests (Knative serverless mode), with versioned storageUris and
declarative traffic splits. It is to inference what a Deployment+Service+HPA
is to a stateless app — a level above the primitives that knows about
models. Alternatives: Seldon Core (richer graph-of-components,
explainers, drift detectors), NVIDIA Triton (high-throughput
multi-framework runtime), and — for the recommender's tiny CPU case — a
plain Deployment + Service + HPA, which the chapter also ships so the
serving path actually runs on kind without KServe installed.
Mental model
Serving = a versioned model artifact + a runtime that loads it + a request-driven autoscaler in front. KServe wraps all three as one CRD.
- A model is an artifact at a URI. A model.joblib (sklearn), a model.pt (PyTorch), a saved_model/ (TF), a *.safetensors (transformers), all live somewhere: an object store (s3://, gs://), a PVC (pvc://), an OCI artifact (oci://), a Hugging Face hub URL. The serving runtime (sklearn server, TorchServe, TF Serving, Triton, vLLM, TGI) is a container that loads that URI on startup and serves it over HTTP/gRPC.
- InferenceService + ServingRuntime decouple "what model" from "what runtime". A KServe ServingRuntime (namespaced) or ClusterServingRuntime (cluster-scoped) is a recipe: image, supported modelFormats (sklearn / pytorch / tensorflow / xgboost / huggingface / triton / vllm / tgi), container shape. An InferenceService is the per-model resource: which runtime to use, which storageUri to fetch the model from, autoscaling bounds, traffic splits. The same runtime serves many InferenceServices; the same InferenceService can be retargeted to a newer model URI by a PR. (Custom-predictor mode bypasses the runtime catalogue: you point predictor.containers[] at your own image; we use that in the Bookstore tree because our model is a custom joblib schema.)
- Autoscaling: requests, not CPU. Three options for "scale serving":
  - Knative request-driven (min-scale: 0 / target: 10 concurrent requests / pod): scale-to-zero, scale-up on request, ideal for spiky serving. KServe's default in serverless mode.
  - HPA on CPU (Part 06 ch.04): the boring, robust choice. No scale-to-zero, no Knative install, works on the plain Deployment fallback.
  - KEDA on custom metric (RPS, p95 latency from Prometheus / OTel):
    serves the same "scale on real signal" goal as Knative request-driven,
    but without the Knative cold-start mechanics. Pick when you want
    min-scale > 0 always-warm pods AND request-aware scaling.
- Canary / A/B at the model boundary is request-routing, not replica
rollout. Part 07 ch.05
Argo Rollouts splits replicas between versions of the same Deployment
(a small fraction of Pods run v2, traffic is split at the Service). At
the model boundary the runtime is the same — what differs is which
model URI it loads. KServe's canary field (canaryTrafficPercent: 20) literally adds a second Revision in the Knative model and splits traffic by request. You can do replica-based canary too, but for inference it is usually less precise. A/B is the same primitive at 50/50. Shadow is "send a copy of every request to the candidate, do not show its response to the user". It is implemented as request mirroring at the routing layer rather than as a traffic percentage, and is useful for evaluating a model against real traffic without user-visible risk.
- GPU serving is fractional and batched. Triton can serve N
models on one GPU concurrently (dynamic batching across requests),
MIG (ch.02) gives you isolated GPU
slices per tenant. vLLM / TGI are LLM-specialised runtimes that
pack many concurrent generations onto one GPU. Same Kubernetes
primitives (limits.nvidia.com/gpu, taint/affinity to GPU pool); same PSA-restricted requirement; the new piece is that one inference Pod may concurrently serve many models. Out of scope for the Bookstore (the recommender is sklearn-tiny) but the pattern is identical.
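The request-driven sizing described in the autoscaling bullet can be sketched as a small calculation. This is a deliberately simplified model, assuming only the concurrency target and the min/max bounds matter; the real Knative autoscaler also applies a target utilisation, stable and panic windows, and burst capacity, none of which are modelled here. The function name `desired_pods` is hypothetical.

```python
import math


def desired_pods(in_flight: float, target_per_pod: float,
                 min_scale: int = 0, max_scale: int = 5) -> int:
    """Simplified concurrency-based sizing: ceil(in-flight / target), clamped.

    Assumption: this ignores Knative's utilisation target and its
    stable/panic windows; it only illustrates the core ratio.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(in_flight / target_per_pod)
    return max(min_scale, min(max_scale, wanted))


if __name__ == "__main__":
    # With target 10 and bounds 0..5 (the values used in this chapter):
    for load in (0, 1, 10, 35, 200):
        print(load, "in flight ->", desired_pods(load, target_per_pod=10), "pods")
```

With no requests in flight the result is 0 (scale-to-zero); at 200 in-flight requests the answer is capped by `max_scale`, which is the point where latency starts to absorb the overload.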
Diagrams
Request -> KServe InferenceService (transformer -> predictor -> explainer) -> autoscaler (Mermaid)
flowchart TB
  client["catalog / storefront<br/>HTTP POST /v1/models/recommender:predict"]
  gw["Knative / Istio gateway<br/>(or Ingress for raw deployment mode)"]
  is["InferenceService recommender<br/>(serving.kserve.io/v1beta1)"]
  tx["transformer (optional)<br/>request preprocessing<br/>(e.g. feature lookup)"]
  pred["predictor<br/>= ServingRuntime sklearn<br/>OR custom container<br/>loads model.joblib"]
  exp["explainer (optional)<br/>e.g. SHAP / Alibi"]
  autos["request-driven autoscaler<br/>(Knative concurrency target<br/>OR HPA on CPU<br/>OR KEDA on RPS / latency)"]
  pods["1..N predictor Pods<br/>scale-to-zero when idle"]
  client --> gw --> is
  is --> tx --> pred
  pred --> exp
  pred --> autos --> pods
KServe vs Seldon Core vs Triton vs plain Deployment (ASCII)
WHEN WHICH
────────────────────────────────────────────────────────────────────────────
KServe CRD-FIRST. `InferenceService` + `ServingRuntime`.
(kserve.io) Serverless via Knative (scale-to-zero, request
autoscale) OR raw Deployment mode. Built-in
runtimes for sklearn / pytorch / tensorflow /
xgboost / huggingface / triton / vllm. Canary +
A/B + shadow at the InferenceService level.
= the Kubernetes-native default; we lead with it.
Seldon Core GRAPH-OF-COMPONENTS. `SeldonDeployment` (v1) or
(seldon-core.io) `Model` (v2). Multiple predictor/transformer/
explainer/router nodes wired in a DAG. Drift +
outlier detection built in. = "I need an
inference graph, not a single model" or you
specifically standardised on Seldon.
NVIDIA Triton HIGH-THROUGHPUT MULTI-MODEL RUNTIME. One Triton
(Triton Inference Pod serves N models from N frameworks (TensorRT,
Server) ONNX, PyTorch, TF, Python backend), with dynamic
batching + concurrent execution + ensembles. Use
as a ServingRuntime under KServe, OR stand-alone
via Triton's own Helm chart. = the right runtime
when you need GPU throughput, NOT a competitor
to KServe at the API layer.
plain Deployment APP-FIRST. Deployment + Service + HPA. No CRDs,
+ Service + HPA no Knative, no operator. = the right fit for a
tiny CPU model where serverless/canary are
overkill, and the kind-runnable fallback in
this guide.
OUR BOOKSTORE TREE
recommender-deployment.yaml + recommender-service.yaml (kind-runnable)
recommender-inferenceservice.yaml (CRD-backed; needs KServe + Knative)
Both consume the SAME image bookstore/recommender-serve:dev and the SAME
model.joblib produced by ch.04's recommender-train Job.
Hands-on with the Bookstore
Assumed working directory: the guide repo root (full-guide/). Requires
the PSA-restricted bookstore-ml namespace (ch.01)
and the model.joblib artifact from ch.04.
This chapter installs KServe (with Knative Serving + cert-manager, all
pinned, own namespaces), serves the recommendations model as a
KServe InferenceService AND as a plain Deployment + Service, and
describes how the existing catalog / storefront services call it
without modifying the canonical app.
Two serving paths, both real. The plain Deployment + Service is the kind-runnable path: built-ins, no CRDs, dry-runs cleanly. The KServe InferenceService is the CRD-backed equivalent: it adds scale-to-zero + canary/A-B + the v1/v2 inference protocol routing. Both consume the SAME container image (bookstore/recommender-serve:dev) and the SAME model.joblib. The chapter teaches both because both are real production choices for this size of model.
1. Install KServe (pinned; own namespace)
KServe has a few install footprints. The full serverless install needs Knative Serving + cert-manager + a Knative network layer (Istio or Kourier). The RawDeployment install drops Knative for "just Deployments+HPA under InferenceService" (lighter, no scale-to-zero). We show serverless because the chapter teaches scale-to-zero; RawDeployment is noted as the alternative after the install commands.
# Pin — bump deliberately. KServe ships a "quick install" script AND a
# Helm-chart-per-component. Use the Helm-chart-per-component path: it is
# pinned, scriptable, and matches the guide's "pinned Helm" rule.
KSERVE_VERSION="v0.13.1"   # KServe's OCI chart tags carry the leading "v"
KNATIVE_VERSION="v1.15.0"
CERT_MANAGER_VERSION="v1.15.3"
# 1) cert-manager (KServe's webhook needs serving certs)
helm install cert-manager \
oci://quay.io/jetstack/charts/cert-manager \
--version "$CERT_MANAGER_VERSION" \
-n cert-manager --create-namespace \
--set crds.enabled=true --wait
# 2) Knative Serving (the serverless layer KServe uses by default).
# Knative ships its install as a "core" YAML + a "net-*" YAML; both are
# pinned to a release tag (not :latest).
kubectl apply --server-side \
-f https://github.com/knative/serving/releases/download/knative-${KNATIVE_VERSION}/serving-crds.yaml
kubectl apply --server-side \
-f https://github.com/knative/serving/releases/download/knative-${KNATIVE_VERSION}/serving-core.yaml
# Network layer — Kourier is the smallest:
kubectl apply --server-side \
-f https://github.com/knative/net-kourier/releases/download/knative-${KNATIVE_VERSION}/kourier.yaml
kubectl patch configmap/config-network -n knative-serving --type=merge \
-p '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'
# 3) KServe itself
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
--version "$KSERVE_VERSION" -n kserve --create-namespace --wait
helm install kserve oci://ghcr.io/kserve/charts/kserve \
--version "$KSERVE_VERSION" -n kserve --wait
kubectl get pods -n cert-manager
kubectl get pods -n knative-serving
kubectl get pods -n kserve
kubectl api-resources | grep -E 'kserve|knative'
# inferenceservices (serving.kserve.io/v1beta1)
# servingruntimes, clusterservingruntimes (serving.kserve.io/v1alpha1)
# services, revisions, configurations, routes (serving.knative.dev/v1)
RawDeployment alternative. If scale-to-zero / Knative is not wanted, set the InferenceService annotation serving.kserve.io/deploymentMode: RawDeployment and skip the Knative install; KServe will then create a plain Deployment + HPA. The InferenceService manifest stays nearly identical; the chapter teaches the serverless path because that is what makes KServe meaningfully different from a plain Deployment.
2. Serve via plain Deployment + Service (the kind-runnable path)
This is the path that just works on kind, no KServe required. The
recommender-deployment.yaml
+ recommender-service.yaml
files are built-in apps/v1 Deployment + v1 Service — they dry-run
cleanly anywhere. The same image (bookstore/recommender-serve:dev from
predictor.py) implements
the v1 prediction protocol on port 8080.
# Build + load the serve image (same image used by the InferenceService below)
docker build -t bookstore/recommender-serve:dev examples/bookstore/ml/serve
kind load docker-image bookstore/recommender-serve:dev # if using kind
# Requires the recommender-model PVC from ch.04 — applied + populated by
# recommender-train-job.yaml. The PVC is shared (RWO; for prod use RWX).
kubectl apply -f examples/bookstore/ml/serve/recommender-deployment.yaml
kubectl apply -f examples/bookstore/ml/serve/recommender-service.yaml
kubectl rollout status deploy/recommender -n bookstore-ml --timeout=120s
# Try it
kubectl port-forward -n bookstore-ml svc/recommender 8080:8080 &
curl -s http://localhost:8080/ready
# {"status":"ready"}
curl -s -X POST http://localhost:8080/v1/models/recommender:predict \
-H 'content-type: application/json' \
-d '{"instances":[{"book_id":1,"k":3}]}'
# {"predictions":[{"book_id":1,"k":3,
# "recommendations":[{"book_id":2,"score":0.547...,
# "title":"Book 0002","author":"Author 031"},
# {"book_id":3,"score":0.453...,...},
# {"book_id":4,"score":0.412...,...}]}]}
curl -s "http://localhost:8080/recommend?book_id=42&k=5"
# {"book_id":42,"k":5,"recommendations":[...]}
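The same predict call can be made from code. The sketch below is a minimal Python client using only the standard library; it assumes the port-forward from the commands above is still running and that the response carries a top-level "predictions" key, as in the sample output. The helper names `build_predict_request` and `predict` are hypothetical and are not part of the Bookstore tree.

```python
import json
import urllib.request


def build_predict_request(base_url: str, model: str,
                          instances: list) -> urllib.request.Request:
    """Build a v1-protocol predict request: POST /v1/models/<model>:predict."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/models/{model}:predict",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def predict(base_url: str, model: str, instances: list,
            timeout_s: float = 2.0) -> list:
    """Send the request and return the "predictions" list from the response."""
    request = build_predict_request(base_url, model, instances)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)["predictions"]


if __name__ == "__main__":
    # Only prints the target URL; call predict(...) once the port-forward is up.
    req = build_predict_request("http://localhost:8080", "recommender",
                                [{"book_id": 1, "k": 3}])
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from the network call makes the payload shape easy to check without a running cluster.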
Add an HPA on CPU (the Part 06 ch.04 default) for this Deployment if you want autoscaling without Knative:
kubectl autoscale deploy/recommender -n bookstore-ml \
--cpu-percent=60 --min=1 --max=5
kubectl get hpa -n bookstore-ml
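To see what the `--cpu-percent=60 --min=1 --max=5` settings imply, the HPA's documented scaling rule, desired = ceil(current replicas x current metric / target metric), can be worked through in a few lines. This is a sketch of that formula only; it assumes the default 10% tolerance and leaves out stabilisation windows and scaling policies. The function name is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_cpu_pct: float,
                         target_cpu_pct: float, min_replicas: int = 1,
                         max_replicas: int = 5, tolerance: float = 0.10) -> int:
    """Core HPA rule: ceil(current * currentMetric / targetMetric), clamped.

    Assumption: only the ratio, tolerance band and min/max clamp are
    modelled; real HPAs also apply stabilisation and scaling policies.
    """
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(hpa_desired_replicas(1, 150, 60))  # one pod at 150% of its CPU request
    print(hpa_desired_replicas(3, 63, 60))   # within tolerance of the target
    print(hpa_desired_replicas(4, 120, 60))  # wants 8, capped at max=5
```

Note that the percentage is relative to the container's CPU request, so the Deployment needs CPU requests set for this HPA to compute anything.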
3. Serve via KServe InferenceService (CRD-backed; the serverless path)
recommender-inferenceservice.yaml
is the CRD-backed equivalent. With KServe installed it adds (over the
plain Deployment) scale-to-zero (min-scale: 0), request-driven autoscaling
(autoscaling.knative.dev/target: 10 concurrent / pod), traffic splits
for canary/A-B, and the v1/v2 inference-protocol routing. We use the
custom-predictor option (the same image as the plain Deployment) so
the recommender's custom joblib schema works; KServe's built-in
sklearn ServingRuntime is shown as an OPTION B stub at the bottom of
the manifest (it requires train.py to dump a sklearn estimator
directly).
# Without KServe installed, a client dry-run prints the documented
# CRD-intrinsic message (the same precedent as raw-manifests/51-/70-/83-,
# argocd/, operators/, chaos/, ml/batch/, ml/train/):
kubectl apply --dry-run=client \
-f examples/bookstore/ml/serve/recommender-inferenceservice.yaml
# error: ... no matches for kind "InferenceService" in version
# "serving.kserve.io/v1beta1" — expected, schema-correct.
# With KServe installed (step 1):
kubectl apply -f examples/bookstore/ml/serve/recommender-inferenceservice.yaml
kubectl get inferenceservice -n bookstore-ml
# NAME URL READY
# recommender http://recommender.bookstore-ml.svc.cluster.local True
# The Knative Revision the InferenceService is wrapping:
kubectl get revisions,services.serving.knative.dev -n bookstore-ml
# Scale-to-zero: stop calling, wait, watch replicas go to 0.
# Scale-up-from-zero: make one request, see a cold start (first request
# slower while a pod spins up).
kubectl get pods -n bookstore-ml -w
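The cold start seen on the first request after scale-to-zero is the sum of a few phases (Pod scheduling, container start, model load). A rough way to reason about whether that fits a first-request budget is sketched below. All timings are hypothetical placeholders, not measurements of the Bookstore recommender; `ColdStart` and `needs_warm_pod` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdStart:
    """Phases of a scale-from-zero request, in seconds (hypothetical values)."""
    schedule_s: float
    container_start_s: float
    model_load_s: float

    @property
    def total_s(self) -> float:
        return self.schedule_s + self.container_start_s + self.model_load_s


def needs_warm_pod(cold: ColdStart, first_request_budget_s: float) -> bool:
    """True when the cold start alone exceeds what the first request may take."""
    return cold.total_s > first_request_budget_s


if __name__ == "__main__":
    small_model = ColdStart(schedule_s=1.0, container_start_s=2.0, model_load_s=0.5)
    print(small_model.total_s, needs_warm_pod(small_model, first_request_budget_s=5.0))
```

When the answer is True, the usual lever is a non-zero minimum scale, trading an always-on cost floor for a predictable first response.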
4. Canary / A-B at the InferenceService level
This is the field that turns "deploy a new model" from a config dance into a one-line PR. With the v1 model already serving 100% of traffic:
# A canary InferenceService update: route 20% of traffic to a new model URI.
spec:
predictor:
canaryTrafficPercent: 20
minReplicas: 0
maxReplicas: 5
containers:
- name: kserve-container
image: bookstore/recommender-serve:dev
env: [{ name: MODEL_DIR, value: /workspace/model-v2 }]
# ... same SC, etc. The v2 PVC is the artifact from the v2 training run.
canaryTrafficPercent: 20 tells KServe to keep the previous Revision live
and shift 20% of requests to the new one. Bump to 50, then 100, then
delete the old Revision. This is exactly the
Part 07 ch.05 progressive
delivery pattern, but at the request boundary, not the replica
boundary. Shadow mode sends a copy of each request to the candidate
without showing the response, for measuring v2's behaviour on real
traffic before any user sees it. canaryTrafficPercent only splits live
traffic; mirroring is a routing-layer feature (for example the mirror
setting on an Istio VirtualService) rather than an InferenceService
field, so check what your ingress layer supports before relying on it.
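The difference between a request-level split and a replica-level split can be illustrated with a toy router. This is not how Knative implements routing; it is a sketch, assuming a hash of a request id stands in for per-request weighting, that shows why any percentage is reachable at the request boundary while a "k of N pods" split only offers multiples of 1/N. Both function names are hypothetical.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically send roughly canary_percent% of requests to 'canary'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "default"


def replica_split_options(total_replicas: int) -> list:
    """Traffic shares reachable when the split is 'k of N pods run the candidate'."""
    return [round(100 * k / total_replicas, 1) for k in range(total_replicas + 1)]


if __name__ == "__main__":
    hits = sum(route(f"req-{i}", 20) == "canary" for i in range(10_000))
    print("request-level canary share:", hits / 100, "%")
    print("replica-level options with 5 pods:", replica_split_options(5))
```

With five replicas the smallest non-zero replica-based canary is 20%, whereas the request-level split can start at 5% or lower without adding pods.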
5. The catalog / storefront integration
The recommender's in-cluster DNS is
recommender.bookstore-ml.svc.cluster.local:8080. From the catalog
service (../examples/bookstore/app/catalog/main.go),
the "you might also like" rail on a book page is a single
GET /recommend?book_id=<ID>&k=5 call:
# Curl from inside a debug pod in the bookstore namespace, with the
# storefront / catalog SA — no canonical app change:
kubectl debug -n bookstore --image=curlimages/curl:8.10.1 \
--profile=restricted -it nonroot \
-- curl -s "http://recommender.bookstore-ml.svc.cluster.local:8080/recommend?book_id=1&k=3"
# {"book_id":1,"k":3,"recommendations":[...]}
The chapter narrative shows the integration; the canonical app
(../examples/bookstore/app/, ../examples/bookstore/raw-manifests/,
../examples/bookstore/helm/, ../examples/bookstore/kustomize/) is
not mutated — the recommendations endpoint is a new dependency, and
a real change would be a regular app feature PR through Parts 07/08.
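Because the recommendation rail sits inside a user-facing request budget, a caller would normally bound the call with a timeout and render the page without the rail on failure. The sketch below shows that shape in Python for illustration only (the catalog service itself is Go); the 200 ms timeout, the helper names, and the "empty list on failure" policy are assumptions, and the response shape follows the sample output above.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, List

# In-cluster address from this chapter.
BASE_URL = "http://recommender.bookstore-ml.svc.cluster.local:8080"


def http_fetch(book_id: int, k: int, timeout_s: float = 0.2) -> List[dict]:
    """GET /recommend?book_id=<ID>&k=<K> with a hard timeout (assumed 200 ms)."""
    query = urllib.parse.urlencode({"book_id": book_id, "k": k})
    with urllib.request.urlopen(f"{BASE_URL}/recommend?{query}",
                                timeout=timeout_s) as response:
        return json.load(response)["recommendations"]


def recommendations_or_empty(fetch: Callable[[int, int], List[dict]],
                             book_id: int, k: int = 5) -> List[dict]:
    """Return at most k recommendations, or [] if the call fails or times out."""
    try:
        return fetch(book_id, k)[:k]
    except OSError:
        # URLError, TimeoutError and ConnectionError are all OSError subclasses.
        return []
```

Passing `fetch` in as a parameter keeps the fallback logic testable without a cluster: `recommendations_or_empty(http_fetch, 1, 3)` is the real call, and a stub can stand in for it in tests.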
6. A small load test against the plain Deployment
# A short, restricted-compliant load-test Pod that hammers the predictor.
# Uses curlimages/curl (runs non-root; its Alpine base provides the /bin/sh
# and seq that the loop below needs; restricted-friendly).
# The plain Deployment will trip HPA on CPU and scale up; on the KServe
# path the Knative concurrency target triggers scale-up instead.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: recommender-loadgen
namespace: bookstore-ml
labels: { app.kubernetes.io/part-of: bookstore-ml }
spec:
restartPolicy: Never
automountServiceAccountToken: false
securityContext:
runAsNonRoot: true
runAsUser: 65532
seccompProfile: { type: RuntimeDefault }
containers:
- name: load
image: curlimages/curl:8.10.1
command: ["/bin/sh","-c"]
args:
- |
for i in $(seq 1 500); do
curl -s -o /dev/null \
"http://recommender.bookstore-ml.svc.cluster.local:8080/recommend?book_id=$((1 + i % 50))&k=5"
done
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities: { drop: ["ALL"] }
EOF
kubectl get hpa,pods -n bookstore-ml -w
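One caveat with the loop above: it issues requests one at a time, so the number of in-flight requests stays close to 1, which may not exceed a Knative concurrency target of 10. A parallel driver is sketched below, assuming a caller-supplied `send` function that performs one request and returns its HTTP status; `run_load` and `book_id_for` are hypothetical names, and the book id rotation mirrors the shell arithmetic in the Pod.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def book_id_for(i: int) -> int:
    """Same rotation as the shell loop: 1 + i % 50."""
    return 1 + i % 50


def run_load(send: Callable[[int], int], total_requests: int,
             concurrency: int) -> List[int]:
    """Call send(i) for i in 1..total_requests with up to `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send, range(1, total_requests + 1)))


if __name__ == "__main__":
    # Stub sender so the sketch runs anywhere; swap in a real HTTP call
    # against /recommend?book_id=<book_id_for(i)>&k=5 to generate load.
    statuses = run_load(lambda i: 200, total_requests=500, concurrency=20)
    print(len(statuses), "requests,", statuses.count(200), "OK")
```

Raising `concurrency` above the per-pod target is what makes the request-driven autoscaler add pods; the sequential loop mostly exercises CPU on a single pod.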
How it works under the hood
- The KServe InferenceService reconcile. When you apply an InferenceService, the KServe controller looks at spec.predictor.model.modelFormat.name and picks a matching ServingRuntime (sklearn / pytorch / tensorflow / huggingface / triton / vllm / xgboost), OR uses spec.predictor.containers[] for a custom predictor. It also looks at spec.predictor.minReplicas / maxReplicas / annotations[autoscaling.knative.dev/...]. In serverless mode it creates a Knative Service -> Configuration -> Revision -> Deployment. In RawDeployment mode it creates a plain Deployment + HorizontalPodAutoscaler. The model artifact at storageUri is fetched on Pod startup by a storage-initializer init-container (it supports the common URL schemes: s3, gs, pvc, oci, hf, http(s)); the artifact lands at /mnt/models inside the predictor container by convention, and the runtime image loads from there. For custom predictor mode as used here there is no init-container: your image loads the model itself from a path you mount.
- Knative autoscaling, concretely. The Knative-Serving Pod
Autoscaler (KPA) sits in front of the Deployment, watches concurrency
/ RPS at the activator, and scales by:
target_concurrency (e.g. 10 in-flight requests per pod). min-scale: 0 is the scale-to-zero setting: when concurrency stays at 0 for the stable window, KPA scales to 0 and traffic for the next request lands on the activator (a buffering proxy) which holds the request until a Pod is ready. Cold-start latency = (Pod schedule) + (container start) + (model load), so for a large model this matters; KServe documents containerConcurrency / target-utilization-percentage / target-burst-capacity levers to tune. The Bookstore recommender cold-starts in seconds because the model is small.
- HPA vs KEDA vs Knative: pick the right one. HPA on CPU (or
on a custom metric via metrics-server / Prometheus adapter) is the
default, robust choice; min-replicas is always > 0; works on any
Deployment. KEDA (Part 06 ch.04)
wraps an HPA and scales from a custom external scaler — RPS,
latency, queue depth, Prometheus query — and can scale to 0 if you
set minReplicaCount: 0 (KEDA does the deactivation; Knative is not required). Knative is request-driven internal autoscaling and baked into KServe serverless. Pick: Knative when using KServe serverless; KEDA for everything else that needs scale-to-zero; HPA-CPU for the simple stateless-ML case (which is what the Bookstore's plain-Deployment path does).
- Canary / A-B / shadow at the InferenceService level. KServe
models a predictor as a Default and (optionally) a Canary Revision. The canaryTrafficPercent field tells Knative to route N% of requests to Canary, (100-N)% to Default. Because Knative is request-routed, the split is a percentage of requests (not the replica-bucket approximation of Part 07 ch.05 Argo Rollouts). Shadow mirroring is a routing-layer feature rather than an InferenceService field (for example the mirror setting on an Istio VirtualService): the candidate gets a copy of the request, its response is discarded. Argo Rollouts is still right for the predictor image's rollout (a CVE patch in the runtime container is a code rollout, not a model rollout); the two coexist: Rollouts rolls the runtime, KServe splits between models.
- GPU serving: Triton + MIG + vLLM/TGI. A
ServingRuntime for Triton is a Triton container that loads N models from a model repository; one InferenceService -> one Triton Pod with many models. GPU access is exactly the ch.02 shape: limits.nvidia.com/gpu: 1, taint+affinity to the GPU pool, PSA-restricted SC. MIG lets you slice an A100/H100 GPU into isolated halves/quarters/sevenths so multiple Triton Pods (or multiple InferenceServices) share one physical GPU without time-slicing, which is the right call when many small models compete for one big GPU. vLLM and TGI are LLM-specific runtimes that pack many concurrent generations onto one GPU with continuous batching; KServe ships them as built-in ServingRuntimes. Out of scope for the Bookstore (the recommender is sklearn-tiny) but the pattern is the same.
- Seldon Core and when to pick it. Seldon Core (v2) models a serving
artifact as a Model (one URI + runtime), a Pipeline as a graph of Models with inputs/outputs (so a request flows feature-lookup -> model-A -> router -> model-B -> postprocess as a DAG), and Experiment for traffic-splitting between graphs. It ships Detector CRDs for drift + outlier, which KServe does via explainer and Alibi-Detect components but with less first-class support. Pick Seldon when you genuinely have a graph (e.g. fraud detection: feature store -> rule engine -> model -> override) or you standardised on it; pick KServe when the unit of work is a single model endpoint (the vastly more common case).
- GitOps for models.
Part 07 ch.04 said the cluster
state lives in Git, reconciled by Argo CD or Flux. For models that
means: the InferenceService manifest lives in Git, its storageUri and image are the versioned fields, a model promotion (v1 -> v2) is a PR. The retrain pipeline (ch.04) produces a new artifact at a new URI; CI writes a PR that bumps storageUri and sets canaryTrafficPercent: 20; review/merge -> Argo CD applies -> KServe shifts traffic. Roll back: revert the PR; Argo CD reapplies; KServe shifts back. Models are not special infrastructure any more: they are config, in Git, like everything else.
- PSA on every serving Pod. KServe wraps the predictor in a Knative
Service which spawns a Deployment which spawns Pods — your SC on the
InferenceService propagates all the way through. The Knative
activator and queue-proxy sidecars come with their own SC (KServe and
Knative make them PSA-restricted-admissible); your predictor container must be too. The serving image baked here (../examples/bookstore/ml/serve/Dockerfile) runs as uid 65532 + non-root.
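The "model promotion is a PR" step from the GitOps bullet amounts to a two-field change in the manifest. A minimal sketch of what a CI job might do before opening that PR is shown below, operating on the manifest as a plain dict. It assumes the built-in-runtime layout with spec.predictor.model.storageUri; the bucket paths are hypothetical, and `promote` is an illustrative helper, not an existing tool.

```python
import copy


def promote(manifest: dict, new_storage_uri: str, canary_percent: int) -> dict:
    """Return a copy of the manifest pointing at the new model with a canary split."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    updated = copy.deepcopy(manifest)  # the original stays intact for the diff
    predictor = updated["spec"]["predictor"]
    predictor["model"]["storageUri"] = new_storage_uri
    predictor["canaryTrafficPercent"] = canary_percent
    return updated


if __name__ == "__main__":
    current = {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": "recommender", "namespace": "bookstore-ml"},
        "spec": {"predictor": {"model": {
            "modelFormat": {"name": "sklearn"},
            # Hypothetical URI, for illustration only.
            "storageUri": "s3://example-models/recommender/v1",
        }}},
    }
    candidate = promote(current, "s3://example-models/recommender/v2", 20)
    print(candidate["spec"]["predictor"])
```

Because the function returns a new dict and leaves the input untouched, the rollback path is simply re-applying the previous manifest, which is what reverting the PR does.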
Production notes
In production: pick KServe + Knative serverless for spiky request-driven workloads where scale-to-zero saves real money; pick KServe + RawDeployment (or a plain Deployment + HPA) for steady always-on inference where Knative cold-start risk outweighs the cost savings. Document the choice; do not run two serving stacks on one cluster by accident.
In production: install KServe + Knative + cert-manager via pinned Helm charts / pinned release-tag YAMLs in their own namespaces (kserve, knative-serving, cert-manager). Treat upgrades like any control-plane component. Never install from a releases/latest URL.
In production: the model artifact lives in an object store (or a registry), not on a per-cluster PVC. Use the Pod's cloud identity (Part 10 ch.03) to fetch from s3/gs/oci, never static keys in a Secret. The InferenceService's storageUri is the versioned address; promotions are PRs (GitOps, Part 07 ch.04).
In production: canary at the InferenceService boundary with small canaryTrafficPercent first (5%, 20%, 50%, 100%), with automated metrics-driven promotion (the Part 07 ch.05 AnalysisTemplate pattern applied to inference SLOs: p95 latency, error rate, and a domain-specific quality signal like CTR uplift on shadow traffic). For high-risk changes use shadow before any percent goes live.
In production: observability. Every InferenceService exposes Prometheus metrics (request count, latency histogram per revision, model load time, error counters) via Knative's queue-proxy and KServe's own exporters. Wire them into the Part 06 ch.01 Prometheus + the dashboards. The new model-specific metrics: prediction latency p50/p95/p99, batch utilisation (Triton), GPU utilisation (DCGM, ch.02), and a quality signal (a periodic eval job, scored against a held-out set, posted as a metric).
In production: ML pods are not exempt from PSA. The predictor container, the transformer, the explainer, the storage-initializer init-container: all of them carry restricted SC. Triton's and vLLM's official images default to root; the SC must still be applied and the USER pinned. The CRD-intrinsic dry-run rule applies to InferenceService / ServingRuntime (their headers document this, like every CRD in this guide).
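The metrics-driven promotion mentioned above reduces to a comparison between the canary's and the baseline's SLO signals. The sketch below shows one possible decision rule; the thresholds (20% latency headroom, 1% error rate) and the three outcomes are assumptions chosen for illustration, not values from this guide or from Argo Rollouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevisionStats:
    """SLO signals for one revision over the analysis window."""
    p95_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def canary_decision(baseline: RevisionStats, canary: RevisionStats,
                    max_latency_ratio: float = 1.2,
                    max_error_rate: float = 0.01) -> str:
    """Return 'rollback', 'hold' or 'promote' (thresholds are hypothetical)."""
    if canary.error_rate > max_error_rate:
        return "rollback"  # errors are the hardest stop
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "hold"      # slower than allowed: stay at the current percent
    return "promote"


if __name__ == "__main__":
    v1 = RevisionStats(p95_latency_ms=40.0, error_rate=0.001)
    print(canary_decision(v1, RevisionStats(45.0, 0.002)))
    print(canary_decision(v1, RevisionStats(120.0, 0.002)))
    print(canary_decision(v1, RevisionStats(45.0, 0.05)))
```

A domain-specific quality signal would slot in as a third check; it is left out here because its definition depends on the product.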
Quick Reference
# Install KServe stack (pinned; own namespaces)
KSERVE_VERSION="v0.13.1"   # KServe's OCI chart tags carry the leading "v"
KNATIVE_VERSION="v1.15.0"
CERT_MANAGER_VERSION="v1.15.3"
helm install cert-manager oci://quay.io/jetstack/charts/cert-manager \
--version "$CERT_MANAGER_VERSION" -n cert-manager --create-namespace \
--set crds.enabled=true --wait
kubectl apply --server-side \
-f https://github.com/knative/serving/releases/download/knative-${KNATIVE_VERSION}/serving-crds.yaml
kubectl apply --server-side \
-f https://github.com/knative/serving/releases/download/knative-${KNATIVE_VERSION}/serving-core.yaml
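kubectl apply --server-side \
-f https://github.com/knative/net-kourier/releases/download/knative-${KNATIVE_VERSION}/kourier.yaml
# Same ingress-class patch as step 1, so Knative actually uses Kourier:
kubectl patch configmap/config-network -n knative-serving --type=merge \
-p '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'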
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
--version "$KSERVE_VERSION" -n kserve --create-namespace --wait
helm install kserve oci://ghcr.io/kserve/charts/kserve \
--version "$KSERVE_VERSION" -n kserve --wait
# Build + load the serve image, then deploy both paths
docker build -t bookstore/recommender-serve:dev examples/bookstore/ml/serve
kind load docker-image bookstore/recommender-serve:dev
# Plain Deployment fallback (the kind-runnable path)
kubectl apply -f examples/bookstore/ml/serve/recommender-deployment.yaml
kubectl apply -f examples/bookstore/ml/serve/recommender-service.yaml
kubectl autoscale deploy/recommender -n bookstore-ml --cpu-percent=60 --min=1 --max=5
# CRD-backed InferenceService (needs KServe)
kubectl apply -f examples/bookstore/ml/serve/recommender-inferenceservice.yaml
kubectl get inferenceservice,revisions,services.serving.knative.dev -n bookstore-ml
# Test
kubectl port-forward -n bookstore-ml svc/recommender 8080:8080 &
curl -s -X POST http://localhost:8080/v1/models/recommender:predict \
-H 'content-type: application/json' \
-d '{"instances":[{"book_id":1,"k":3}]}'
Minimal skeleton (KServe InferenceService, custom predictor, serverless):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: recommender
namespace: bookstore-ml
annotations:
autoscaling.knative.dev/min-scale: "0"
autoscaling.knative.dev/max-scale: "5"
autoscaling.knative.dev/target: "10"
spec:
predictor:
minReplicas: 0
maxReplicas: 5
canaryTrafficPercent: 20 # OPTIONAL — for canary rollouts
securityContext:
runAsNonRoot: true
runAsUser: 65532
seccompProfile: { type: RuntimeDefault }
containers:
- name: kserve-container
image: <YOUR-PREDICTOR-IMAGE>
ports: [{ containerPort: 8080, name: http1 }]
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities: { drop: ["ALL"] }
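The securityContext fields in the skeleton map onto a handful of the Pod Security "restricted" checks. A rough pre-flight check is sketched below; it covers only four of the restricted-profile rules (runAsNonRoot, seccompProfile, allowPrivilegeEscalation, dropped capabilities) and is not a substitute for admission or a server-side dry-run. The function name is hypothetical.

```python
def restricted_violations(pod_sc: dict, container_sc: dict) -> list:
    """List which of four PSA 'restricted' checks a pod/container SC pair fails."""
    problems = []
    # Container-level values override pod-level ones where both exist.
    run_as_non_root = container_sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
    if run_as_non_root is not True:
        problems.append("runAsNonRoot must be true")
    seccomp = container_sc.get("seccompProfile", pod_sc.get("seccompProfile")) or {}
    if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
        problems.append("seccompProfile.type must be RuntimeDefault or Localhost")
    if container_sc.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    dropped = (container_sc.get("capabilities") or {}).get("drop", [])
    if "ALL" not in dropped:
        problems.append("capabilities.drop must include ALL")
    return problems


if __name__ == "__main__":
    pod = {"runAsNonRoot": True, "runAsUser": 65532,
           "seccompProfile": {"type": "RuntimeDefault"}}
    container = {"allowPrivilegeEscalation": False,
                 "readOnlyRootFilesystem": True,
                 "capabilities": {"drop": ["ALL"]}}
    print(restricted_violations(pod, container))  # the skeleton's values
```

readOnlyRootFilesystem is good practice and is set in the skeleton, but it is not one of the restricted-profile requirements, so the check does not flag its absence.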
Checklist:
- KServe + Knative + cert-manager installed via pinned Helm/manifest, own namespaces
- Custom predictor OR built-in ServingRuntime (sklearn / pytorch / huggingface / triton / vllm) chosen deliberately
- storageUri points at an object store (cloud) or PVC (kind), not baked into the image
- Autoscaling target chosen explicitly: Knative request-driven (serverless) OR HPA-CPU OR KEDA (rps/latency)
- Canary / A/B / shadow at the InferenceService level for model rollouts; Argo Rollouts for runtime rollouts
- PSA-restricted SC on every predictor / transformer / explainer container
- Observability: KServe + Knative metrics into Prometheus (Part 06 ch.01), plus a quality metric
- CRD-backed manifests carry the CRD-intrinsic header note
- Plain Deployment + Service + HPA is a valid choice for small CPU models — and the kind-runnable fallback
Test your understanding
Try each before opening the answer drawer. The act of trying is the exercise; the answer is the check.
-
What does KServe give you that a Deployment + Service + HPA does not?
Show answer
(1) A standard model-loading contract via `storageUri`: fetch from S3/GCS/PVC at startup; (2) built-in `ServingRuntime`s for sklearn / pytorch / triton / huggingface / vllm so you don't write a server per framework; (3) `predictor`/`transformer`/`explainer` separation for pre/post-processing and explainability without inlining into model code; (4) Knative-backed scale-to-zero for bursty inference; (5) traffic split at the InferenceService level for model canary (v1: 90%, v2: 10%); (6) a standard `/v1/models/<name>:predict` API surface. For a small CPU model with steady traffic, plain Deployment is fine. For multi-model / multi-framework / canary / scale-to-zero / GPU, KServe earns its keep.
-
Your KServe
InferenceServicehasscaleTarget: 5andscaleMetric: rps, but during a traffic spike to 200 RPS the pod count stays at 1 with high latency. What's wrong?
Show answer
Likely (a) the Knative autoscaler's stable window is 60s by default, so short spikes may not trigger scaling; check the `KPA` config. (b) `containerConcurrency` is unset, so Knative assumes one pod can handle unbounded concurrent requests; set it to a realistic value (e.g. 10) so KPA targets the right pod count. (c) Cluster autoscaler/Karpenter can't provision new nodes fast enough, so pods can only schedule on existing nodes. (d) The RPS metric isn't being collected; verify the `autoscaling.knative.dev/metric` annotation. Most common: `containerConcurrency` not set, so KPA's math is wrong.
- You want to canary v2 of the recommender at 5% traffic. Walk through the InferenceService canary syntax and what happens if v2's p99 latency triples.
Show answer
Update the predictor's `storageUri` to the v2 model and set `spec.predictor.canaryTrafficPercent: 5`. KServe routes 5% of traffic to the new Knative `Revision` for v2 and keeps the remaining 95% on the previously rolled-out v1 revision. If v2's p99 triples, the *user-facing* SLO degrades on 5% of requests; if you have an SLO gate (Prometheus alert + Argo Rollouts AnalysisTemplate), the rollout pauses and surfaces the breach. Without a gate, you stay at 5% until a human notices. The discipline: every canary must have a quality gate (latency, error rate, business metric like "prediction acceptance rate"), not just "deploy and watch."
- What's the difference between scale-to-zero via Knative and HPA `minReplicas: 0`?
Show answer
HPA `minReplicas: 0` does not work out of the box: the API rejects it unless the alpha `HPAScaleToZero` feature gate is enabled, and even then only with Object or External metrics. KEDA can scale to zero (via the `ScaledObject` CRD wrapping HPA) for queue-driven workloads. Knative scales request-driven workloads to zero by inserting an Activator proxy: when a request arrives at a zero-pod Revision, the Activator buffers the request, scales the Revision up, then forwards it. Cold-start latency is the trade-off: the first request after a quiet period takes seconds. For online serving with bursty traffic and idle periods, scale-to-zero saves cost. For low-latency-required serving (storefront recommendations), keep `minScale: 1` and accept the cost floor.
- Hands-on: deploy a tiny sklearn model via KServe with `scaleMetric: rps` and `containerConcurrency: 10`. Fire 50 concurrent requests via `hey -z 30s -c 50`. What scale do you see, and what does the latency graph look like?
What you should see
At `containerConcurrency: 10` and 50 concurrent, KPA targets 5 pods. Initial pod count is 1; the autoscaler ramps to 5 within ~30s (Knative panic mode can ramp faster under load). Latency spikes during ramp-up (the single pod is saturated), then settles. After the load stops, pods scale back to `minScale` (often 0 for serverless mode) within `scaleDownDelay`. The latency curve looks like a ramp-up spike then plateau — the cold-start tax you pay for the cost savings of scale-to-zero.
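The canary and autoscaling questions above can be tied together in one manifest. The following is a minimal sketch, assuming serverless (Knative) mode and an `InferenceService` named `recommender` that already serves v1; the name and the v2 bucket path are hypothetical, and the autoscaling values simply echo the numbers used in the questions.

```yaml
# Sketch only: assumes serverless mode and that "recommender" already serves v1.
# The name and storageUri are hypothetical placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommender
spec:
  predictor:
    canaryTrafficPercent: 5      # 5% to the newest revision, the rest to the previous one
    scaleMetric: rps             # scale on requests per second
    scaleTarget: 5               # target RPS per pod
    containerConcurrency: 10     # hard cap on in-flight requests per pod
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://example-bucket/models/recommender/v2   # new model version
```

Under this reading, the canary is driven by re-applying the manifest: lowering `canaryTrafficPercent` to 0 sends traffic back to the previous revision, and raising it step by step widens the v2 share. Either step can be a PR in a GitOps flow.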
Further reading
- Rosso et al., Production Kubernetes, ch.13 — "Autoscaling" and ch.7 — "Workload Runtimes" — the production basis for the request-driven autoscaling + runtime-as-component patterns KServe formalises.
- Ibryam & Huß, Kubernetes Patterns 2e — Elastic Scale (ch.29) and Service Mesh (ch.17) — elastic, request-aware scaling and the mesh layer that propagates traffic splits.
- Official: KServe docs https://kserve.github.io/website/ (`InferenceService`, `ServingRuntime`, serverless vs RawDeployment, canary, the v1/v2 inference protocols); Knative Serving docs https://knative.dev/docs/serving/ (KPA, concurrency targets, scale-to-zero); Seldon Core docs https://docs.seldon.io/ (Models / Pipelines / Experiments); NVIDIA Triton docs https://docs.nvidia.com/deeplearning/triton-inference-server/ (multi-model, dynamic batching, MIG).