07 — ML pipelines and workflows¶
The orchestration layer that ties Part 12 together: Argo Workflows (
Workflow/WorkflowTemplate/ClusterWorkflowTemplate/CronWorkflow; DAG vs Steps; parameter + artifact passing; PVC- or object-storage-backed artifacts;withItemsparameter sweeps) and why it differs from Argo CD (Part 07 ch.04) — Workflows is the pipeline run, CD is the desired state for the pipeline manifests themselves; Kubeflow Pipelines (KFP —Pipeline/Run, the Argo-Workflows-backend of v1, KFP v2 + metadata/lineage, the KFP SDK to compile Python -> the IR); Tekton Pipelines as the CI-shaped alternative (one paragraph: when which); event-driven retraining with Argo Events (EventSource->Sensor->Workflow); the ML pipeline shape (ingest -> preprocess -> train -> eval -> register -> promote -> monitor) and artifact + metadata lineage (which dataset + code SHA + hyperparameters made this model?); parameter sweeps + HPO via Argo WorkflowswithItemsand Katib (the Kubeflow HPO operator —Experiment/Trial/Suggestion); scheduled retraining withCronWorkflow; a fair, honest Argo Workflows vs Kubeflow Pipelines vs Airflow/Prefect/Dagster comparison — applied by installing Argo Workflows (and Argo Events) via pinned Helm in their own namespaces, and authoring the recommendations train -> eval -> register -> promote pipeline as a realWorkflowTemplate+CronWorkflow+ Argo Events webhook trigger, inexamples/bookstore/ml/pipeline/, reusing the same train + serve images from ch.04 and ch.06 so the loop is end-to-end real.
Estimated time: ~60 min read · half-day hands-on
Prerequisites: Part 12 ch.04 — training step the pipeline orchestrates · Part 12 ch.06 — serving step the pipeline promotes to · Part 07 ch.04 — Argo CD contrasted with Argo Workflows
You'll know after this: • author an Argo WorkflowTemplate with DAG, parameters, and artifact passing · • compare Argo Workflows / Kubeflow Pipelines / Tekton / Airflow / Prefect for ML · • wire Argo Events EventSource -> Sensor -> Workflow for drift-triggered retraining · • capture artifact + metadata lineage (dataset + code SHA + hyperparameters → model) · • schedule retraining via CronWorkflow and parameter sweeps via withItems / Katib
Why this exists¶
So far Part 12 has produced the pieces: a GPU/accelerator story
(ch.02), a batch/queue/gang shape
(ch.03), real training jobs
(ch.04), notebooks
(ch.05), and a serving layer
(ch.06). Each produces or consumes the
model.joblib artifact at the centre of the recommendations thread.
What is missing — the "who triggers what?" layer — is the pipeline:
the single object that knows training comes before evaluation, evaluation
before registration, registration before promotion, and that the same
sequence should run again tomorrow on tomorrow's data without anyone
typing kubectl apply four times.
Three concrete failures motivate a pipeline runtime rather than ad-hoc chains of Jobs:
- Order is implicit. "Run training, then evaluation, but only if
training succeeded; then registration; then promotion." Encoded as four
shell commands in a runbook, that ordering rots: a new step is added in
the middle, a side-effect of the third step is missed, the on-call
person on a Saturday skips evaluation "just this once". A pipeline runtime
makes ordering and conditional execution a declarative DAG —
kubectl applya Workflow object, the controller does the rest. - Re-runs are first-class. "Train this every night against the new
day's data" is not a thing
kubectldoes. Kubernetes hasCronJobfor a single Job, but ML retraining is the whole DAG — aCronWorkflowschedules the whole pipeline against parameters that change per-run. - Lineage is the unit of trust for a model. "Which dataset, which code SHA, which hyperparameters produced the model currently serving traffic?" — when the answer is "we'd have to grep CI logs", you cannot roll back to a known-good model, debug a regression, or pass an audit. A pipeline runtime captures the inputs and outputs of every step as metadata and artifacts, so the answer to "where did this model come from?" is a query, not an investigation.
This chapter is the orchestration spine of Part 12. It picks Argo
Workflows as the Kubernetes-native pipeline runtime (a single CRD, no
second control plane, identical posture to the rest of the guide), names
Kubeflow Pipelines as the ML-opinionated layer many teams pick on top,
contrasts Tekton (the CI-shaped alternative — Argo Workflows for ML
pipelines, Tekton for CI/CD, when which), shows event-driven retraining
with Argo Events, and walks HPO via Argo withItems and Katib.
Then it walks the Bookstore's recommendations pipeline as one real
WorkflowTemplate running on kind. The references are Ibryam & Huß,
Kubernetes Patterns, Periodic Job (ch.7) and Daemon Service + the
Argo Workflows and Kubeflow Pipelines docs.
Mental model¶
A pipeline is a DAG of Pods (the steps) whose ordering, parameters, artifacts, and lifecycle live in a CRD. A pipeline runtime is the controller that reconciles that DAG.
- The unit of work is the step, the unit of orchestration is the
DAG. A step is a container that does one thing (train, eval, register,
promote) — a Pod, exactly as it would be on its own. A DAG wires steps
with
depends:edges and parameter / artifact passing. The pipeline runtime (Argo Workflows, KFP, Tekton) is the controller that, given the CRD, creates Pods for ready steps, waits for them, runs the next batch, cleans up. Same shape as every other Kubernetes controller in the guide; what is new is the DAG primitive and the parameter/artifact bus between steps. - Parameters are inline values; artifacts are files. Parameters
(strings, numbers) are passed by reference — Argo's
{{tasks.eval.outputs.parameters.score}}syntax. Artifacts (amodel.joblib, ametrics.json, adataset.parquet) are passed by file — written to a shared PVC, an S3 bucket, or Argo's artifact store, and surfaced to the next step at a mount path. Use parameters for scalars (a score, a URI, a hyperparameter); artifacts for files (models, datasets, plots). The Bookstore pipeline uses PVC-backed artifacts (therecommender-modelPVC from ch.04) — the simplest path for kind. In prod the artifact store is object storage. - Argo Workflows is the Kubernetes-native default. A single CRD set
(
Workflow,WorkflowTemplate,ClusterWorkflowTemplate,CronWorkflow), one controller, container-native steps, DAG + Steps semantics, parameter + artifact passing,withItemssweeps, retries, exit handlers. It is the engine Kubeflow Pipelines v1 used underneath; v2 introduced its own middle layer (Kubernetes-native KFP v2 backends still use Argo Workflows). Pick Argo Workflows when you want the pipeline primitive without buying the ML platform around it. - Kubeflow Pipelines is the ML opinion on top. KFP adds an SDK (
kfpin Python: define a pipeline as decorated Python functions; the SDK compiles to the IR / Argo Workflow YAML), a UI for browsing runs and experiments, a metadata store for tracking inputs/outputs across runs (lineage!), and a model registry. Pick KFP when (a) the data scientists prefer to write pipelines as Python rather than YAML, (b) you want cross-run lineage out-of-the-box, or (c) you are already adopting Kubeflow as a distribution. The cost: another layer to install/operate, and decisions KFP made for you you may not want. - Tekton is the CI shape, not the ML shape. Tekton Pipelines is also
a Kubernetes-native pipeline runtime, but its model is CI-flavoured
(
Task/Pipeline/PipelineRun, withWorkspaces for build inputs andResults for outputs). It is excellent for CI/CD pipelines (you see it under Konflux, OpenShift Pipelines), and runnable for ML — but Argo Workflows is the community-default for ML on Kubernetes and has the artifact + lineage + sweep ergonomics ML needs. Rule of thumb: Argo Workflows for ML pipelines, Tekton for CI/CD. Both can coexist. - Event-driven pipelines = Argo Events on top. "Run this pipeline
when a new dataset arrives" is not a schedule; it is an event. Argo
Events models it as an
EventSource(a webhook, a Git provider, S3, PubSub, SQS, Kafka, …) and aSensorthat watches the EventSource'sEventBusfor matching events and translates them into aWorkflowtrigger. The EventSource + Sensor + EventBus install as one Helm chart; the pattern is the Ibryam Daemon Service (a long-running listener) driving the Periodic Job (the pipeline). Scheduled (CronWorkflow) and event-driven (Argo Events) coexist — choose by trigger. - Lineage = "which dataset + code SHA + hyperparameters made this
model?". A pipeline runtime that does not record this is a worse
position than no pipeline at all (you now have automated lineage-
destroying runs). The simplest discipline: every pipeline run records,
as parameters or artifacts, the dataset URI it consumed, the
image SHA of the train step (which pins the code), and the hyper-
parameters in the
Workflow.spec.arguments.parameters. A model registry (MLflow, KFP Model Registry) is the canonical place; a ConfigMap stamp (pipeline/register-cm-template.yaml) is the kind-runnable proxy this chapter ships; both are honestly marked.
This builds on Part 07 ch.04 (Argo CD = desired state in Git; we now have a second Argo — Workflows — that runs pipelines), Part 01 ch.07 (Job/CronJob — pipelines are DAGs of those), and ch.04 (the train step we orchestrate). It does not re-teach any of them.
Diagrams¶
The recommendations DAG — train -> eval -> register -> promote, with artifacts + parameters flowing (Mermaid)¶
flowchart TB
submit["argo submit --from workflowtemplate/
recommender-pipeline"]
ctl["Argo Workflows controller
(ns argo)"]
ws["bookstore-ml namespace
(PSA enforce: restricted)"]
train["train
image: bookstore/recommender-train:dev
writes model.joblib to PVC"]
pvc[("PVC recommender-model
model.joblib
metrics.json")]
eval["eval
same image; loads model.joblib
writes metrics.json
emits 'score' parameter"]
reg["register
kubectl create configmap
(model_uri + score + workflow + ts)"]
cm[("ConfigMap recommender-
model-registry-<workflow>
(registry stamp proxy)")]
promote["promote
kubectl annotate + rollout restart
deploy/recommender"]
serve[("Deployment recommender
(reads PVC on start)")]
submit --> ctl --> ws
ws --> train --> pvc
pvc --> eval --> reg
eval -- score --> reg
reg --> cm
reg --> promote --> serve
Argo Workflows vs Kubeflow Pipelines vs Tekton vs Airflow (ASCII)¶
WHO DOES WHAT (pipeline runtimes for ML on Kubernetes)
────────────────────────────────────────────────────────────────────────────
Argo Workflows KUBERNETES-NATIVE, CONTAINER-NATIVE DAG runner. CRDs:
(argoproj.io) Workflow, WorkflowTemplate, ClusterWorkflowTemplate,
CronWorkflow. DAG + Steps + Suspend. Parameter and
artifact passing (PVC, S3, OSS, GCS, …). withItems
for parameter sweeps. Exit handlers. = the community
default for ML pipelines on Kubernetes; what KFP
v1 ran on; the lead choice in this chapter.
Kubeflow Pipelines ML-OPINIONATED layer on top. SDK (`kfp` Python ->
(kubeflow.org) compile to IR). UI (browse runs, experiments,
compare). Metadata store (cross-run lineage). Model
registry. v1 used Argo Workflows as the backend; v2
keeps the Kubernetes-native backend story. = adopt
when you want the ML *platform* (Python pipelines +
UI + lineage + registry) and accept the operate
cost.
Tekton Pipelines CI/CD-SHAPED Kubernetes-native pipeline runtime.
(tekton.dev) Task / Pipeline / PipelineRun / Workspace. The right
runtime for *CI* on Kubernetes (build, test, sign,
push). Runnable for ML, but lacks Argo Workflows'
ML-tuned artifact + sweep ergonomics. = use for
CI/CD; pair with Argo Workflows for ML.
Airflow / Prefect / HISTORICALLY non-Kubernetes-native — worker-based
Dagster schedulers (Celery / Dask / their own workers) that
*can* run on Kubernetes (KubernetesExecutor /
KubernetesPodOperator), but Kubernetes is one
execution mode, not the model. Strong scheduling +
backfill + Python ergonomics + a vast operator
catalog (databases, queues, vendor APIs). = pick
when you have non-Kubernetes work and want one
scheduler over your *data platform*; pair with
Argo Workflows for the Kubernetes-native ML bits
if you have both.
WHICH IS THE SAME OBJECT? — None: they are alternatives. They DO coexist
─────────────────── in real fleets (Argo CD for GitOps, Argo Workflows
for ML pipelines, Tekton for CI/CD; Airflow for
upstream data orchestration). The skill is choosing
the right one per job.
OUR BOOKSTORE TREE
examples/bookstore/ml/pipeline/recommender-workflow.yaml (Argo
Workflow)
examples/bookstore/ml/pipeline/recommender-cronworkflow.yaml (Argo
schedule)
examples/bookstore/ml/pipeline/recommender-eventsource.yaml
examples/bookstore/ml/pipeline/recommender-sensor.yaml (Argo
Events)
Hands-on with the Bookstore¶
Assumed working directory: the guide repo root (full-guide/). Requires
the PSA-restricted bookstore-ml namespace
(ch.01), the train image
(bookstore/recommender-train:dev from
../examples/bookstore/ml/train/Dockerfile),
and the recommender-model PVC (ch.04).
The pipeline reuses the same images as ch.04 and ch.06. This is the through-line of Part 12: every chapter has produced a real ingredient; this chapter is the recipe. The Workflow's
trainstep usesbookstore/recommender-train:dev; theevalstep reuses the same image as a tiny Pythonscript(sklearn + joblib are already baked); theregisterstep usesbitnami/kubectl; thepromotestep uses the same. No new images to build for this chapter.
1. Install Argo Workflows (pinned; own namespace)¶
helm repo add argo https://argoproj.github.io/argo-helm
ARGO_WORKFLOWS_VERSION="0.42.0" # bump deliberately; chart version != app version
helm install argo-workflows argo/argo-workflows \
--version "$ARGO_WORKFLOWS_VERSION" \
-n argo --create-namespace --wait
# CRDs come with the chart. Confirm:
kubectl api-resources | grep argoproj
# workflows wf argoproj.io/v1alpha1 true Workflow
# workflowtemplates wftmpl argoproj.io/v1alpha1 true WorkflowTemplate
# clusterworkflowtemplates cwftmpl argoproj.io/v1alpha1 false ClusterWorkflowTemplate
# cronworkflows cwf argoproj.io/v1alpha1 true CronWorkflow
# The Argo CLI (optional but recommended; the chapter uses it).
# macOS: brew install argo
# Linux: curl -sSL -o argo
# https://github.com/argoproj/argo-workflows/releases/download/v3.5.13/argo-linux-amd64.gz
# (pin a tag; never `latest`)
Pinned, never
latest. Bump these deliberately when a new release ships. The Argo Helm chart's--versionis the CHART version (0.42.0here), which packages a specific WORKFLOW CONTROLLER version (the chart README documents the mapping). Both are pinned.
2. The pipeline: train -> eval -> register -> promote¶
pipeline/recommender-workflow.yaml
is a WorkflowTemplate (re-usable; runnable many times via argo submit
--from workflowtemplate/recommender-pipeline). The DAG:
train ──► eval ──► register ──► promote
(image bookstore/recommender-train:dev)
(eval reuses the SAME image as a Python script)
(register: kubectl create configmap = registry stamp proxy)
(promote: kubectl rollout restart deploy/recommender)
CRD-intrinsic dry-run. Without Argo Workflows installed:
Same precedent askubectl apply --dry-run=client \ -f examples/bookstore/ml/pipeline/recommender-workflow.yaml # error: ... no matches for kind "WorkflowTemplate" in version # "argoproj.io/v1alpha1" — expected, schema-correct.raw-manifests/51-/70-/83-,argocd/,operators/,chaos/,ml/batch/,ml/serve/recommender-inferenceservice.yaml. The built-in SA/Role/RoleBinding triple at the top of the file dry-runs cleanly anywhere. After installing Argo Workflows, a--dry-run=servervalidates against the live API.
Apply the template (and its RBAC scaffolding):
# Prereqs from earlier chapters:
kubectl apply -f examples/bookstore/ml/train/recommender-train-job.yaml
kubectl wait --for=condition=complete job/recommender-train \
-n bookstore-ml --timeout=300s
kubectl apply -f examples/bookstore/ml/serve/recommender-deployment.yaml
kubectl apply -f examples/bookstore/ml/serve/recommender-service.yaml
# The pipeline:
kubectl apply -f examples/bookstore/ml/pipeline/recommender-workflow.yaml
kubectl get workflowtemplate -n bookstore-ml
# NAME AGE
# recommender-pipeline 7s
# Run it once:
argo submit --from workflowtemplate/recommender-pipeline -n bookstore-ml --watch
# (or, if you don't have the argo CLI, generate an ad-hoc Workflow from the
# template by name reference — the documented kubectl-only path:)
cat <<'EOF' | kubectl create -f -
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: recommender-pipeline-
namespace: bookstore-ml
spec:
serviceAccountName: argo-workflow
workflowTemplateRef:
name: recommender-pipeline
EOF
# Watch progress:
argo list -n bookstore-ml
argo get -n bookstore-ml @latest
argo logs -n bookstore-ml @latest --follow
Successful run prints, abbreviated:
[train] config=Config(model_dir='/workspace/model', seed=42, ...)
[train] wrote /workspace/model/model.joblib (size=21678 bytes)
[eval] {"kind":"item-knn-cooccurrence","n_books":200, ...,
"avg_top1_cosine":0.412..., "min_score_gate":0.05,"passed":true}
[register] stamped recommender-model-registry-recommender-pipeline-abc12
uri=pvc://recommender-model/model.joblib score=0.412... at=...
[promote] recommender restarted with new model
(workflow recommender-pipeline-abc12)
Inspect the registry stamp:
kubectl get configmap -n bookstore-ml \
-l app.kubernetes.io/component=recommender-model-registry
kubectl describe configmap -n bookstore-ml \
$(kubectl get cm -n bookstore-ml \
-l app.kubernetes.io/component=recommender-model-registry \
-o jsonpath='{.items[-1:].metadata.name}')
# Data:
# model_uri: pvc://recommender-model/model.joblib
# score: 0.412345
# registered_at: 2026-05-19T02:00:00Z
# workflow: recommender-pipeline-abc12
The illustrative-template-shape of that ConfigMap (so you can kubectl
apply -f it standalone and see one) is
pipeline/register-cm-template.yaml.
The
registerstep is a kind-runnable proxy. In real life the registration target is MLflow Model Registry, KFP Model Registry, or an OCI artifact in your registry — see ch.08 for the stack story. The ConfigMap shape carries the same information (URI + score + timestamp + workflow). Honesty over fidelity: the loop runs end-to-end on kind, and the swap-to-real-registry is one step's image away.
3. Schedule nightly retraining (CronWorkflow)¶
pipeline/recommender-cronworkflow.yaml
schedules the WorkflowTemplate every night at 02:00 UTC. Same CRD model as
CronJob for a single Job (the
Ibryam Periodic Job pattern at the pipeline level):
kubectl apply -f examples/bookstore/ml/pipeline/recommender-cronworkflow.yaml
kubectl get cronworkflow -n bookstore-ml
# NAME AGE SCHEDULE TIMEZONE ...
# recommender-nightly 3s 0 2 * * * UTC ...
# Trigger a run NOW (off-schedule, for testing):
argo cron list -n bookstore-ml
argo submit --from cronworkflow/recommender-nightly -n bookstore-ml --watch
concurrencyPolicy: Replace ensures a slow night's run never collides
with the next night's; startingDeadlineSeconds: 600 bounds the missed-
schedule recovery window if the controller is restarted.
4. Event-driven retraining (Argo Events)¶
pipeline/recommender-eventsource.yaml
listens for a webhook (POST /recommender-dataset-ready on :12000).
pipeline/recommender-sensor.yaml
turns a matching event into a Workflow from the template:
# Argo Events controller + EventBus.
ARGO_EVENTS_VERSION="2.4.7"
helm install argo-events argo/argo-events \
--version "$ARGO_EVENTS_VERSION" \
-n argo-events --create-namespace --wait
# Default EventBus for the namespace.
kubectl apply -n argo-events -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata: { name: default }
spec:
# pinned NATS JetStream version (controller-managed image, not a Helm chart)
# — bump together with the Argo Events chart version
jetstream: { version: "2.10.11" }
EOF
kubectl apply -f examples/bookstore/ml/pipeline/recommender-eventsource.yaml
kubectl apply -f examples/bookstore/ml/pipeline/recommender-sensor.yaml
# Fire an event from inside the cluster:
kubectl run -n argo-events curl-once --rm -it --restart=Never \
--image=curlimages/curl:8.10.1 --command -- \
curl -X POST -H 'content-type: application/json' \
-d '{"dataset_uri":"pvc://recommender-model/"}' \
http://recommender-dataset-eventsource-svc:12000/recommender-dataset-ready
# The Sensor creates a new Workflow in bookstore-ml.
argo list -n bookstore-ml
In prod the webhook is a
git/gitlab/github/s3/pubsub/sqs/kafkaEventSource — pick by your data publication path. The shape is the same: an event inargo-events, a Sensor that creates a Workflow inbookstore-ml.
5. Parameter sweeps with withItems + a note on Katib¶
The simplest HPO pattern Argo Workflows supports natively is withItems:
fan out a single step over a list of parameter values. The next file you'd
add to this tree (kept out so the chapter directory stays small) — sketch:
templates:
- name: sweep
steps:
- - name: train-grid
template: train-step
arguments:
parameters:
- { name: seed, value: "{{item.seed}}" }
- { name: top-k, value: "{{item.k}}" }
withItems:
- { seed: "41", k: "5" }
- { seed: "42", k: "5" }
- { seed: "42", k: "10" }
- { seed: "43", k: "10" }
withItems runs four parallel train-step Pods (one per item), each with
its own seed and k parameter. The Workflow's outputs capture each
run's score; an exit handler picks the best. This is the Argo Workflows
sweep. Katib is the Kubeflow HPO operator: the Experiment CRD
declares a search space (categorical / integer / double / discrete /
log-uniform), a Suggestion (algorithm: random, grid, Bayesian, TPE,
…), a Trial template (the training job), and Katib reconciles the
loop — create Trials, observe metrics, suggest next set of
hyperparameters. Pick Katib when the search is adaptive (Bayesian /
TPE) and the cost of a Trial justifies a smart sampler; withItems for
small grids where you just want N parallel runs.
Out of scope for the Bookstore tree. The recommender's hyperparameter surface is trivially small (seed + top_k); the sweep is illustrative. The chapter mentions Katib because it is the right answer once HPO matters.
6. Argo CD vs Argo Workflows — the two Argos¶
This is the question every team asks once they have both installed.
Part 07 ch.04 named Argo CD as
the GitOps reconciler: a controller that reads argocd/ manifests from
Git and reconciles them into the cluster. This chapter adds Argo
Workflows: a controller that runs pipeline runs.
They are the same shape (a controller + a CRD), at different layers:
| Argo CD | Argo Workflows | |
|---|---|---|
| What it reconciles | the cluster's desired state (Deployments, Services, ConfigMaps, Helm releases, Kustomize, …) | a pipeline run (a DAG of Pods) |
| Trigger | a Git commit (or webhook) | argo submit / CronWorkflow / Argo Events trigger |
| Object the controller owns | Application, AppProject, ApplicationSet |
Workflow, WorkflowTemplate, CronWorkflow |
| Output | the cluster matches Git | a Workflow's status, with per-step Pods and artifacts |
| Lives at | argocd/ (delivery) |
examples/bookstore/ml/pipeline/ (orchestration) |
Both at once is the right setup for ML: Argo CD reconciles pipeline/
(the WorkflowTemplate + CronWorkflow manifests themselves come from Git),
Argo Workflows runs the pipelines. Roll back a pipeline change = revert
the Git commit; CD reapplies; Workflows runs the previous version
tomorrow.
How it works under the hood¶
- The
Workflowcontroller reconcile. When youargo submitaWorkflow(or the Sensor / CronWorkflow creates one), the argo-workflows controller inargowatches newWorkflowobjects, walks the DAG, and for each ready step creates a Pod in the target namespace (herebookstore-ml) with two containers: the step'smaincontainer (your code) and awaitsidecar (Argo's executor that streams logs, captures outputs/artifacts, and reports back). The controller records each step's state as a sub-resource of the Workflow (status.nodes[<STEP>].phase);argo logs @latestis reading those. - Parameter + artifact passing. Parameters are strings inside the
Workflow object — Argo substitutes
{{tasks.eval.outputs.parameters.score}}at template expansion. Artifacts (files) move via the artifact bus: S3/GCS/OSS/Azure/PVC. The Bookstore uses PVC artifacts (therecommender-modelPVC mounted intotrain+eval+register), the simplest path for kind. In prod, configure theartifactRepositoryConfigMap in theargonamespace to point at object storage; switchvolumes/volumeMountsto Argo'sartifacts:pattern; same DAG. WorkflowTemplate+ClusterWorkflowTemplate+workflowTemplateRef. AWorkflowTemplateis a namespaced library of templates and a defaultentrypoint— the Bookstore'srecommender-pipelineis one. AClusterWorkflowTemplateis the cluster-scoped equivalent (one library shared across teams). AWorkflowcanworkflowTemplateRefeither — cheap, parameter-overridable. The Sensor and CronWorkflow in this tree both useworkflowTemplateRef: no duplication, one source of truth for the pipeline shape.CronWorkflowsemantics. Field-for-field analogous to Kubernetes'CronJob(schedule,timezone,concurrencyPolicy: Allow|Forbid|Replace,startingDeadlineSeconds,successfulJobsHistoryLimit/failedJobsHistoryLimit). At each cron tick the controller creates aWorkflowand lets the Workflow controller take over. Replace is the right policy for retraining where two concurrent runs would race on the model PVC; Forbid for runs where overlap is acceptable but you never want them to multiply.- Argo Events: EventSource -> EventBus -> Sensor -> trigger. The
EventSource is a long-running Pod that produces events from its
configured upstream (webhook, Git provider, S3, …). It writes events to
the EventBus (a NATS JetStream cluster the chart installs in
argo-events). A Sensor subscribes to the EventBus, applies adependencies:filter, and on match fires itstriggers[]— for ML that is thek8strigger creating aWorkflow. The Sensor's own SC is PSA-restricted(the chart sets it; the manifest in this dir overrides it explicitly for the trigger workflow'sserviceAccountName). - Kubeflow Pipelines underneath. KFP v1 compiled Python pipelines to Argo Workflow YAML and submitted them — so most of what you read here applies. KFP v2 introduced an IR (a domain-specific format describing the pipeline) and re-implemented the Kubernetes-native backend; the user-facing semantics (DAG, parameters, artifacts, caching) are largely the same. The trade-off is flexibility now vs ML-platform now: Argo Workflows is more flexible and lower-overhead; KFP is more opinionated and ships with the rest of Kubeflow.
- Tekton's split-personality. Tekton models the pipeline as a graph
of
Tasks withWorkspaces (PVCs for inputs/outputs) andResults (string outputs). For CI/CD it is the right shape (build, test, sign, push, deploy). For ML, Argo Workflows' artifact model (object-storage first-class, sweep ergonomics, exit handlers) is the better fit. The rule of thumb stands: Tekton for CI/CD, Argo Workflows for ML pipelines, both can coexist. - Lineage = artifacts + metadata. "Which dataset + code SHA + hyperparameters produced this model?" answers itself when:
- the dataset URI is a Workflow parameter (
{{workflow.parameters.dataset-uri}}), - the code SHA is the image SHA — pin
bookstore/recommender-train:devtobookstore/recommender-train@sha256:...(Helm / Kustomize image digest pinning from Part 07 ch.02), - the hyperparameters are
workflow.spec.arguments.parameters, - the outputs are an artifact (
model.joblib) + a parameter (score) + a registry stamp (the ConfigMap, or an MLflow record). Argo Workflows itself does not own the metadata DB; KFP v2 / MLflow /Weights & Biases do. The pipeline is the producer of lineage data; the registry/metadata store is its consumer. - PSA on every workflow Pod. The
WorkflowTemplate'spodSpecPatchappliesrunAsNonRoot: true, non-root UID,seccompProfile: RuntimeDefaultto every step Pod's PodSpec; each step's containersecurityContextaddsallowPrivilegeEscalation: false,readOnlyRootFilesystem: true,capabilities.drop: ["ALL"]. The same shape every Bookstore ML pod uses — pipeline Pods are not exempt from PSA.
Production notes¶
In production: install Argo Workflows (and Argo Events, if you need event-driven retraining) via pinned Helm charts in their own namespaces (
argo,argo-events). Treat upgrades like any control- plane component — read the chart changelog, watch for CRD-version bumps, run a soak test. Never install from areleases/latestURL.In production: artifacts live in object storage, not on per- cluster PVCs. Configure the
artifactRepositoryConfigMap in theargonamespace to point at S3/GCS/Azure (use the Pod's cloud identity — Part 10 ch.03 — not static keys in a Secret). The Bookstore's PVC artifact path is the kind-runnable fallback.In production: pipeline manifests live in Git, reconciled by Argo CD (Part 07 ch.04). The
WorkflowTemplate/CronWorkflow/EventSource/Sensorare reviewed PRs like every other manifest; Argo CD applies; Argo Workflows runs the pipeline. Two Argos, two layers, one Git.In production: lineage is not optional. Pin the train step's image to a digest (
@sha256:...); record the dataset URI and hyperparameters as Workflow parameters; write the result to a model registry (MLflow, KFP Model Registry, OCI artifact) and a metadata store (MLflow tracking, KFP metadata, W&B). Where you stop, the audit and the rollback story stops.In production: the
promotestep is a GitOps commit, not akubectl rollout restart. The pipeline step writes a PR that bumps theInferenceService.spec.predictor.storageUrito the new versioned URI and setscanaryTrafficPercent: 20; review/merge; Argo CD applies; KServe shifts traffic (Part 12 ch.06). The kind-runnable proxy in this tree restarts the Deployment for illustration; it is a step's image change to swap.In production: scheduled retraining belongs to the pipeline, not a CronJob. A
CronWorkflowruns the whole DAG (train + eval + register + promote) under one identity, with one history, with one place to read logs and rollback. ACronJobper step reintroduces the runbook problem this chapter exists to retire.In production: event-driven retraining is the right shape when data arrival is the trigger. A
git pushto the dataset repo, an S3Object Createdevent, a Kafka topic message — all become an Argo EventsEventSource. The Sensor maps them to a Workflow trigger; no external scheduler. The pattern is Daemon Service + Periodic Job (Ibryam): one listener, one pipeline. The webhook EventSource in this tree has no authentication. In production, put it behind an Ingress with auth (OAuth2 Proxy, mTLS, or a token header check); or use a git/s3/pubsub EventSource where authentication is part of the event source itself.In production: retry-safety is per-step, not pipeline-wide.
trainis retry-safe: PyTorch/numpy/sklearn workloads idempotently reproducemodel.joblibfrom the same dataset.evalis retry-safe: it's a pure read ofmodel.joblibplus a fresh metric write.registeris retry-safe: the workflow-id-suffixed ConfigMap is unique per attempt.promoteis NOT retry-safe:kubectl annotatemay commit beforerollout restartfails (e.g., Pod quota exceeded), leaving the registry stamp marked promoted but the serving Deployment still on the old image. Ifkubectl annotatesucceeds but the subsequentrollout restartfails (a Pod quota was exceeded, a transient API server hiccup), the registry stamp says "promoted" while the serving Deployment still serves the old image — a half-promoted state. In real GitOps promotion (PR + Argo CD), the equivalent failure is the PR merged but the sync stuck on a CRD validation error; in both cases, the next reconcile loop resolves it, but a registry stamp is not a deploy. If thetrainstep's Pod is OOM-killed mid-training, Argo's retryStrategy retries the step from the beginning of its container (Argo doesn't restart the failed Pod; it creates a new Pod for the new attempt). The PVC artifact from the failed attempt is whatever bytes the trainer wrote before OOM; the new attempt overwrites them. This is whytrainshould checkpoint to a subdirectory and pick up the latest valid checkpoint on start — the chapter on distributed training covers this pattern (see ./04-distributed-training.md).In production: observability — every workflow Pod's logs and events go to the cluster's logging/metrics stack (Part 06 ch.01). Argo Workflows exposes Prometheus metrics: workflow duration, step failure rate, queue depth. Alert on success rate dropping and p95 duration drifting.
In production: PSA on every workflow Pod. The
WorkflowTemplate.spec.podSpecPatch+ each step'scontainer.securityContextkeep the workflow Pods restricted-admissible. The Argo Workflows controller lives inargo(operator's own namespace); the workflow Pods land inbookstore-ml. No exceptions.In production: artifacts in object storage; parameters for scalars. The PVC artifact path is the kind-runnable fallback; the prod shape is
artifactRepositorypointing at S3/GCS/Azure with Pod cloud identity. Parameters (scores, URIs, hyperparameters) stay inline. The Bookstore's PVC story is the simplest illustrative shape; swap for object storage when artifacts cross PVC zones or grow beyond a few GiB.In production: CRD-backed manifests carry the CRD-intrinsic header note. A reader who client-dry-runs a
WorkflowTemplate/CronWorkflow/EventSource/Sensorwithout the operator installed seesno matches for kind …— expected, schema-correct. Every Argo Workflows / Argo Events file in this tree carries that header note, identical precedent to the guide's other CRD-backed manifests.In production: HPO with cost in mind. Bayesian / TPE samplers in Katib find a good hyperparameter set in fewer trials than a grid — every trial is a training Pod, and every training Pod is GPU- hours. Cost-aware HPO is the cheapest ML optimisation most teams ignore (Part 12 ch.08).
Quick Reference¶
# Install Argo Workflows + Argo Events (pinned; own namespaces)
ARGO_WORKFLOWS_VERSION="0.42.0"
ARGO_EVENTS_VERSION="2.4.7"
helm repo add argo https://argoproj.github.io/argo-helm
helm install argo-workflows argo/argo-workflows \
--version "$ARGO_WORKFLOWS_VERSION" -n argo --create-namespace --wait
helm install argo-events argo/argo-events \
--version "$ARGO_EVENTS_VERSION" -n argo-events --create-namespace --wait
# Default EventBus (Argo Events needs one):
kubectl apply -n argo-events -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata: { name: default }
# pinned NATS JetStream version (controller-managed image, not a Helm chart)
# — bump together with the Argo Events chart version
spec: { jetstream: { version: "2.10.11" } }
EOF
# Apply the recommender pipeline + cron + event trigger
kubectl apply -f examples/bookstore/ml/pipeline/recommender-workflow.yaml
kubectl apply -f examples/bookstore/ml/pipeline/recommender-cronworkflow.yaml
kubectl apply -f examples/bookstore/ml/pipeline/recommender-eventsource.yaml
kubectl apply -f examples/bookstore/ml/pipeline/recommender-sensor.yaml
# Run the pipeline once
argo submit --from workflowtemplate/recommender-pipeline -n bookstore-ml --watch
argo list -n bookstore-ml
argo get -n bookstore-ml @latest
argo logs -n bookstore-ml @latest
# Schedule + event-trigger
kubectl get cronworkflow -n bookstore-ml
kubectl get eventsource,sensor -n argo-events
Minimal skeleton (Argo Workflows WorkflowTemplate, DAG, PSA-restricted
step):
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: my-pipeline
namespace: my-namespace
spec:
serviceAccountName: my-workflow-sa # not the default SA!
entrypoint: main
podSpecPatch: |
securityContext:
runAsNonRoot: true
runAsUser: 65532
seccompProfile: { type: RuntimeDefault }
templates:
- name: main
dag:
tasks:
- { name: step-a, template: a }
- { name: step-b, template: b, depends: "step-a.Succeeded" }
- name: a
container:
image: <YOUR-IMAGE>
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities: { drop: ["ALL"] }
- name: b
container:
image: <YOUR-IMAGE>
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities: { drop: ["ALL"] }
Checklist:
- Argo Workflows installed via pinned Helm in own namespace (
argo) - Pipeline encoded as a
WorkflowTemplate(re-usable) — not a one-shotWorkflow -
CronWorkflowfor scheduled retraining; Argo Events for data-driven retraining - Per-pipeline
ServiceAccount+ namespace-scoped RBAC (not the default SA, no cluster-wide rights) - PSA-
restrictedSC on every workflow Pod (podSpecPatch+ container SC) - Lineage step records dataset + code + model SHA (dataset URI + image SHA + hyperparameters as Workflow parameters; outputs in a registry + metadata store)
- Promote is a GitOps commit in prod (PR -> Argo CD -> KServe), not an in-pipeline
kubectl rollout restart
Test your understanding¶
Try each before opening the answer drawer. The act of trying is the exercise; the answer is the check.
-
Why is Argo Workflows for pipeline runs and Argo CD for desired state — what would go wrong if you used Argo CD for the training pipeline?
Show answer
Argo CD reconciles to a steady desired state — "this set of manifests should exist in the cluster." A training pipeline is a run-to-completion DAG: ingest → preprocess → train → eval → register. There's no "steady state" — each run produces a new model and exits. Forcing Argo CD to manage it would mean either (a) recreating the Workflow object on every commit (Argo CD reapplies, but the previous run's state is unclear) or (b) treating runs as state (incoherent with GitOps). The split is: Workflows define the run, Argo CD defines what Workflows + WorkflowTemplates + permissions exist. Two tools, two concerns. -
A training run took 4 hours and produced a model that scores 87% on eval. Three months later, you want to know "what dataset + code + hyperparameters made this model?" The Workflow logs are gone (retention). What's the design fix?
Show answer
Lineage as data, not logs. Every training step must write a structured record (dataset URI + git SHA + container digest + hyperparameters + eval score) to a durable metadata store: MLflow tracking, KFP metadata, or a Postgres-backed registry. The model artifact in the registry references that metadata record by ID. Logs are ephemeral; lineage records are durable. Without this discipline, you have a model and no way to reproduce it — a "model lottery." The chapter's lineage-step pattern is non-negotiable for any model that reaches production. -
You set up Argo Events with an
EventSourcewatching an S3 bucket for new training data, and aSensorthat triggers aWorkflow. New data arrives but no Workflow starts. What do you check?
Show answer
(1) `kubectl get eventsource -o yaml` — is the EventSource healthy and connected? Check its logs. (2) `kubectl get sensor -o yaml` and look at the sensor's pod logs — is it receiving events but failing to trigger? Common cause: the sensor's trigger template has a parameter referencing a missing event field. (3) Permissions — the sensor's SA needs `create` on `workflows.argoproj.io`. (4) Event filtering — the sensor's `dependencies[].filters` may be rejecting the events (wrong S3 event type, wrong key prefix). (5) Network — if EventSource pulls from S3 via webhook, the public URL must be reachable; if it polls, AWS credentials must work. Walk EventSource → EventBus → Sensor → Workflow as a chain; each link has its own logs. -
Hands-on: write a small
WorkflowTemplatethat runs three steps: download a CSV, train a tiny scikit-learn model, log results. Pass the dataset URI as a parameter. Now submit two runs with different URIs. Where are the artifacts stored?
What you should see
With default Argo Workflows config and a `MinIO` artifact repository, each step's `outputs.artifacts` is uploaded to MinIO with a path like `workflows// / `. Two runs have different paths because the workflow names are unique. The `inputs.parameters.dataset_uri` differ but the `outputs.artifacts.model` differ too. The discipline: artifact paths must include a unique run identifier; parameters must include enough metadata (dataset URI, git SHA) that lineage is traceable from the artifact alone.
Further reading¶
- Ibryam & Huß, Kubernetes Patterns 2e — Periodic Job (ch.7)* and
Daemon Service*** — the patterns underneath
CronWorkflow(periodic retraining) and Argo Events (a long-running event listener producing triggers). - Burns et al., Kubernetes Up & Running — Jobs / CronJobs chapter
for the lower-layer primitives
WorkflowPods reduce to. - Rosso et al., Production Kubernetes, ch.8 "Workload Management" — the production basis for CRD-driven controllers like Argo Workflows.
- Official: Argo Workflows docs
https://argo-workflows.readthedocs.io/ (
Workflow,WorkflowTemplate,CronWorkflow, DAG vs Steps, artifacts,withItems); Argo Events docs https://argoproj.github.io/argo-events/ (EventSource,Sensor, EventBus); Kubeflow Pipelines docs https://www.kubeflow.org/docs/components/pipelines/ (Pipeline,Run, the KFP SDK, v2 IR + metadata); Katib docs https://www.kubeflow.org/docs/components/katib/ (Experiment,Trial,Suggestion, Bayesian/TPE/grid samplers); Tekton Pipelines docs https://tekton.dev/docs/pipelines/ for the CI-shaped alternative.