Appendix B — Glossary

Seeded from the original guide and expanded as new Parts shipped. Every significant term and acronym introduced across the guide (currently Parts 00–15, 115 chapters) appears here with a precise one-to-three-sentence definition and a link to the chapter that defines it or covers it most fully. Definitions are deliberately brief; the linked chapter has the full treatment and the "why".

Sections are organized by domain, roughly in the order the guide introduces the terms. Use your browser's find function to locate a specific term; the headword is in bold.

Cluster architecture & the API model

Term	Definition	Covered in
Kubernetes (k8s)	An open-source platform that runs containerized workloads across a cluster of machines, continuously reconciling declared desired state.	00-foundations/01-why-kubernetes.md
Cluster	A set of worker nodes plus a control plane that together run and manage containerized workloads as one logical system.	00-foundations/03-architecture-overview.md
Control plane	The components that make global decisions and detect/respond to events: kube-apiserver, etcd, scheduler, controller-manager (and cloud-controller-manager).	00-foundations/04-control-plane-deep-dive.md
Node	A worker machine (VM or physical) that runs Pods; each runs a kubelet, a container runtime, and kube-proxy, and reports capacity/health to the control plane.	00-foundations/05-node-components.md
kube-apiserver	The front door of the cluster: a REST API server that authenticates, authorizes, admits, and validates objects and is the only component that talks to etcd.	00-foundations/04-control-plane-deep-dive.md
etcd	The consistent, distributed key-value store (Raft consensus) that holds the entire cluster state; the single source of truth, written only by the apiserver.	00-foundations/04-control-plane-deep-dive.md
kube-scheduler	The control-plane component that watches for unscheduled Pods and binds each to a feasible node by filtering then scoring nodes.	04-scheduling/01-scheduler-and-nodes.md
kube-controller-manager	A single binary running the built-in controllers (Deployment, ReplicaSet, Node, Job, EndpointSlice, …), each a reconciliation loop.	00-foundations/04-control-plane-deep-dive.md
cloud-controller-manager	The control-plane component that integrates a cloud provider (LoadBalancer services, node lifecycle, routes); absent on bare local clusters.	00-foundations/04-control-plane-deep-dive.md
kubelet	The node agent that watches the apiserver for Pods assigned to its node, drives the container runtime via CRI to match the PodSpec, and runs probes.	00-foundations/05-node-components.md
kube-proxy	The node component that programs iptables or IPVS rules so Service virtual IPs load-balance to backing Pods. On clusters using a full-eBPF CNI (e.g. Cilium kube-proxy replacement), kube-proxy is replaced by the CNI and does not run.	02-networking/02-services.md
Container runtime	The software that actually runs containers on a node (e.g. containerd), invoked by the kubelet through the CRI.	00-foundations/05-node-components.md
CRI (Container Runtime Interface)	The gRPC API between the kubelet and the container runtime, decoupling Kubernetes from any specific runtime.	00-foundations/05-node-components.md
containerd	A widely-used CRI-compatible container runtime; the runtime inside kind/k3d nodes in this guide.	00-foundations/05-node-components.md
pause container (sandbox)	The tiny per-Pod container that holds the Pod's shared network/IPC namespaces so app containers can join them.	00-foundations/05-node-components.md
Object	A persisted entity in the API with `apiVersion`, `kind`, `metadata`, `spec` (desired), and usually `status` (observed).	00-foundations/06-declarative-api-model.md
GVK (Group/Version/Kind)	`apiVersion` + `kind`; names the type and routes a request to the controller and storage that own it.	00-foundations/06-declarative-api-model.md
GVR (Group/Version/Resource)	The lowercase plural REST form of a GVK (e.g. `apps/v1/deployments`) used in API paths and RBAC rules.	appendix/C-yaml-and-api-conventions.md
`spec` vs `status`	The universal divide: humans/controllers write `spec` (desired); the owning controller/kubelet writes `status` (observed).	00-foundations/06-declarative-api-model.md
Declarative model	You describe desired state in objects and controllers continuously make reality match — you assert the destination, not the steps.	00-foundations/06-declarative-api-model.md
controller	A control loop that watches a resource's desired state and continuously acts to drive actual state toward it.	00-foundations/06-declarative-api-model.md
reconciliation	The core operating principle: observe actual, diff against desired, act to close the gap — repeated forever.	00-foundations/06-declarative-api-model.md
level-triggered	Reacting to the current state (re-derived every loop) rather than to individual events; why Kubernetes self-heals after missed events.	00-foundations/06-declarative-api-model.md
resourceVersion	An opaque per-object value reflecting etcd's revision at last write; the basis of optimistic concurrency and watch resumption.	00-foundations/06-declarative-api-model.md
Optimistic concurrency	Updates carry the `resourceVersion` they read; the apiserver commits only if it is still current, else returns `Conflict (409)`.	00-foundations/06-declarative-api-model.md
Watch	A streaming API that delivers object change events from a `resourceVersion`; how controllers and informers stay current.	00-foundations/04-control-plane-deep-dive.md
`kubectl apply` (3-way merge)	Declarative update that merges your manifest, the live object, and the last-applied config so it changes only fields you manage.	00-foundations/06-declarative-api-model.md
Server-Side Apply (SSA)	Apply where the apiserver tracks per-field ownership in `managedFields` and reports conflicts between managers.	00-foundations/06-declarative-api-model.md
managedFields / field manager	The `metadata.managedFields` record of which manager owns each field; the mechanism that makes co-ownership (e.g. Git + HPA) safe.	appendix/C-yaml-and-api-conventions.md
kubectl	The official CLI that talks to kube-apiserver to create, inspect, update, and delete resources.	00-foundations/07-local-cluster-setup.md
kubeconfig	The file (`~/.kube/config` or `$KUBECONFIG`) listing clusters, users, and contexts, with a current-context pointer.	00-foundations/07-local-cluster-setup.md
Context	A named (cluster + user + default namespace) tuple in kubeconfig; switching context switches what `kubectl` targets.	00-foundations/07-local-cluster-setup.md
kind / k3d	Tools that run a full Kubernetes cluster locally inside Docker containers (kind = "Kubernetes IN Docker"; k3d = k3s in Docker); used for every hands-on exercise.	00-foundations/07-local-cluster-setup.md
OCI image	A standardized container image format (layers + config); what a registry stores and the kubelet pulls.	00-foundations/02-containers-and-images.md
Image digest	The content-addressed `sha256:` identifier of an image; pinning by digest (not a mutable tag) makes deploys reproducible.	00-foundations/02-containers-and-images.md
distroless image	A minimal image with no shell/package manager (e.g. `gcr.io/distroless/static:nonroot`); small attack surface — debug with `kubectl debug`, not `exec sh`.	00-foundations/02-containers-and-images.md
Namespace	A virtual cluster scope for grouping/isolating resources and the unit for quotas, RBAC, and (with policy) network isolation. Introduced where the `bookstore` namespace is created (01-core-workloads/03); multi-tenant depth in 08-day-2-operations/04.	01-core-workloads/03-resources-and-qos.md
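
The optimistic concurrency and `resourceVersion` entries above can be illustrated with a minimal Python sketch. The `Store` class, the `Conflict` exception, and the `set_replicas` helper are hypothetical stand-ins, not a Kubernetes client API, and the sketch uses an integer counter where the real `resourceVersion` is an opaque string.

```python
class Conflict(Exception):
    """Stands in for the apiserver's HTTP 409 Conflict response."""


class Store:
    """Toy in-memory object store (hypothetical, not a Kubernetes client)."""

    def __init__(self):
        self._objects = {}   # name -> (resource_version, spec)
        self._revision = 0   # global counter, loosely like etcd's revision

    def create(self, name, spec):
        self._revision += 1
        self._objects[name] = (self._revision, dict(spec))

    def get(self, name):
        version, spec = self._objects[name]
        return version, dict(spec)

    def update(self, name, read_version, spec):
        current_version, _ = self._objects[name]
        # Commit only if the caller read the version that is still current.
        if read_version != current_version:
            raise Conflict(f"{name}: read {read_version}, current {current_version}")
        self._revision += 1
        self._objects[name] = (self._revision, dict(spec))
        return self._revision


def set_replicas(store, name, replicas, retries=3):
    """Read-modify-write, re-reading and retrying on Conflict."""
    for _ in range(retries):
        version, spec = store.get(name)
        spec["replicas"] = replicas
        try:
            return store.update(name, version, spec)
        except Conflict:
            continue
    raise Conflict(f"{name}: gave up after {retries} attempts")


store = Store()
store.create("web", {"replicas": 1})
stale_version, _ = store.get("web")      # a slow writer reads version 1
set_replicas(store, "web", 3)            # another writer commits first
try:
    store.update("web", stale_version, {"replicas": 5})
    stale_write_rejected = False
except Conflict:
    stale_write_rejected = True

print(stale_write_rejected, store.get("web"))
```

The stale write is rejected rather than silently overwriting the newer value; a well-behaved client re-reads the object and retries, as `set_replicas` does.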

Workloads

Term	Definition	Covered in
Pod	The smallest deployable unit: one or more containers sharing a network namespace (one IP), IPC, and volumes, always co-scheduled on one node.	01-core-workloads/01-pods.md
initContainer	A container that runs to completion before app containers start; used for setup/wait-for-dependency steps.	01-core-workloads/01-pods.md
sidecar	A helper container co-located in a Pod (logging/proxy/sync). From v1.29, a native sidecar is an `initContainers[]` entry with `restartPolicy: Always`: it starts before the app containers, runs alongside them, and does not block Pod startup or Job completion.	01-core-workloads/01-pods.md
Adapter / Ambassador	Structural multi-container patterns: an adapter normalizes a container's output; an ambassador proxies its outbound connections.	01-core-workloads/01-pods.md
liveness probe	A periodic check; if it fails, the kubelet restarts the container (recovers a hung process).	01-core-workloads/02-health-and-lifecycle.md
readiness probe	A check that gates Service traffic; a not-Ready Pod stops receiving traffic because its endpoint is marked not ready in the Service's EndpointSlice. The container is not restarted.	01-core-workloads/02-health-and-lifecycle.md
startup probe	A probe that disables liveness/readiness until the app has started, for slow-starting containers.	01-core-workloads/02-health-and-lifecycle.md
Pod lifecycle / phase	`Pending → Running → Succeeded/Failed`; plus container states and the `Ready` condition the kubelet writes.	01-core-workloads/02-health-and-lifecycle.md
`preStop` hook	A container lifecycle hook run before SIGTERM on termination; commonly a short sleep to drain in-flight connections.	01-core-workloads/02-health-and-lifecycle.md
Graceful termination	The shutdown sequence: removed from Endpoints, `preStop`, SIGTERM, then SIGKILL after `terminationGracePeriodSeconds`.	01-core-workloads/02-health-and-lifecycle.md
Resource request	The amount of CPU/memory a container is guaranteed and scheduled against (reserved on a node).	01-core-workloads/03-resources-and-qos.md
Resource limit	The maximum CPU/memory a container may use; exceeding memory → OOMKilled, exceeding CPU → throttled.	01-core-workloads/03-resources-and-qos.md
QoS class	`Guaranteed` / `Burstable` / `BestEffort`, derived from requests vs limits; drives eviction order under node pressure.	01-core-workloads/03-resources-and-qos.md
OOMKilled	A container terminated (exit code 137) by the kernel OOM killer, either for exceeding its memory limit or because the node ran out of memory.	01-core-workloads/03-resources-and-qos.md
LimitRange	A namespace policy setting default/min/max requests and limits for containers that omit them.	01-core-workloads/03-resources-and-qos.md
ResourceQuota	A namespace cap on aggregate resource consumption and object counts (CPU/memory, pods, PVCs, …).	08-day-2-operations/04-multi-tenancy-and-namespaces.md
ReplicaSet	A controller ensuring a specified number of identical Pod replicas; normally owned by a Deployment.	01-core-workloads/04-replicasets-and-deployments.md
Deployment	A workload resource managing stateless replicas via ReplicaSets, providing rolling updates, rollback, and revision history.	01-core-workloads/04-replicasets-and-deployments.md
Rolling update	The default Deployment strategy: incrementally replace old Pods with new ones bounded by `maxSurge`/`maxUnavailable`.	01-core-workloads/04-replicasets-and-deployments.md
Revision / rollback	A Deployment's recorded ReplicaSet history; `kubectl rollout undo` re-applies an earlier revision.	01-core-workloads/04-replicasets-and-deployments.md
StatefulSet	A workload for stateful apps: stable per-Pod network identity and ordinal, ordered rollout, and a per-Pod PVC via `volumeClaimTemplates`.	01-core-workloads/05-statefulsets.md
Headless Service	A Service with `clusterIP: None` that returns Pod IPs (and per-Pod DNS) directly; required by StatefulSets.	01-core-workloads/05-statefulsets.md
`volumeClaimTemplates`	A StatefulSet field that provisions a dedicated, stable PVC per Pod ordinal.	01-core-workloads/05-statefulsets.md
DaemonSet	Ensures a copy of a Pod runs on every (or a selected subset of) node — for node-level agents (log shippers, exporters, CNI).	01-core-workloads/06-daemonsets.md
Job	A workload that runs Pods to successful completion (with completions/parallelism/backoff), then stops.	01-core-workloads/07-jobs-and-cronjobs.md
CronJob	A controller that creates Jobs on a repeating cron schedule, with concurrency and history policies.	01-core-workloads/07-jobs-and-cronjobs.md
`ttlSecondsAfterFinished`	A Job field that auto-deletes the Job (and its Pods) a set time after it finishes.	01-core-workloads/07-jobs-and-cronjobs.md
Recreate strategy	A Deployment strategy that terminates all old Pods before creating new ones (brief downtime; needed when versions can't coexist).	01-core-workloads/08-deployment-strategies.md
Blue-green deployment	Run two full environments and switch traffic atomically from old (blue) to new (green).	01-core-workloads/08-deployment-strategies.md
Canary deployment	Shift a small fraction of traffic to a new version, observe, then promote or roll back.	01-core-workloads/08-deployment-strategies.md
Singleton service / leader election	Ensuring exactly one active instance (e.g. a controller) via a Lease-based leader election.	08-day-2-operations/05-operators-and-crds.md
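
The QoS class entry above says the class is derived from requests versus limits. A minimal Python sketch of that derivation follows. The `qos_class` function and its dict-based input shape are hypothetical, only `cpu` and `memory` are considered, and quantities are compared as plain strings rather than parsed.

```python
def qos_class(containers):
    """Classify a Pod from its containers' cpu/memory requests and limits.

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    any_set = False          # does any container set any request or limit?
    all_guaranteed = True    # does every container have request == limit?
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is omitted defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"limits": {"cpu": "500m", "memory": "256Mi"}}]))
print(qos_class([{"requests": {"cpu": "100m"}}]))
print(qos_class([{}]))
```

Under node memory pressure, `BestEffort` Pods are the first candidates for eviction and `Guaranteed` Pods the last, which is why setting requests equal to limits matters for critical workloads.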

Networking

Term	Definition	Covered in
Kubernetes networking model	Every Pod gets its own IP and all Pods can reach each other without NAT; the contract a CNI plugin implements.	02-networking/01-networking-model.md
CNI (Container Network Interface)	The plugin API that wires Pod network namespaces and assigns Pod IPs (Calico, Cilium, kindnet, …).	02-networking/01-networking-model.md
Pod IP / Pod CIDR	The per-Pod IP and the cluster's Pod address range, allocated by the CNI/IPAM.	02-networking/01-networking-model.md
Service	A stable virtual endpoint (name + ClusterIP) that load-balances to a label-selected, dynamic set of Pods.	02-networking/02-services.md
ClusterIP	The default Service type: a stable in-cluster virtual IP, not reachable from outside the cluster.	02-networking/02-services.md
NodePort	A Service type that also exposes the Service on a static port on every node.	02-networking/02-services.md
LoadBalancer	A Service type that provisions an external cloud load balancer (no-op on bare local clusters).	02-networking/02-services.md
ExternalName	A Service that maps a name to an external DNS CNAME, with no proxying.	02-networking/03-dns-and-discovery.md
EndpointSlice	The scalable object listing a Service's ready backend Pod IPs/ports; replaced the legacy `Endpoints` object.	02-networking/02-services.md
Endpoints	The legacy per-Service object listing backend addresses; superseded by EndpointSlice.	02-networking/02-services.md
Headless Service (discovery)	`clusterIP: None`; DNS returns the Pod IPs directly for client-side load balancing / stable identities.	02-networking/03-dns-and-discovery.md
CoreDNS	The cluster DNS server (a Deployment in `kube-system`) that resolves Service/Pod names to ClusterIPs/Pod IPs.	02-networking/03-dns-and-discovery.md
Service FQDN	`<SVC>.<NS>.svc.cluster.local`; the fully-qualified name CoreDNS resolves for a Service.	02-networking/03-dns-and-discovery.md
`ndots` / search domains	`resolv.conf` settings that cause short names to be tried with appended search domains (a classic latency footgun).	02-networking/03-dns-and-discovery.md
Ingress	An API object defining HTTP/HTTPS host/path routing from outside the cluster to Services, realized by an ingress controller.	02-networking/04-ingress.md
Ingress controller	The component (e.g. ingress-nginx) that watches Ingress objects and programs an actual L7 proxy.	02-networking/04-ingress.md
IngressClass	Selects which ingress controller implements a given Ingress object.	02-networking/04-ingress.md
TLS termination	Decrypting HTTPS at the edge (Ingress/Gateway) using a certificate from a TLS Secret.	02-networking/04-ingress.md
Gateway API	The successor to Ingress: role-oriented CRDs (`GatewayClass`, `Gateway`, `HTTPRoute`) for richer, portable L4/L7 routing. (Also see 13 ch.07 for the v2 edge — Istio Gateway + Coraza WAF + per-tenant rate limiting via Envoy.)	02-networking/05-gateway-api.md
GatewayClass / Gateway / HTTPRoute	Gateway API kinds: the implementation class, a configured listener/data-plane, and the route rules attached to it.	02-networking/05-gateway-api.md
NetworkPolicy	A namespaced firewall: label-selected ingress/egress allow rules; a Pod selected by any policy defaults to deny for that direction.	02-networking/06-network-policies.md
default-deny	A NetworkPolicy selecting all Pods with no rules, so only explicitly-allowed traffic flows (zero-trust baseline).	02-networking/06-network-policies.md
Network segmentation	The pattern of partitioning Pod-to-Pod traffic with NetworkPolicies so a compromise can't move laterally.	02-networking/06-network-policies.md
Service mesh	An L7 networking layer (Istio/Linkerd) adding mTLS, traffic shaping, and telemetry via sidecars/proxies; conceptual-only in this guide.	02-networking/02-services.md
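
To make the `ndots` / search domains entry above concrete, here is a small Python sketch of the order in which a resolver tries names. The `candidate_names` function is hypothetical. The search list assumes a Pod in the `bookstore` namespace with the default `cluster.local` cluster domain, and `ndots=5` is the value Kubernetes writes for Pods using the default cluster DNS policy.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a resolver would try, in order, for a lookup."""
    if name.endswith("."):
        return [name]                          # already fully qualified
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded           # enough dots: try as-is first
    return expanded + [absolute]               # too few dots: search list first


search = ["bookstore.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for name in ("api", "example.com", "example.com."):
    print(name, "->", candidate_names(name, search))
```

The resolver stops at the first name that resolves, so a short Service name like `api` succeeds on the first try, while an external name such as `example.com` goes through every search domain before the real lookup. Appending a trailing dot skips the search list entirely.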

Configuration & storage

Term	Definition	Covered in
ConfigMap	An API object holding non-confidential key/value config, consumable as env vars, `envFrom`, or mounted files.	03-config-and-storage/01-configmaps.md
`envFrom`	A field that injects all keys of a ConfigMap/Secret as environment variables into a container.	03-config-and-storage/01-configmaps.md
Immutable ConfigMap/Secret	`immutable: true` freezes the data, improving performance and preventing accidental edits (replace, don't mutate).	03-config-and-storage/01-configmaps.md
Secret	Like a ConfigMap but for sensitive data; values are base64-encoded (not encrypted by default) — RBAC and encryption-at-rest are separate.	03-config-and-storage/02-secrets.md
Encryption at rest	Apiserver-level encryption of Secret data in etcd, configured via an `EncryptionConfiguration` (optionally a KMS provider).	05-security/04-secrets-and-cluster-hardening.md
External Secrets Operator	An operator that syncs secrets from an external store (Vault/cloud SM) into Kubernetes Secrets.	05-security/04-secrets-and-cluster-hardening.md
Sealed Secrets	A controller that decrypts a Git-safe `SealedSecret` into a real Secret in-cluster.	05-security/04-secrets-and-cluster-hardening.md
SOPS / KSOPS	SOPS encrypts secret values in Git; KSOPS is the Kustomize plugin that decrypts a SOPS `secretGenerator` at build time.	07-delivery/02-packaging-kustomize.md
Downward API	A mechanism exposing Pod/container metadata (name, namespace, labels, resource limits) to the container via env or files.	03-config-and-storage/03-volumes.md
Volume	Storage mounted into a Pod's containers; lifetime and semantics depend on the volume type.	03-config-and-storage/03-volumes.md
emptyDir	A scratch volume created empty when a Pod is assigned to a node and deleted with the Pod (the canonical writable path under a read-only root FS).	03-config-and-storage/03-volumes.md
hostPath	A volume mounting a path from the node's filesystem; powerful and risky — forbidden by PSA `restricted`/`baseline`.	03-config-and-storage/03-volumes.md
projected volume	A volume combining several sources (ServiceAccount token, ConfigMap, Secret, downwardAPI) into one directory.	03-config-and-storage/03-volumes.md
PersistentVolume (PV)	A cluster-scoped piece of provisioned storage with a lifecycle independent of any Pod.	03-config-and-storage/04-persistent-storage.md
PersistentVolumeClaim (PVC)	A namespaced request for storage (size, access mode, StorageClass) that binds to a PV, giving a Pod durable storage.	03-config-and-storage/04-persistent-storage.md
StorageClass	A named storage "tier" enabling dynamic PV provisioning via a CSI driver, with parameters and a reclaim/binding policy.	03-config-and-storage/04-persistent-storage.md
CSI (Container Storage Interface)	The plugin API for storage drivers that provision/attach/mount volumes (cloud disks, local-path, …).	03-config-and-storage/04-persistent-storage.md
Access mode	A PV/PVC capability: `ReadWriteOnce` (one node), `ReadWriteOncePod`, `ReadOnlyMany`, `ReadWriteMany`.	03-config-and-storage/04-persistent-storage.md
`WaitForFirstConsumer`	A StorageClass `volumeBindingMode` that delays PV binding until a Pod is scheduled, so storage lands on the right topology.	03-config-and-storage/04-persistent-storage.md
Reclaim policy	What happens to a PV when its PVC is deleted: `Delete` (free the disk) or `Retain` (keep for manual recovery).	03-config-and-storage/04-persistent-storage.md
VolumeSnapshot / VolumeSnapshotClass	A point-in-time copy of a PVC via a CSI snapshotter, and the class/driver that creates it.	03-config-and-storage/05-stateful-data-patterns.md
Stateful data patterns	Operational practices for data in Kubernetes: backups, migrations as Jobs, single-writer access, operators for databases.	03-config-and-storage/05-stateful-data-patterns.md
`fsGroup` / `fsGroupChangePolicy`	Pod-level `securityContext` fields: `fsGroup` makes a volume group-owned by the given GID so a non-root process can write to it; `fsGroupChangePolicy: OnRootMismatch` skips a slow recursive chown when ownership already matches.	05-security/02-pod-security.md
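
The Secret entry above notes that values are base64-encoded, not encrypted. A short Python sketch shows why that distinction matters; the plaintext value is a made-up example.

```python
import base64

plaintext = "s3cr3t-password"   # hypothetical value, for illustration only

# What ends up under a Secret's `data` field: plain base64, no key involved.
encoded = base64.b64encode(plaintext.encode()).decode()

# Anyone who can read the Secret object can reverse it.
decoded = base64.b64decode(encoded).decode()

print(encoded)
print(decoded == plaintext)
```

Because decoding needs no key, protection has to come from RBAC on who can read Secrets and from encryption at rest in etcd.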

Scheduling

Term	Definition	Covered in
Scheduling (filter & score)	The scheduler's two phases: discard infeasible nodes (predicates), then rank the rest (priorities) and bind the best.	04-scheduling/01-scheduler-and-nodes.md
nodeSelector	The simplest node constraint: schedule only onto nodes carrying the given labels.	04-scheduling/02-affinity-taints-topology.md
Node affinity	Expressive node constraints (`required`/`preferred`) over node labels.	04-scheduling/02-affinity-taints-topology.md
Pod affinity / anti-affinity	Co-locate (affinity) or spread apart (anti-affinity) Pods relative to other Pods by label and topology key.	04-scheduling/02-affinity-taints-topology.md
Taint	A node mark (`key=value:effect`) that repels Pods unless they tolerate it.	04-scheduling/02-affinity-taints-topology.md
Toleration	A Pod field allowing it to schedule onto nodes with a matching taint.	04-scheduling/02-affinity-taints-topology.md
Taint effect	`NoSchedule`, `PreferNoSchedule`, or `NoExecute` (also evicts already-running, non-tolerating Pods).	04-scheduling/02-affinity-taints-topology.md
Topology spread constraints	Rules that even out Pods across a topology domain (zone/node) within a `maxSkew`.	04-scheduling/02-affinity-taints-topology.md
PriorityClass	A named priority value for Pods; higher-priority pending Pods can preempt lower-priority ones.	04-scheduling/03-priority-and-preemption.md
Preemption	The scheduler evicting lower-priority Pods to make room for a pending higher-priority Pod that otherwise can't fit.	04-scheduling/03-priority-and-preemption.md
Eviction	Removal of a running Pod — by the kubelet under node pressure, by preemption, or via the Eviction API (respects PDBs).	06-production-readiness/05-reliability-and-disruptions.md
Binding	The act of assigning a Pod to a node (writing `spec.nodeName`), which the kubelet then actuates.	04-scheduling/01-scheduler-and-nodes.md
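
The filter-and-score entry above can be sketched in a few lines of Python. This is a simplified illustration, not the real scheduler: the `schedule` function and the node and Pod dict shapes are hypothetical, every taint is treated as `NoSchedule`, tolerations match by exact string, and the score simply prefers the node with the most free capacity left after placement.

```python
def schedule(pod, nodes):
    """Pick a node name for the pod, or None if no node is feasible."""
    tolerations = pod.get("tolerations", [])

    # Filter: drop nodes that cannot run the pod at all.
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"]
        and n["free_mem"] >= pod["mem"]
        and all(t in tolerations for t in n.get("taints", []))
    ]
    if not feasible:
        return None   # the Pod would stay Pending

    # Score: fraction of CPU and memory still free after placing the pod.
    def score(n):
        return ((n["free_cpu"] - pod["cpu"]) / n["cpu"]
                + (n["free_mem"] - pod["mem"]) / n["mem"])

    # Bind: the highest-scoring node wins.
    return max(feasible, key=score)["name"]


nodes = [
    {"name": "node-a", "cpu": 4000, "mem": 8192, "free_cpu": 500, "free_mem": 4096},
    {"name": "node-b", "cpu": 4000, "mem": 8192, "free_cpu": 3000, "free_mem": 6144},
    {"name": "node-c", "cpu": 4000, "mem": 8192, "free_cpu": 4000, "free_mem": 8192,
     "taints": ["dedicated=gpu:NoSchedule"]},
]
pod = {"cpu": 1000, "mem": 1024}   # millicores and MiB

print(schedule(pod, nodes))
print(schedule({**pod, "tolerations": ["dedicated=gpu:NoSchedule"]}, nodes))
```

Without a toleration the Pod lands on `node-b`, the only node that both fits and is untainted; with the toleration, the emptier `node-c` becomes feasible and scores higher.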

Security

Term	Definition	Covered in
Authentication (authN)	Proving identity to the apiserver via a trusted authenticator (client cert, SA token, OIDC); Kubernetes stores no user records.	05-security/01-authn-authz-rbac.md
Authorization (authZ)	Deciding if an authenticated identity may perform a request; authorizers (Node, RBAC, …) are OR'd.	05-security/01-authn-authz-rbac.md
Admission control	Post-authZ plugins that mutate then validate a request before it is persisted (PSA, ResourceQuota, webhooks).	05-security/01-authn-authz-rbac.md
RBAC (Role-Based Access Control)	Roles/ClusterRoles grant verbs on resources; RoleBindings/ClusterRoleBindings bind them to subjects. Purely additive (no deny).	05-security/01-authn-authz-rbac.md
Role / ClusterRole	A namespaced (Role) or cluster-scoped/reusable (ClusterRole) set of permission rules.	05-security/01-authn-authz-rbac.md
RoleBinding / ClusterRoleBinding	Binds a (Cluster)Role to subjects within one namespace, or cluster-wide, respectively.	05-security/01-authn-authz-rbac.md
ServiceAccount (SA)	The in-cluster identity for workloads; every Pod runs as one (use a dedicated SA, never `default`).	05-security/01-authn-authz-rbac.md
Bound / projected SA token	A short-lived, audience-scoped JWT bound to the Pod, issued via the TokenRequest API and auto-rotated (replaces legacy forever-tokens).	05-security/01-authn-authz-rbac.md
`automountServiceAccountToken: false`	Stops the SA token from being mounted into a Pod that never calls the API (least privilege).	05-security/01-authn-authz-rbac.md
`system:masters`	A super-group that bypasses RBAC entirely (unconditional cluster-admin); the kubeadm admin cert carries it — treat as break-glass.	05-security/01-authn-authz-rbac.md
`auth can-i` / SubjectAccessReview	The API (and CLI) that asks the real authorizer whether an identity may do something — the authoritative audit tool.	05-security/01-authn-authz-rbac.md
Impersonation (`--as`)	An RBAC-gated power (`impersonate` verb) to act as another user/group/SA, used to test policy.	05-security/01-authn-authz-rbac.md
OIDC	OpenID Connect: how humans authenticate via an external identity provider; claims map to username/groups. (Also see 13 ch.04 for the Keycloak code+PKCE flow + JWKS + Istio JWT validation on the v2 platform; 10 ch.03 for IRSA's OIDC-federation form.)	05-security/01-authn-authz-rbac.md
`securityContext`	Pod/container fields that drop privileges: `runAsNonRoot`, `runAsUser`, capabilities, `readOnlyRootFilesystem`, seccomp, …	05-security/02-pod-security.md
Linux capabilities	Fine-grained slices of root's power; hardened Pods `drop: ["ALL"]` and add back only what's proven necessary.	05-security/02-pod-security.md
`allowPrivilegeEscalation: false`	Sets `no_new_privs` so a child process can't gain more privilege than its parent (neutralizes setuid).	05-security/02-pod-security.md
`privileged` container	A container with nearly all capabilities and device access (≈ root on the node); forbidden by `baseline`/`restricted`.	05-security/02-pod-security.md
`readOnlyRootFilesystem`	Mounts the container root FS read-only (write only via explicit volumes); strong breakout mitigation, not a PSA requirement.	05-security/02-pod-security.md
seccomp / `RuntimeDefault`	A syscall filter; `restricted` requires `seccompProfile.type` set to `RuntimeDefault` (or `Localhost`), never `Unconfined`.	05-security/02-pod-security.md
AppArmor	A Linux Security Module confining a process to a profile; now a first-class field (`appArmorProfile`, added in v1.30; AppArmor support GA in v1.31).	05-security/02-pod-security.md
SELinux	The RHEL-family LSM labeling processes/files; configured via `securityContext.seLinuxOptions`.	05-security/02-pod-security.md
Pod Security Admission (PSA)	The built-in, non-mutating validating admission controller enforcing a Pod Security Standard via namespace labels.	05-security/02-pod-security.md
Pod Security Standards	The three fixed policy levels: `privileged`, `baseline`, `restricted`.	05-security/02-pod-security.md
PSA modes / `-version`	`enforce` (reject), `audit` (log), `warn` (client warning); the `-version` label pins the ruleset to a Kubernetes minor.	05-security/02-pod-security.md
PodSecurityPolicy (PSP)	The removed (v1.25) predecessor to PSA; any guide still recommending it is describing a dead API.	05-security/02-pod-security.md
Supply chain security	Trusting what you run: signed images, SBOMs, vulnerability scanning, and admission policy gating untrusted artifacts.	05-security/03-supply-chain.md
Trivy	An open-source scanner for image/filesystem/IaC vulnerabilities and misconfigurations, run in CI.	05-security/03-supply-chain.md
Cosign	A Sigstore tool to sign and verify container images (and other OCI artifacts), enabling signature-based admission.	05-security/03-supply-chain.md
SBOM (Software Bill of Materials)	A machine-readable inventory of an artifact's components/dependencies (SPDX/CycloneDX) for provenance and CVE triage.	05-security/03-supply-chain.md
Kyverno	A Kubernetes-native policy engine (validate/mutate/generate) used as a validating admission webhook.	05-security/03-supply-chain.md
Admission webhook (mutating/validating)	An external HTTP callback the apiserver invokes during admission to mutate or validate objects.	05-security/01-authn-authz-rbac.md
ValidatingAdmissionPolicy	In-tree, CEL-based validating admission policy — a webhook-free alternative for many policy checks.	05-security/03-supply-chain.md
Audit logging	The apiserver's structured record of every request (who/what/verdict); how RBAC/PSA decisions are reviewed and alarmed on.	05-security/04-secrets-and-cluster-hardening.md
CIS Benchmark	A consensus security configuration baseline for Kubernetes used to harden the cluster.	05-security/04-secrets-and-cluster-hardening.md
Cluster hardening	Reducing cluster attack surface: encryption at rest, audit, restricted RBAC, network policy, no `system:masters` sprawl.	05-security/04-secrets-and-cluster-hardening.md
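
The RBAC entry above describes the model as purely additive. A minimal Python sketch of that evaluation follows. The `allowed` function is hypothetical; the rule keys mirror the `apiGroups`, `resources`, and `verbs` fields of a Role rule, while `resourceNames`, subresources, and non-resource URLs are left out for brevity.

```python
def allowed(rules, api_group, resource, verb):
    """True if any rule grants the verb on the resource. There is no deny rule."""
    def matches(values, wanted):
        return "*" in values or wanted in values

    return any(
        matches(rule["apiGroups"], api_group)
        and matches(rule["resources"], resource)
        and matches(rule["verbs"], verb)
        for rule in rules
    )


# The union of all rules bound to a subject, across every binding.
rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["*"]},
]

print(allowed(rules, "", "pods", "list"))
print(allowed(rules, "", "pods", "delete"))
print(allowed(rules, "apps", "deployments", "delete"))
```

Since a request is permitted as soon as one rule matches, the only way to take a permission away is to remove the rule or binding that grants it.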

Production readiness — observability, scaling, reliability

Term	Definition	Covered in
Observability	The ability to understand system state from its outputs — the three signals: metrics, logs, traces.	06-production-readiness/01-observability-metrics.md
Prometheus	A pull-based time-series metrics system that scrapes `/metrics` endpoints and stores samples for querying/alerting.	06-production-readiness/01-observability-metrics.md
PromQL	Prometheus's query language for selecting and aggregating time series (rates, quantiles, alerts).	06-production-readiness/01-observability-metrics.md
ServiceMonitor	A Prometheus Operator CRD declaring which Services/endpoints Prometheus should scrape.	06-production-readiness/01-observability-metrics.md
PrometheusRule	A Prometheus Operator CRD defining recording and alerting rules.	06-production-readiness/01-observability-metrics.md
Prometheus Operator	The operator that manages Prometheus/Alertmanager and consumes ServiceMonitor/PrometheusRule CRDs.	06-production-readiness/01-observability-metrics.md
kube-prometheus-stack	The Helm chart bundling Prometheus Operator, Prometheus, Alertmanager, and Grafana.	06-production-readiness/01-observability-metrics.md
Grafana	A dashboarding/visualization tool that queries Prometheus (and other sources).	06-production-readiness/01-observability-metrics.md
metrics-server	A lightweight cluster aggregator of CPU/memory for `kubectl top` and the HPA's resource metrics.	06-production-readiness/04-autoscaling.md
Logging architecture	Containers write to stdout/stderr; a node DaemonSet ships logs to a backend (no app-side log files).	06-production-readiness/02-logging.md
Loki	A horizontally-scalable log aggregation system that indexes labels, queried with LogQL.	06-production-readiness/02-logging.md
Tracing / distributed tracing	Following one request across services as a trace of timed spans, to find latency and failures.	06-production-readiness/03-tracing.md
OpenTelemetry (OTel)	The vendor-neutral standard/SDKs and Collector for emitting and exporting traces/metrics/logs.	06-production-readiness/03-tracing.md
Span / trace context	A span is one timed operation; trace context (e.g. W3C `traceparent`) is propagated so spans join into one trace.	06-production-readiness/03-tracing.md
Autoscaling	Automatically adjusting capacity to demand: pod count (HPA/KEDA), pod size (VPA), or node count (Cluster Autoscaler).	06-production-readiness/04-autoscaling.md
HPA (HorizontalPodAutoscaler)	Scales a workload's replica count on observed metrics (CPU/memory or custom); `autoscaling/v2`.	06-production-readiness/04-autoscaling.md
VPA (VerticalPodAutoscaler)	Recommends/sets right-sized requests/limits for a workload's containers.	06-production-readiness/04-autoscaling.md
KEDA	Event-driven autoscaling via a `ScaledObject` on external sources (queue depth, etc.). KEDA creates & manages an HPA for scaling above zero, and takes direct control for scale-to-zero (below the HPA minimum of 1).	06-production-readiness/04-autoscaling.md
`ScaledObject` / `TriggerAuthentication`	KEDA CRDs: the scaling target+triggers, and the credentials a trigger uses.	06-production-readiness/04-autoscaling.md
Cluster Autoscaler / Karpenter	Node-level autoscalers that add/remove nodes when Pods are unschedulable or nodes are underutilized.	06-production-readiness/06-capacity-and-cost.md
PodDisruptionBudget (PDB)	A floor on available replicas during voluntary disruptions (drains/upgrades); the Eviction API respects it.	06-production-readiness/05-reliability-and-disruptions.md
Voluntary vs involuntary disruption	Operator-initiated (drain, rollout — PDB applies) vs unavoidable (node crash, OOM — PDB does not apply).	06-production-readiness/05-reliability-and-disruptions.md
SLI / SLO / error budget	A measured indicator, a target for it, and the allowed amount of failure before you stop shipping risky changes.	06-production-readiness/05-reliability-and-disruptions.md
Capacity planning	Sizing requests/limits and node pools so workloads fit with headroom, balancing reliability against cost.	06-production-readiness/06-capacity-and-cost.md
Cost allocation / OpenCost	Attributing cluster spend to namespaces/workloads via labels; OpenCost is the CNCF cost model.	06-production-readiness/06-capacity-and-cost.md
Bin packing	Scheduling Pods densely onto fewer nodes (via requests) to reduce idle, traded against blast radius.	06-production-readiness/06-capacity-and-cost.md
Spot / preemptible nodes	Cheap, reclaimable cloud capacity used for fault-tolerant workloads (paired with PDBs and disruption handling).	06-production-readiness/06-capacity-and-cost.md
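
The error-budget entry above reduces to simple arithmetic. As a minimal sketch, assuming an availability SLO expressed as a fraction and a 30-day window (the function names are hypothetical, not from the guide):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of failure.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 bad minutes, three quarters of the budget is left.
    print(round(budget_remaining(0.999, 10.8), 2))
```

A team would typically pause risky changes once `budget_remaining` approaches zero; the exact threshold is a policy choice, not something this sketch decides.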

Delivery — packaging, CI/CD, GitOps¶

Term	Definition	Covered in
Helm	A package manager for Kubernetes: a templated, versioned chart is rendered with values and installed as a tracked release.	07-delivery/01-packaging-helm.md
Chart	A directory of templated manifests + `values.yaml` + `Chart.yaml` (metadata/version); inert until rendered.	07-delivery/01-packaging-helm.md
Values / `values.schema.json`	A chart's documented tuning surface (precedence: `--set` > `-f` > defaults), optionally JSON-Schema validated.	07-delivery/01-packaging-helm.md
Release / revision	A named install of a chart, stored as an in-cluster Secret; each `helm upgrade` is a new revision (`helm rollback`).	07-delivery/01-packaging-helm.md
Helm hook	A chart object annotated to run at a lifecycle event (e.g. `post-install,post-upgrade` for the DB-migrate Job).	07-delivery/01-packaging-helm.md
`_helpers.tpl` / named template	Reusable template snippets `include`d across a chart so repeated YAML (labels, security context, DSN) can't drift.	07-delivery/01-packaging-helm.md
Library vs application chart	`type: application` installs into a cluster; `type: library` ships only helpers for other charts to `include`.	07-delivery/01-packaging-helm.md
Tiller	The removed Helm 2 in-cluster server (a cluster-admin backdoor); Helm 3 is client-only.	07-delivery/01-packaging-helm.md
Kustomize	Template-free customization: a base of plain manifests plus typed, declarative overlays/patches; built into `kubectl`.	07-delivery/02-packaging-kustomize.md
Base / overlay	The deployable app (base) and a small per-environment diff (overlay) that includes it and layers transformations.	07-delivery/02-packaging-kustomize.md
Component (Kustomize)	A reusable, optional `kind: Component` mix-in an overlay opts into (the analog of a Helm value toggle).	07-delivery/02-packaging-kustomize.md
Strategic-merge vs JSON6902 patch	A field-aware partial-object merge vs a precise RFC-6902 `op/path/value`; both under the unified `patches:`.	07-delivery/02-packaging-kustomize.md
`commonLabels` footgun	Adding labels via `commonLabels` mutates immutable `spec.selector` and wedges upgrades; use `labels:` with `includeSelectors: false`.	07-delivery/02-packaging-kustomize.md
configMap/secretGenerator	Kustomize generators that synthesize a ConfigMap/Secret and append a content-hash suffix to trigger rollouts on change.	07-delivery/02-packaging-kustomize.md
CI/CD pipeline	Automated build → test → scan → image push → manifest update; the path from commit to a deployable artifact.	07-delivery/03-cicd-pipeline.md
In-cluster build (Kaniko/BuildKit)	Building container images inside the cluster without a Docker daemon (a CI alternative, note-only in this guide).	07-delivery/03-cicd-pipeline.md
GitOps	Git is the single source of truth; a controller continuously reconciles the cluster to the repo and auto-corrects drift.	07-delivery/04-gitops-argocd.md
Argo CD	A GitOps controller that renders manifests/charts/kustomizations from Git and reconciles them into clusters.	07-delivery/04-gitops-argocd.md
Argo CD `Application`	The CRD binding a Git source (repo/path/revision) to a destination (cluster/namespace) with a sync policy.	07-delivery/04-gitops-argocd.md
Argo CD `AppProject`	A CRD constraining which repos/clusters/namespaces/kinds a group of Applications may use (multi-tenancy guardrails).	07-delivery/04-gitops-argocd.md
App-of-Apps / ApplicationSet	Patterns to manage many Applications from one (App-of-Apps) or generate them from a generator (ApplicationSet).	07-delivery/04-gitops-argocd.md
Sync wave / hook (Argo)	Ordering primitives that stage a sync (e.g. run the DB-migrate Job before the app), analogous to Helm hooks.	07-delivery/04-gitops-argocd.md
Drift detection	A GitOps controller noticing live state diverging from Git and reporting/reverting it.	07-delivery/04-gitops-argocd.md
Progressive delivery	Automated, metric-gated rollout (canary/blue-green) that promotes or rolls back based on analysis.	07-delivery/05-progressive-delivery.md
Argo Rollouts / `Rollout`	A controller and CRD replacing Deployment to drive canary/blue-green steps with analysis gates.	07-delivery/05-progressive-delivery.md
AnalysisTemplate / AnalysisRun	Argo Rollouts CRDs defining the success metrics queried during a rollout step and one execution of them.	07-delivery/05-progressive-delivery.md
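
The values precedence noted in the Helm entries above (`--set` > `-f` > defaults) can be pictured as a layered merge. This is a simplified sketch, not Helm's implementation: it assumes nested maps merge key by key and everything else (including lists) is replaced wholesale, and it ignores Helm's `null`-deletes-key behavior. The chart values shown are hypothetical.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced, not combined.
            merged[key] = deepcopy(value)
    return merged


def effective_values(chart_defaults: dict, values_files: list, set_overrides: dict) -> dict:
    """Layer values from lowest to highest precedence: defaults, each -f file in order, then --set."""
    result = deepcopy(chart_defaults)
    for layer in values_files:
        result = deep_merge(result, layer)
    return deep_merge(result, set_overrides)


if __name__ == "__main__":
    defaults = {"replicaCount": 1, "image": {"repository": "bookstore/api", "tag": "1.0.0"}}
    prod_file = {"replicaCount": 3, "image": {"tag": "1.2.0"}}   # stands in for -f values-prod.yaml
    set_flags = {"image": {"tag": "1.2.1"}}                      # stands in for --set image.tag=1.2.1
    print(effective_values(defaults, [prod_file], set_flags))
```

The `--set` layer wins for `image.tag`, the file wins for `replicaCount`, and the untouched `image.repository` falls through from the chart defaults.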

Day-2 operations¶

Term	Definition	Covered in
Cluster lifecycle	Provisioning, upgrading, and decommissioning clusters and nodes over time (managed service, kubeadm, or Cluster API).	08-day-2-operations/01-cluster-lifecycle.md
kubeadm	The upstream tool that bootstraps a conformant control plane and joins nodes.	08-day-2-operations/01-cluster-lifecycle.md
Cluster API (CAPI)	A Kubernetes-style declarative API for provisioning and managing clusters as objects.	08-day-2-operations/01-cluster-lifecycle.md
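Version skew	The supported version differences between components: `kubectl` within ±1 minor of the apiserver; kubelets never newer than the apiserver and at most a few minors older (three since v1.28). Running outside that window is unsupported and causes odd errors.	08-day-2-operations/01-cluster-lifecycle.md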
cordon / drain / uncordon	Mark a node unschedulable, evict its Pods respecting PDBs, then return it to rotation — the safe node-maintenance sequence.	08-day-2-operations/01-cluster-lifecycle.md
Backup & disaster recovery (DR)	Capturing etcd and persistent data and the rehearsed procedure to restore service after loss.	08-day-2-operations/02-backup-and-dr.md
etcd snapshot	A point-in-time backup of etcd (`etcdctl snapshot save`); backing up etcd is backing up cluster state.	08-day-2-operations/02-backup-and-dr.md
Velero	A tool that backs up/restores cluster objects and PV data (snapshots) and supports migration/DR.	08-day-2-operations/02-backup-and-dr.md
RPO / RTO	Recovery Point Objective (max tolerable data loss) and Recovery Time Objective (max tolerable downtime).	08-day-2-operations/02-backup-and-dr.md
Troubleshooting method	The fixed pipeline: observe → isolate → hypothesize → test → fix → verify; `describe`→events→logs→`kubectl debug`.	08-day-2-operations/03-troubleshooting-playbook.md
`kubectl debug` / ephemeral container	Inject a tooling container into a running Pod; it shares the Pod's network namespace, and a target container's PID namespace when `--target` is set. The correct way to debug distroless Pods.	08-day-2-operations/03-troubleshooting-playbook.md
`kubectl debug --profile`	Shapes the debug container's security (`restricted`/`general`/`sysadmin`/`netadmin`) so PSA admits it (GA in v1.30).	08-day-2-operations/03-troubleshooting-playbook.md
CrashLoopBackOff	A container that starts then exits repeatedly, with exponential backoff; diagnose with `logs --previous`.	08-day-2-operations/03-troubleshooting-playbook.md
ImagePullBackOff	The kubelet cannot pull the image (bad ref/tag, missing pull secret, not loaded) and is backing off.	08-day-2-operations/03-troubleshooting-playbook.md
CreateContainerConfigError	A referenced ConfigMap/Secret key is missing, so the container spec can't be materialized.	08-day-2-operations/03-troubleshooting-playbook.md
Multi-tenancy	Sharing a cluster across teams/apps with isolation via namespaces, RBAC, quotas, and NetworkPolicies.	08-day-2-operations/04-multi-tenancy-and-namespaces.md
CRD (CustomResourceDefinition)	An extension registering a new resource type with the apiserver so custom objects are served like built-ins.	08-day-2-operations/05-operators-and-crds.md
Custom Resource (CR)	An instance of a CRD-defined kind, stored and watched like any object.	08-day-2-operations/05-operators-and-crds.md
Operator	A custom controller encoding operational knowledge for an app (e.g. a database), reconciling its CRs.	08-day-2-operations/05-operators-and-crds.md
Operator pattern	CRD (the API) + controller (the reconcile loop) = automated day-2 operations for stateful/complex software.	08-day-2-operations/05-operators-and-crds.md
Reconcile loop (controller-runtime)	The operator's `Reconcile(req)` function: read the CR, observe reality, act to converge, requeue.	08-day-2-operations/05-operators-and-crds.md
CloudNativePG (CNPG)	A PostgreSQL operator managing HA Postgres clusters (failover, backups) via its `Cluster` CRD.	08-day-2-operations/05-operators-and-crds.md
finalizer	A `metadata.finalizers` entry that blocks deletion until a controller does cleanup, then removes the finalizer.	08-day-2-operations/05-operators-and-crds.md
ownerReference / garbage collection	A child's link to its parent; deleting the parent cascades GC of children (e.g. ReplicaSet → Pods).	08-day-2-operations/05-operators-and-crds.md
Structural schema / OpenAPI validation	The CRD's required OpenAPI v3 schema that the apiserver uses to validate and prune custom resources.	08-day-2-operations/05-operators-and-crds.md
Conversion webhook	A webhook that converts a CRD's custom resources between served API versions.	08-day-2-operations/05-operators-and-crds.md
API deprecation policy	The rule that GA APIs are supported for a defined window and removed only after deprecation; pin versions and migrate deliberately.	appendix/C-yaml-and-api-conventions.md
alpha / beta / stable (GA)	API maturity levels: alpha (off by default, may change), beta (well tested but may still change; new beta APIs are off by default since v1.24), stable (long-term support).	appendix/C-yaml-and-api-conventions.md
`app.kubernetes.io/*` labels	The recommended common label set (`name`, `instance`, `version`, `component`, `part-of`, `managed-by`) for consistent selection/cost/ops.	00-foundations/06-declarative-api-model.md
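
The reconcile-loop and finalizer entries above describe a flow that is easier to see in motion. The following is a toy model only: it uses plain dicts instead of a real client, `FINALIZER` is a hypothetical name, and the `external` set stands in for whatever resource a real operator would manage.

```python
FINALIZER = "example.com/cleanup"  # hypothetical finalizer name


def reconcile(obj: dict, external: set) -> str:
    """One reconcile pass over a toy object; returns a short description of the action taken."""
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])
    name = meta["name"]

    if meta.get("deletionTimestamp"):
        # Deletion was requested: clean up first, then release the object.
        if FINALIZER in finalizers:
            external.discard(name)
            finalizers.remove(FINALIZER)
            return "cleaned up and removed finalizer"
        return "nothing to do"

    if FINALIZER not in finalizers:
        # Register the finalizer before creating anything that needs cleanup.
        finalizers.append(FINALIZER)
        return "added finalizer"

    if name not in external:
        external.add(name)
        return "created external resource"
    return "in sync"


if __name__ == "__main__":
    cr = {"metadata": {"name": "db-1"}}
    world = set()
    for _ in range(3):
        print(reconcile(cr, world))
    cr["metadata"]["deletionTimestamp"] = "2024-01-01T00:00:00Z"  # simulates `kubectl delete`
    print(reconcile(cr, world))
```

Each pass does at most one thing and is safe to repeat, which is the property that lets a real controller requeue freely. Once the finalizer list is empty, the apiserver is free to remove the object.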

Part 10 — Cloud & Managed Kubernetes¶

Term	Definition	Covered in
Managed Kubernetes	A cloud-vendor offering (EKS / GKE / AKS / DOKS / OKE / Linode LKE) where the control plane (apiserver + etcd + scheduler + controllers) is run, patched, scaled, backed up, and SLA'd by the provider; the customer owns the data plane (nodes, workloads, in-cluster security, networking wiring, IAM mapping).	10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md
Shared responsibility model	The dividing line between provider and customer on a managed cluster: provider = control plane and its SLA; customer = nodes, Pods, RBAC, app SLOs, and any cloud IAM glue.	10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md
EKS / GKE / AKS	The three large managed Kubernetes services: Amazon Elastic Kubernetes Service, Google Kubernetes Engine, and Azure Kubernetes Service.	10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md
Control-plane SLA	The provider's uptime guarantee for the apiserver / etcd (e.g. 99.95% on a regional EKS/GKE/AKS); your application SLO is separate and your responsibility.	10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md
Control-plane vs node-pool upgrade	Two separate upgrade operations on managed clusters: the provider upgrades the control plane (you trigger / it auto-upgrades), and you separately roll your node pools — each respects version skew.	10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md
IRSA (IAM Roles for Service Accounts)	AWS's pod-identity mechanism: a ServiceAccount is annotated with an IAM role ARN; the SA's projected OIDC token is exchanged with STS for short-lived AWS credentials. Solves "no static AWS keys in Pods".	10-cloud-and-managed-kubernetes/03-cloud-identity.md
EKS Pod Identity	AWS's newer alternative to IRSA: the role-to-ServiceAccount association is created through the EKS API and credentials are served by an agent DaemonSet, so no IAM OIDC provider or role-ARN annotation is needed.	10-cloud-and-managed-kubernetes/03-cloud-identity.md
Workload Identity (GCP)	GKE's pod-identity mechanism: a Kubernetes SA is bound to a Google Service Account; pods exchange their projected SA token for a GSA token via the GKE metadata server.	10-cloud-and-managed-kubernetes/03-cloud-identity.md
Azure AD Workload Identity	Azure's pod-identity mechanism: a Kubernetes SA's federated credential is mapped to an Azure AD application/managed identity, exchanged via MSAL for an Azure token.	10-cloud-and-managed-kubernetes/03-cloud-identity.md
OIDC issuer URL	The cluster-published URL that identifies the issuer of ServiceAccount JWTs and serves the OIDC discovery document and JWKS (the apiserver does the signing); cloud IAM is configured to trust that issuer to enable pod-identity federation.	10-cloud-and-managed-kubernetes/03-cloud-identity.md
Pod identity federation	The general pattern: the cluster's OIDC-signed projected SA token is exchanged with a cloud STS for short-lived, scoped cloud credentials — no static keys ever stored.	10-cloud-and-managed-kubernetes/03-cloud-identity.md
VPC CNI (AWS)	The default EKS CNI: every Pod gets a real VPC IP assigned via ENI secondary IPs; pod IP density is bounded by per-instance ENI/IP limits unless prefix delegation is enabled.	10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md
Prefix delegation	A VPC CNI feature that assigns a `/28` prefix (16 IPs) to each ENI address slot instead of a single secondary IP, raising per-instance Pod density on EKS.	10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md
GKE CNI / GKE Dataplane V2	GKE's CNI options — the legacy "kubenet" and the Cilium-based Dataplane V2 (eBPF, kube-proxy replacement, network policies).	10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md
Azure CNI Overlay	AKS's overlay CNI: Pods get an overlay IP from a per-node CIDR (not a VNet IP), removing the VNet IP-exhaustion footgun of Azure CNI "VNet" mode.	10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md
Cloud LoadBalancer / `Service type: LoadBalancer`	A Service that provisions a real cloud LB (NLB/Classic ELB/internal LB on AWS, GCLB on GCP, Azure LB) via the cloud-controller-manager; a no-op on bare local clusters. On AWS, ALBs are provisioned from `Ingress`, not from a Service.	10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md
AWS Load Balancer Controller (LBC)	A controller that watches `Ingress` (and `Service`) objects and provisions ALBs / NLBs accordingly; the modern replacement for the older in-tree ELB integration.	10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md
Cloud CSI driver	A vendor CSI implementation that provisions/attaches cloud block or file volumes from a StorageClass (EBS / PD / Azure Disk for block-RWO; EFS / Filestore / Azure Files for file-RWX). Installed as a managed add-on or a version-pinned Helm chart.	10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md
EBS-CSI / PD-CSI / Azure Disk CSI	The cloud block-storage CSI drivers (`ebs.csi.aws.com`, `pd.csi.storage.gke.io`, `disk.csi.azure.com`) — block, RWO, zonally bound, snapshot-capable.	10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md
EFS / Filestore / Azure Files	The cloud file-storage offerings (NFS-like, RWX) used when multiple Pods on multiple nodes need to share a volume.	10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md
Cloud snapshot / `VolumeSnapshot` on cloud CSI	A CSI-orchestrated point-in-time snapshot of a cloud disk (EBS / PD / Azure Disk snapshot) realized via the `VolumeSnapshot` API and the cloud snapshotter.	10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md
Cloud secret store (AWS Secrets Manager / GCP Secret Manager / Azure Key Vault)	Provider secret services consumed in-cluster via ESO or the CSI Secrets Store driver; the source of truth lives in the cloud, not in etcd.	10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md
Cloud-managed Prometheus (AMP / GMP)	Amazon Managed Service for Prometheus and Google Managed Service for Prometheus: provider-hosted Prometheus-compatible TSDBs you remote-write to from in-cluster scrapers.	10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md
Cluster Autoscaler (CA)	The classic node-level autoscaler: tied to ASGs / MIGs / VMSS, scales up when a Pod is unschedulable, down when a node is underutilized — node groups are pre-defined.	10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md
Karpenter	A node-level autoscaler that provisions right-sized EC2 instances just-in-time directly from EC2 fleet APIs (no ASG), consolidates nodes, mixes spot + on-demand, and binds Pod requirements (architecture / taints / topology) into the launch decision.	10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md
NodePool / `EC2NodeClass` (Karpenter)	Karpenter CRDs: `NodePool` declares Pod-targeted constraints (instance types, zones, taints, limits, weight) and disruption policy; `EC2NodeClass` declares the AWS-side launch settings (AMI, subnets, security groups, node IAM role).	10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md
Consolidation (Karpenter)	The continuous right-sizing loop: Karpenter periodically replaces under-used nodes with cheaper / smaller ones if all current Pods would still fit; the structural reason Karpenter often costs less than CA.	10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md
Spot / preemptible / Spot VMs	Provider-cheap reclaimable instances (AWS Spot, GCP preemptible/Spot VMs, Azure Spot VMs); used for fault-tolerant workloads paired with PDBs and disruption handlers.	10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md
Multi-AZ / multi-region	The two HA tiers: spreading nodes across availability zones in one region (cheap, common); replicating clusters across regions (expensive, only for the highest tier) — different blast radii and SLO costs.	10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md
Terraform / OpenTofu	Declarative HCL infra-as-code: cluster + node pools + VPC defined as resources; remote state (S3/GCS + lock) is the production discipline. The guide's reference IaC for managed clusters.	10-cloud-and-managed-kubernetes/02-provisioning-and-iac.md
`eksctl` / `gcloud container clusters create` / `az aks create`	The provider CLIs that stand up a managed cluster + a managed node group in one command — fastest path, weakest provenance versus IaC.	10-cloud-and-managed-kubernetes/02-provisioning-and-iac.md
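
The VPC CNI and prefix-delegation entries above imply a per-node Pod ceiling that can be worked out by hand. This sketch follows the commonly published EKS max-pods formula; the ENI and IP counts are per-instance-type figures you would look up in the EC2 documentation, and the values used in the demo are illustrative assumptions.

```python
def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False) -> int:
    """Pod IP capacity of one node under the AWS VPC CNI."""
    if enis < 1 or ips_per_eni < 2:
        raise ValueError("need at least one ENI with two addresses")
    # The first address on each ENI is the ENI's own primary IP, so Pods cannot use it.
    slots = enis * (ips_per_eni - 1)
    # With prefix delegation each remaining slot holds a /28 prefix (16 addresses).
    pod_ips = slots * 16 if prefix_delegation else slots
    # +2 accounts for host-network Pods (aws-node and kube-proxy), which need no Pod IP.
    return pod_ips + 2


if __name__ == "__main__":
    # Assumed shape: 3 ENIs with 10 IPv4 addresses each.
    print(max_pods(3, 10))                          # secondary-IP mode
    print(max_pods(3, 10, prefix_delegation=True))  # prefix-delegation mode
```

The prefix-delegation figure is an IP ceiling only; in practice the kubelet's max-pods setting is usually capped well below it, so CPU and memory become the binding limits instead.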

Part 11 — Advanced Production Patterns¶

Term	Definition	Covered in
Operator pattern	A CRD + a controller that encodes operational knowledge for a stateful or complex application; "an SRE for that app, in code". (Introduced in Part 08 ch.05; built in Part 11 ch.02.)	11-advanced-production-patterns/02-operator-development.md
Kubebuilder	The upstream SIG scaffolding tool for Go operators built on `controller-runtime`; the guide's reference framework.	11-advanced-production-patterns/02-operator-development.md
Operator SDK	An alternative scaffolding tool (now sharing the `controller-runtime` foundation with Kubebuilder); also supports Helm/Ansible-based operators.	11-advanced-production-patterns/02-operator-development.md
`controller-runtime`	The Go library (manager, controller, client, cache, builder, finalizer helpers) underneath both Kubebuilder and Operator SDK.	11-advanced-production-patterns/02-operator-development.md
Status conditions / status subresource	The standard pattern for reporting reconcile outcome on a CR: an array of typed conditions (`type`, `status`, `reason`, `message`, `lastTransitionTime`) written via a dedicated `/status` subresource so spec/status writes don't clobber each other.	11-advanced-production-patterns/02-operator-development.md
envtest	The `controller-runtime` integration-test harness: spins up a real local `etcd` + `kube-apiserver` (no kubelet/scheduler) so reconcile logic can be tested against a real apiserver in CI.	11-advanced-production-patterns/02-operator-development.md
MutatingWebhookConfiguration / ValidatingWebhookConfiguration	The cluster-scoped built-in API objects (`admissionregistration.k8s.io/v1`) that register an admission webhook with the apiserver: which resources/verbs trigger it, which Service it calls, and its CA bundle.	11-advanced-production-patterns/01-admission-webhooks.md
`failurePolicy` / `sideEffects` / `reinvocationPolicy`	Admission-webhook safety knobs: `failurePolicy: Fail` rejects on outage, `Ignore` allows; `sideEffects: None` declares that the webhook changes nothing outside the admitted object (the recommended setting); `reinvocationPolicy: IfNeeded` allows a mutator to be re-invoked after later mutations.	11-advanced-production-patterns/01-admission-webhooks.md
APF (API Priority and Fairness)	The apiserver's built-in self-protection: incoming requests are classified by a FlowSchema into a PriorityLevelConfiguration, with shuffle-sharded fair queues — protects the apiserver from a single hot client.	11-advanced-production-patterns/03-api-priority-and-fairness.md
FlowSchema	The APF classifier: match by user / SA / group + verb / resource / namespace, with `matchingPrecedence` ordering and a `distinguisherMethod` (ByUser / ByNamespace) that picks the per-flow queue.	11-advanced-production-patterns/03-api-priority-and-fairness.md
PriorityLevelConfiguration	The APF queue level: `nominalConcurrencyShares` (relative weight of in-flight slots; `assuredConcurrencyShares` in older API versions), `queueLengthLimit`, `queues`, `handSize`, `borrowingLimitPercent`; together they define the queueing/serving behavior for matched flows.	11-advanced-production-patterns/03-api-priority-and-fairness.md
Service mesh	An infrastructure layer (Istio / Linkerd / Cilium service mesh) that gives every service-to-service call mTLS-by-default identity, L7 traffic management (canary / mirror / retry / timeout / outlier detection), and uniform telemetry.	11-advanced-production-patterns/04-service-mesh.md
Istio	A mature service mesh; `Gateway` / `VirtualService` / `DestinationRule` + (newer) Gateway-API `HTTPRoute`, with two data-plane modes: sidecar (Envoy per Pod) and ambient.	11-advanced-production-patterns/04-service-mesh.md
Ambient mode (Istio)	A sidecar-less Istio data plane: an L4 `ztunnel` DaemonSet on every node does mTLS; an optional L7 waypoint Deployment per namespace handles HTTP-level policy — only the Pods that need L7 pay the L7 cost.	11-advanced-production-patterns/04-service-mesh.md
Waypoint proxy	The L7-policy proxy in Istio ambient: a Deployment per namespace (or per SA) that traffic for that scope routes through when L7 features are needed.	11-advanced-production-patterns/04-service-mesh.md
Linkerd	A lighter-weight CNCF service mesh: Rust micro-proxy sidecar, mTLS by default, simpler control plane than Istio.	11-advanced-production-patterns/04-service-mesh.md
SPIFFE / SPIRE	A cloud-native standard for workload identity: SPIFFE defines the SVID (an X.509 cert or JWT bound to a workload), SPIRE is the reference issuer; the substrate for mTLS between services across clusters/providers.	11-advanced-production-patterns/04-service-mesh.md
mTLS (mutual TLS)	Both peers present and verify certificates; the service-mesh default. In Istio, a `PeerAuthentication` with `mtls.mode: STRICT` in the root namespace enforces it mesh-wide.	11-advanced-production-patterns/04-service-mesh.md
External Secrets Operator (ESO)	A controller that pulls secrets from an external store (Vault / AWS Secrets Manager / GCP Secret Manager / Azure Key Vault / 1Password / …) into Kubernetes `Secret`s on a schedule; defined by `SecretStore` / `ClusterSecretStore` + `ExternalSecret`.	11-advanced-production-patterns/05-secrets-at-scale.md
`SecretStore` / `ClusterSecretStore`	ESO CRDs: the connection + auth to an external secret provider (namespaced or cluster-scoped).	11-advanced-production-patterns/05-secrets-at-scale.md
`ExternalSecret`	An ESO CRD that declares "produce a `Secret` named X with these keys mapped from this `SecretStore`"; refreshed on `refreshInterval`.	11-advanced-production-patterns/05-secrets-at-scale.md
Vault (HashiCorp)	A widely-used external secret store with dynamic-secret backends (DB credentials minted on demand), Transit (encryption-as-a-service), PKI, and a Kubernetes auth method that exchanges a projected SA token for a Vault token.	11-advanced-production-patterns/05-secrets-at-scale.md
Vault Agent Injector	A Vault mutating webhook that injects an init/sidecar Vault Agent into a Pod to render secrets to a file (instead of into a Kubernetes `Secret`); an alternative path to ESO.	11-advanced-production-patterns/05-secrets-at-scale.md
CSI Secrets Store driver	A CSI driver (`secrets-store.csi.x-k8s.io`) that mounts secrets from external stores directly into a Pod as a volume (no `Secret` materialized), with optional `secretObjects` to sync into a real Secret.	11-advanced-production-patterns/05-secrets-at-scale.md
Dynamic secret	A short-lived credential minted on request (e.g. a Vault DB-engine username/password valid for 1h); contrast with a long-lived static secret in etcd.	11-advanced-production-patterns/05-secrets-at-scale.md
Multi-cluster / fleet	Running more than one cluster as a coordinated set, for blast-radius / region / hard-tenancy / regulatory reasons; topologies include per-env, per-region, per-tenant, and hub-and-spoke.	11-advanced-production-patterns/06-multi-cluster-and-fleet.md
Argo CD `ApplicationSet`	A template + generator (list / cluster / git / matrix / merge / scm / pull-request) that produces N `Application`s from one declaration — the GitOps multi-cluster primitive.	11-advanced-production-patterns/06-multi-cluster-and-fleet.md
Karmada	A CNCF multi-cluster orchestrator: workloads declared once at the Karmada control plane and propagated to member clusters via `PropagationPolicy` / `OverridePolicy`.	11-advanced-production-patterns/06-multi-cluster-and-fleet.md
Cluster API (CAPI)	A Kubernetes-style declarative API for provisioning and managing whole clusters as objects, with provider implementations (CAPA / CAPG / CAPZ / CAPV / …).	11-advanced-production-patterns/06-multi-cluster-and-fleet.md
Hub-and-spoke vs leader-and-followers	Two multi-cluster topologies: hub-and-spoke = one management cluster pushing to many workload clusters (Argo CD ApplicationSet, Karmada); leader-and-followers = symmetric clusters with one elected leader (Submariner, KubeFed-style).	11-advanced-production-patterns/06-multi-cluster-and-fleet.md
Chaos engineering	Disciplined experimentation that defines a steady-state hypothesis, bounds the blast radius, injects a failure, observes, and learns; not random breakage.	11-advanced-production-patterns/07-chaos-engineering.md
Chaos Mesh	A CNCF chaos-engineering platform with rich experiment CRDs: `PodChaos` (kill/failure/container-kill), `NetworkChaos` (latency/loss/partition), `StressChaos` (CPU/memory), `IOChaos`, `TimeChaos`, and a `Workflow` to chain them. (Also see 13 ch.12 for the quarterly chaos game-day discipline on the v2 platform.)	11-advanced-production-patterns/07-chaos-engineering.md
Litmus	Another CNCF chaos-engineering platform (ChaosExperiment / ChaosEngine / ChaosResult); a sibling option to Chaos Mesh in the same space.	11-advanced-production-patterns/07-chaos-engineering.md
Steady-state hypothesis	The pre-experiment statement of what "normal" looks like (latency / error rate / throughput); the experiment is judged by whether reality stays within it.	11-advanced-production-patterns/07-chaos-engineering.md
Blast radius	The bounded scope of a chaos experiment (one Pod / one namespace / one zone) — the discipline that distinguishes engineering from breakage.	11-advanced-production-patterns/07-chaos-engineering.md
HA control plane	The configuration where the apiserver and etcd are replicated (typically across 3 nodes / 3 zones) so a single-node failure does not lose the cluster; the deployment topology behind every managed cluster's control-plane SLA.	11-advanced-production-patterns/08-ha-control-plane-and-etcd.md
Stacked vs external etcd	Two HA topologies: stacked = etcd colocated on the control-plane nodes (default kubeadm HA); external = a dedicated etcd cluster on its own VMs (more isolation, more nodes to operate).	11-advanced-production-patterns/08-ha-control-plane-and-etcd.md
etcd raft quorum	The majority needed to commit a write to etcd: ⌈(N+1)/2⌉ — 3 nodes tolerate 1 loss, 5 tolerate 2; sizing the etcd cluster picks a fault tolerance vs latency point.	11-advanced-production-patterns/08-ha-control-plane-and-etcd.md
etcd defragmentation	`etcdctl defrag` reclaims space inside an etcd member's MVCC store after key compaction; routine maintenance to keep DB size and latency bounded.	11-advanced-production-patterns/08-ha-control-plane-and-etcd.md
Watch cache	The apiserver in-memory cache of watched-resource state that serves most LIST/WATCH from memory; defended by APF and tuned for control-plane scalability.	11-advanced-production-patterns/09-performance-and-scalability.md
kube-proxy modes	iptables vs IPVS vs (CNI-replaced) eBPF: how Service VIPs become real packet steering on a node — different scalability and latency profiles.	11-advanced-production-patterns/09-performance-and-scalability.md
eBPF / Cilium dataplane	An in-kernel programmable dataplane that replaces iptables-based kube-proxy and parts of CNI plumbing with eBPF programs; supports network policies, service routing, and observability with lower per-packet overhead.	11-advanced-production-patterns/09-performance-and-scalability.md
Conntrack table	The kernel's connection-tracking table that iptables/IPVS Service routing depends on; an under-sized `nf_conntrack_max` is a classic source of dropped packets and connection timeouts under load.	11-advanced-production-patterns/09-performance-and-scalability.md
Pod-startup latency	The end-to-end time from `kubectl apply` to `Ready=true`: schedule + image pull + container start + probes; tuned via image size, pull policy, warm caches, and topology.	11-advanced-production-patterns/09-performance-and-scalability.md
Platform engineering	The discipline of building an Internal Developer Platform on top of Kubernetes so application teams get self-service + guardrails (a "paved road") without learning every primitive themselves.	11-advanced-production-patterns/10-platform-engineering.md
Internal Developer Platform (IDP)	The packaged product platform teams ship to developers: a curated set of self-service abstractions backed by Kubernetes, observability, CI/CD, secrets, and policy.	11-advanced-production-patterns/10-platform-engineering.md
Golden path / paved road	The opinionated default way to build, ship, and run a service on the platform — easy to follow, hard to escape; the productisation outcome of platform engineering.	11-advanced-production-patterns/10-platform-engineering.md
Crossplane	A control-plane-for-infra: install Providers (AWS / GCP / Azure / Helm / Kubernetes), define a CompositeResourceDefinition (XRD) for a high-level "API" your platform offers, and back it with a Composition of provider resources; users create a claim and Crossplane reconciles cloud infra. (Also see 13 ch.02 for the `BookstoreTenant` Composition used to onboard a tenant in one `kubectl apply`.)	11-advanced-production-patterns/10-platform-engineering.md
XRD / XR / claim (Crossplane)	`CompositeResourceDefinition` (the platform-team-authored API), `XR` (the cluster-scoped composite resource instance), and `claim` (the namespaced user-facing handle to it).	11-advanced-production-patterns/10-platform-engineering.md
Composition (Crossplane)	The template that maps an `XR` to a set of provider resources; the unit of platform-engineer authorship.	11-advanced-production-patterns/10-platform-engineering.md
Backstage	A Spotify-originated open-source developer portal: a software catalog + plugin framework + templates for scaffolding services; the canonical IDP UI. (Also see 13 ch.11 for Scaffolder + Software Catalog + TechDocs end-to-end on the v2 platform.)	11-advanced-production-patterns/10-platform-engineering.md
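
The etcd raft quorum entry above gives the majority as ⌈(N+1)/2⌉. As a small sketch of that arithmetic (the function names are hypothetical), the integer form `N // 2 + 1` yields the same values and makes the fault-tolerance trade-off easy to tabulate:

```python
def quorum(members: int) -> int:
    """Votes needed to commit a write in a cluster of the given size."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster still reaches quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)}")
```

Running it shows why odd sizes are preferred: going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure.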

Part 12 — Kubernetes for Machine Learning¶

Term	Definition	Covered in
GPU (in Kubernetes)	A device exposed to Pods via the device-plugin model as an extended resource (`nvidia.com/gpu`, `amd.com/gpu`); countable, not overcommittable, and requested only in whole units (the request must equal the limit).	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
Device plugin	A node-local gRPC plugin (DaemonSet) the kubelet talks to (`ListAndWatch` + `Allocate`) to discover, advertise, and assign devices like GPUs / TPUs / FPGAs as extended resources.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
NVIDIA GPU Operator	A meta-operator that installs and lifecycle-manages the full NVIDIA GPU stack on a cluster: driver, container toolkit, device plugin, DCGM exporter, MIG manager, and NFD/GFD labels.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
NFD (Node Feature Discovery)	A DaemonSet that introspects each node (CPU features, kernel, PCI devices, …) and labels it with `feature.node.kubernetes.io/*` so scheduling can target capability, not just instance type.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
GFD (GPU Feature Discovery)	NFD's GPU companion: applies `nvidia.com/gpu.product`, `gpu.memory`, MIG / driver-version labels — what training jobs actually pin against.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
DCGM (Data Center GPU Manager)	NVIDIA's GPU telemetry stack; in Kubernetes shipped as the `dcgm-exporter` DaemonSet that exposes per-GPU metrics (utilization, memory, throttling, ECC) for Prometheus.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
MIG (Multi-Instance GPU)	NVIDIA hardware partitioning (A100/H100) that splits one physical GPU into up to seven isolated instances; each appears to Kubernetes as a separate schedulable GPU with its own memory + SM slice.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
MPS (Multi-Process Service)	NVIDIA's software CUDA-context multiplexing: multiple processes share a GPU concurrently (lower isolation than MIG, less wasted GPU); device-plugin support is opt-in.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
Time-slicing (GPU)	The simplest GPU sharing — the device plugin advertises N "GPUs" per physical GPU and time-multiplexes them; no isolation, only for dev / notebook workloads.	12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md
Gang scheduling	All-or-nothing scheduling: either the whole set of N coordinated workers is admitted at once, or none — prevents the partial-placement deadlock that breaks distributed-ML jobs.	12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md
JobSet	A SIG-batch CRD for a coordinated group of Jobs (replicated jobs, startup ordering, success/failure policies); the multi-node-training primitive that gang-scheduling layers sit on.	12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md
Kueue	A Kubernetes batch-job queue manager: `Workload`s wait in `LocalQueue`s that flow into `ClusterQueue`s with quotas per `ResourceFlavor`; gates admission so big jobs run only when whole-job resources are available.	12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md
ResourceFlavor / ClusterQueue / LocalQueue / Workload	Kueue CRDs: a flavor is a labeled / tainted node class (e.g. GPU A100 spot); a ClusterQueue sets quota over flavors and borrowing; a LocalQueue is the per-namespace entry point; a Workload is the queued job.	12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md
Volcano	A batch / gang-scheduling system that grew out of kube-batch; an alternative scheduler (`schedulerName: volcano`) with `PodGroup` semantics, queues, and fair-share; popular in HPC/ML shops.	12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md
Kubeflow Training Operator	An operator providing per-framework training CRDs that fan out and coordinate workers: `PyTorchJob`, `TFJob`, `MPIJob`, `PaddleJob`, `XGBoostJob`. Standard for multi-worker training on Kubernetes.	12-kubernetes-for-machine-learning/04-distributed-training.md
PyTorchJob / TFJob / MPIJob / PaddleJob	Training-operator CRDs that declare master/worker replica counts, the framework's rendezvous wiring, and shared envs; the operator handles failure semantics + cleanup.	12-kubernetes-for-machine-learning/04-distributed-training.md
KubeRay	The Kubernetes operator for Ray: `RayCluster` (a head + worker Pods), `RayJob` (submit a job + cluster lifecycle), `RayService` (online serving). The canonical Ray-on-Kubernetes deployment.	12-kubernetes-for-machine-learning/04-distributed-training.md
RayCluster / RayJob / Ray Train	KubeRay objects + library: `RayCluster` is the long-lived head+workers; `RayJob` is a transient job + ephemeral cluster; `Ray Train` is Ray's distributed-training library (Torch/TF/XGBoost backends).	12-kubernetes-for-machine-learning/04-distributed-training.md
`torchrun` / Horovod / NCCL	Distributed-training launchers / libraries: `torchrun` is PyTorch's rendezvous + worker launcher; Horovod is a multi-framework allreduce library (dropped by many teams in favor of native DDP/FSDP); NCCL is NVIDIA's GPU-to-GPU collective-communication library (allreduce / allgather).	12-kubernetes-for-machine-learning/04-distributed-training.md
Allreduce / rendezvous	The two pillars of distributed training: rendezvous = workers discover each other and pick a master at startup; allreduce = the synchronous gradient-aggregation collective that all workers participate in each step.	12-kubernetes-for-machine-learning/04-distributed-training.md
`cleanPodPolicy`	Kubeflow Training Operator field controlling whether worker Pods are deleted after a job finishes (`All` / `Running` / `None`); useful for log retention vs scheduler turnover.	12-kubernetes-for-machine-learning/04-distributed-training.md
Elastic training	Training that tolerates workers joining / leaving mid-run (PyTorch Elastic / torch.distributed.elastic) — used on spot-heavy clusters.	12-kubernetes-for-machine-learning/04-distributed-training.md
Checkpointing	Periodically saving training state to a PV or object store so a job can resume after a worker / node loss; the "make distributed training cheap on spot" trick.	12-kubernetes-for-machine-learning/04-distributed-training.md
JupyterHub on Kubernetes (z2jh)	Zero-to-JupyterHub: the official Helm-deployed JupyterHub stack — hub + `configurable-http-proxy` + per-user singleuser server Pods spawned by `KubeSpawner`.	12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md
KubeSpawner	The JupyterHub spawner that creates per-user singleuser server Pods (+ PVC) from a profile selection; the bridge between hub auth and Kubernetes.	12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md
singleuser server / `profileList`	The per-user notebook Pod, and the hub config that offers an image + resource menu (CPU notebook / small GPU / large GPU / R / Spark) the user picks at spawn time.	12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md
Kubeflow Notebooks	The Kubeflow-native alternative to JupyterHub: `Notebook` CRD + controller that creates per-user notebook Pods inside Kubeflow's profile-scoped namespaces.	12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md
Data gravity	The principle that compute should run where the data lives because moving large datasets dominates cost / latency / egress; drives co-location of training and storage.	12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md
KServe	A Kubernetes-native model-serving platform: `InferenceService` + `ServingRuntime` deliver low-latency inference with autoscaling (Knative-serverless or raw `Deployment`), canary/shadow, transformer + explainer + predictor pipeline.	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
`InferenceService`	The user-facing KServe CRD: declares model URI, framework (sklearn/tensorflow/pytorch/triton/…), traffic split, canary, and the runtime / predictor shape.	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
`ServingRuntime`	A KServe CRD that defines a reusable per-framework runtime (container image, supported model formats, resource defaults); separates the "how to serve" from "what to serve".	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
Knative Serving (KServe serverless mode)	KServe's default mode: built on Knative Serving's autoscaler — scale-to-zero, request-load-driven scaling, revision history.	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
RawDeployment mode (KServe)	KServe's alternative mode: plain `Deployment` + HPA (no Knative), used when scale-to-zero is unwanted or Knative is not installed.	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
Predictor / transformer / explainer	The three components of a KServe `InferenceService`, each running as its own Pods: the required predictor (the model server), the optional transformer (pre/post-processing), and the optional explainer (interpretability).	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
Seldon Core	A sibling model-serving platform with a richer DAG/pipeline model (`SeldonDeployment` / Core v2 `Model`+`Pipeline`); referenced as an alternative to KServe.	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
NVIDIA Triton Inference Server	A high-performance multi-framework inference server (TensorRT / PyTorch / ONNX / Python backend, dynamic batching, ensembles) used as the `ServingRuntime` behind KServe / Seldon for GPU inference.	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
Model canary / A-B / shadow	Progressive delivery at the model layer: a small traffic fraction to the new model (canary), an explicit split between models (A-B), or a copy of traffic without using its response (shadow) — KServe's `canaryTrafficPercent` is the canonical knob.	12-kubernetes-for-machine-learning/06-model-serving-and-inference.md
Argo Workflows	A CNCF Kubernetes-native workflow engine: `Workflow` (one run), `WorkflowTemplate` / `ClusterWorkflowTemplate` (reusable definitions), `CronWorkflow` (scheduled); DAG or Steps templates, parameter + artifact passing, retries, artifact stores (PVC / S3 / GCS / Azure).	12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md
`WorkflowTemplate` / `ClusterWorkflowTemplate`	Argo Workflows CRDs holding reusable workflow definitions (namespaced or cluster-scoped) referenced from `Workflow`s via `workflowTemplateRef`.	12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md
`CronWorkflow`	A cron-scheduled Argo Workflow; the "nightly retrain at 02:00" / "hourly batch inference" primitive.	12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md
Argo Events	The event-driven side of Argo: `EventBus` + `EventSource` + `Sensor` that trigger `Workflow`s from S3 puts, Kafka, GitHub webhooks, schedules, etc. — turns pipelines into reactive systems.	12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md
Kubeflow Pipelines (KFP)	The ML-pipeline DSL + backend in Kubeflow: pipelines defined in a Python DSL, compiled to a pipeline spec, executed by an orchestrator (the v2 backend uses Argo Workflows). KFP v1 and v2 differ in SDK + IR / metadata.	12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md
Tekton Pipelines	A Kubernetes-native CI/CD pipeline engine (`Task` / `Pipeline` / `PipelineRun`); a sibling option to Argo Workflows often used for CI rather than ML pipelines (mentioned for contrast).	12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md
Katib	Kubeflow's hyperparameter-tuning operator: `Experiment` (the search), `Suggestion` (the proposed candidates from the algorithm), `Trial` (each candidate run). Supports grid / random / Bayesian / HyperBand / NAS.	12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md
MLflow	An MLOps tracking + model-registry tool: `mlflow.log_*` records params / metrics / artifacts per run; the registry promotes models through stages (None / Staging / Production / Archived) — the "where do my trained models go" answer. (Also see 13 ch.08 for the full closed loop: MLflow Registry → KServe canary → Alibi-Detect drift → Argo Events retrain.)	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
Model registry	A versioned store of trained model artifacts + metadata + lineage + stage (e.g. MLflow Registry, KServe-integrated registries) — the source-of-truth a serving system pulls from.	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
ML lineage	The traceable graph of `data version → code commit → training run → metrics → produced model → serving deployment`; captured via MLflow + KFP metadata so an incident on a prod model can be traced back to its inputs.	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
Model drift / data drift	Distributions in production diverging from training: data drift = input feature distribution moves; model drift = predictive performance degrades. Triggers retraining. (Also see 13 ch.08 for the drift-detected → Argo Events Sensor → retrain Workflow loop.)	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
Alibi-Detect / Evidently	Open-source drift / outlier detection libraries; deployed in-cluster (often as a KServe component or a sidecar) to compute drift metrics over a production traffic window.	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
MLOps maturity (L0–L3)	A four-level maturity ladder modeled on Google's MLOps levels (Google's published model defines only levels 0–2): L0 = manual notebook; L1 = ML code in CI/CD; L2 = ML pipeline in CI/CD with automated training; L3 = full continuous training + monitoring + auto-retraining.	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
OpenCost / Kubecost	Cost-allocation tooling (OpenCost = the CNCF spec + open-source implementation; Kubecost = the commercial product built around it) that attributes cluster spend (CPU / memory / GPU / PV / network) to namespaces / labels / workloads; the basis for per-tenant cost reporting. (Also see 13 ch.10 for the per-tenant + per-cluster + per-region story plus showback-vs-chargeback + FinOps maturity.)	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
Per-tenant cost (ns = team = cost center)	The cost-allocation discipline: one namespace per team / project, labels `team` / `cost-center` on every workload, OpenCost rolls up the bill — the cleanest path to chargeback / showback.	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
Feature store (Feast / Tecton)	A serving + offline store of curated features (low-latency online lookups in serving + batch joins in training) — solves train/serve skew. Mentioned in Part 12 ch.08 "what we didn't build".	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
Data versioning (DVC / LakeFS / Pachyderm)	Tools that version datasets the way Git versions code, so a training run's inputs are reproducible; mentioned in Part 12 ch.08 "what we didn't build".	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
Federated learning (Flower / FedML)	A training paradigm where models are trained on decentralized data without aggregating it (each client trains locally, only weights / gradients are shared); mentioned as a frontier in Part 12 ch.08.	12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md
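
The gang-scheduling entry in this table can be made concrete with a toy model. The sketch below is not how Kueue, Volcano, or kube-scheduler is implemented; the function names (`place_pod_by_pod`, `place_gang`) and the two-job scenario are hypothetical, chosen only to contrast independent per-Pod placement with all-or-nothing admission.

```python
from typing import Dict


def place_pod_by_pod(jobs: Dict[str, int], free_gpus: int) -> Dict[str, int]:
    """Place one worker at a time, round-robin, the way independent Pods compete.

    `jobs` maps a (hypothetical) job name to the number of workers it needs.
    Returns how many workers of each job received a GPU.
    """
    placed = {name: 0 for name in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_gpus > 0 and placed[name] < needed:
                placed[name] += 1
                free_gpus -= 1
                progress = True
    return placed


def place_gang(jobs: Dict[str, int], free_gpus: int) -> Dict[str, int]:
    """Admit each job only if ALL of its workers fit; otherwise place none."""
    placed = {}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            placed[name] = needed
            free_gpus -= needed
        else:
            placed[name] = 0  # the job waits in the queue and holds no GPUs
    return placed


if __name__ == "__main__":
    demo_jobs = {"train-a": 4, "train-b": 4}  # each job needs 4 workers
    print(place_pod_by_pod(demo_jobs, free_gpus=6))
    print(place_gang(demo_jobs, free_gpus=6))
```

With six free GPUs and two four-worker jobs, per-Pod placement leaves each job holding three GPUs and unable to start, while gang admission runs one job in full and leaves the other queued without holding anything.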

Part 13 — Grand Capstone: Bookstore Platform v2 terms¶

Term	Definition	Covered in
Bookstore Platform v2	The Part 13 production artifact: the v1 Bookstore re-shaped along seven dimensions (multi-tenant + multi-region active-active + Keycloak OIDC + IRSA + mesh JWT + CDC-driven search + Kafka outbox payments + edge WAF + closed ML loop + three-pillar OTel + OpenCost FinOps + Backstage IDP + day-2 runbook); the production reality v1 was the teaching artifact for.	13-grand-capstone-bookstore-platform/01-bookstore-2-from-toy-to-platform.md
Tenant (v2 sense)	A bookstore owner on the v2 platform — concretely a namespace + a Kueue queue + per-tenant cloud resources (S3 bucket, RDS read-replica) + a Backstage catalog entry + a Crossplane claim — onboarded in one `kubectl apply` of a `BookstoreTenant` claim.	13-grand-capstone-bookstore-platform/02-tenancy-and-crossplane-onboarding.md
`BookstoreTenant` Composition	The Crossplane Composition that backs the `BookstoreTenant` XR/claim: one declarative apply fans out into a `Namespace`, `ResourceQuota`, `LocalQueue`, Argo CD `Application`, S3 bucket, RDS read-replica, IRSA `Role`, Backstage `Component`, and observability dashboards. Builds on 11 ch.10.	13-grand-capstone-bookstore-platform/02-tenancy-and-crossplane-onboarding.md
Active-active multi-region	A topology where N regions all serve user traffic concurrently (vs active-passive failover); each region runs the full stack, data replicates between them, DNS routes by latency, and a region failure causes only a brief DNS shift.	13-grand-capstone-bookstore-platform/03-multi-region-active-active.md
CloudNativePG `ReplicaCluster`	CloudNativePG's replica-cluster mode for cross-region replication, configured as an ordinary `Cluster` resource with `spec.replica.enabled: true` rather than a separate CRD: one region runs the primary `Cluster`, others run replica clusters that follow it via Postgres streaming replication; promotion is a controlled spec edit.	13-grand-capstone-bookstore-platform/03-multi-region-active-active.md
Latency-based DNS / region affinity	DNS records (Route 53 latency policy / Cloud DNS load-balancing / Azure Traffic Manager performance routing) that resolve each client to the lowest-latency healthy region; the user-facing piece of active-active.	13-grand-capstone-bookstore-platform/03-multi-region-active-active.md
Keycloak	An open-source OIDC + SAML identity provider with realms, clients, users, groups, federation (LDAP / social), and an admin console; the v2 platform's IdP for human auth.	13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md
OIDC code+PKCE flow	The OAuth 2.0 / OIDC flow used by browser + mobile + SPA clients: client redirects to IdP with a PKCE code-verifier hash; IdP authenticates the user and returns an authorization code; client exchanges code + verifier for tokens — no client secret in the browser.	13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md
JWKS (JSON Web Key Set)	The IdP's published set of public signing keys, served at the `jwks_uri` advertised in its OIDC discovery document (often `/.well-known/jwks.json`); verifiers (Istio, API gateways, services) fetch and cache JWKS to validate JWT signatures without sharing a secret with the IdP.	13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md
`RequestAuthentication` (Istio)	The Istio CRD that declares which JWT issuers + JWKS endpoints the mesh accepts on traffic to a workload; failed-validation requests get `401`; verified claims become request attributes for `AuthorizationPolicy`.	13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md
`AuthorizationPolicy` (Istio)	The Istio CRD that allows / denies / logs requests based on source identity (mTLS principal), JWT claims, method/path, and headers; the mesh's L7 authZ checked after `RequestAuthentication` validates the token.	13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md
Meilisearch	An open-source full-text search engine with typo tolerance, faceting, and synonyms; deployed on Kubernetes as the v2 product-discovery backend, indexed from Postgres via Debezium CDC.	13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md
Debezium	An open-source CDC platform that reads database transaction logs (Postgres logical replication / MySQL binlog / MongoDB oplog) and emits row-level change events to Kafka; the canonical Postgres → Kafka bridge for the outbox pattern.	13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md
CDC (Change Data Capture)	The pattern of streaming row-level inserts / updates / deletes out of a database as events, typically from its WAL / binlog — moves data without dual-writes or polling, and turns the DB into the source of truth for downstream consumers (search index, analytics, caches).	13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md
Strimzi	The Kubernetes operator for Apache Kafka: CRDs for `Kafka` (cluster), `KafkaTopic`, `KafkaUser`, `KafkaConnect`, `KafkaConnector`, `KafkaBridge`, `KafkaMirrorMaker2`; the canonical "Kafka on Kubernetes" install used in v2.	13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md
`KafkaConnect` (Strimzi CRD)	A Strimzi CRD that runs Kafka Connect workers in a cluster; the runtime that hosts source / sink connectors (e.g. Debezium Postgres source, Meilisearch sink).	13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md
`KafkaConnector` (Strimzi CRD)	The Strimzi CRD declaring one connector instance (class + config) inside a `KafkaConnect` cluster: e.g. `io.debezium.connector.postgresql.PostgresConnector` for CDC ingest, a Meilisearch sink for indexing.	13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md
Outbox pattern	A reliable cross-system event-emission pattern: the same DB transaction that mutates business state also inserts a row into an `outbox` table; a CDC process (Debezium) streams those rows to Kafka. Avoids the dual-write inconsistency between DB and message bus.	13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md
Saga compensation	The compensating-transaction half of the saga pattern: when a multi-step distributed workflow fails partway, each completed step has a published "undo" step (refund a payment, release stock, cancel a shipment) that is run in reverse order; replaces 2PC across services.	13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md
Stripe sandbox / webhook signature verification	Stripe's test-mode environment (`sk_test_…` keys, `tok_visa` etc.) and the obligatory verification step on inbound webhooks: every webhook carries an HMAC-SHA256 signature, keyed with the endpoint's signing secret, over the timestamp + raw body; the v2 `payments-worker` rejects any webhook without a valid `Stripe-Signature` within a 5-minute window.	13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md
`HTTPRoute` (Gateway API)	The Gateway API CRD that attaches HTTP routing rules (path / header / method matches, weighted backends, filters, timeouts) to a `Gateway` listener; the L7 routing primitive at the v2 edge.	13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md
Coraza	An open-source Go re-implementation of the ModSecurity rules engine; runs inside Envoy / Caddy / nginx as a WAF that loads OWASP CRS rules and produces audit logs in the ModSec format. The v2 edge plugs Coraza into Istio as a Wasm filter.	13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md
WAF (Web Application Firewall)	An L7 inspection layer that blocks common attacks (SQLi, XSS, path traversal, scanner fingerprints) before requests reach the app; in v2 deployed as Coraza+OWASP CRS at the Istio edge in anomaly-scoring mode, blocking a request once its accumulated rule score crosses the configured threshold.	13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md
OWASP CRS (Core Rule Set)	The canonical open-source WAF rule set maintained by OWASP: signatures + anomaly-score rules for SQL injection, XSS, LFI/RFI, RCE, scanners, and protocol violations; consumed by ModSecurity and Coraza.	13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md
MLflow Model Registry	MLflow's versioned model store layered on top of run-tracking: each registered model has versions in stages (`None` / `Staging` / `Production` / `Archived`) and webhook transitions; KServe `InferenceService`s pin against the registry URI rather than a raw artifact path.	13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md
Alibi-Detect	An open-source drift / outlier / adversarial-example detection library (Seldon project); deployed in v2 as a sidecar that computes drift scores over a rolling production window and fires a Kafka event when a threshold is breached.	13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md
Drift detection (data / model / concept)	Continuous monitoring of model inputs and outputs for distribution shift: data drift = input features move; model drift = prediction-quality metrics degrade against a holdout; concept drift = the relationship between inputs and the true target changes — each is detected differently and each demands a retrain.	13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md
`InferenceService` canary (`canaryTrafficPercent`)	The KServe field on `InferenceService.spec.predictor` that routes a percentage of inference traffic to the new revision while the rest stays on the previous one; promoted by raising the percentage or rolled back by clearing the canary spec.	13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md
OpenTelemetry Collector	The vendor-neutral OTel pipeline daemon: receivers (OTLP / Prometheus / Jaeger / etc.), processors (batch / attributes / tail-sampling), and exporters (OTLP / Prometheus remote-write / Tempo / Loki) — deployed in v2 as a DaemonSet (agent) + a Deployment (gateway).	13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md
OTLP (OpenTelemetry Protocol)	The native wire protocol of OpenTelemetry (gRPC or HTTP/Protobuf, ports 4317 / 4318) that carries traces, metrics, and logs from SDKs and Collectors; the v2 standard for emit + collector-to-collector.	13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md
Grafana Tempo	An open-source, horizontally scalable, object-storage-backed trace store (Grafana Labs) that ingests OTLP traces and serves them to Grafana via TraceQL; the v2 trace backend.	13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md
Grafana Loki	An open-source log aggregation system (Grafana Labs) that indexes only labels (not log content) and stores compressed chunks in object storage; the v2 log backend, queried via LogQL.	13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md
Grafana variable templating	Grafana's dashboard variables (`$tenant`, `$region`, `$service`, …) that turn one dashboard into N: a single panel definition is rendered per tenant or per region by pivoting on the selected variable, so the v2 platform team builds each dashboard once, not N times.	13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md
Alertmanager inhibition	Prometheus Alertmanager's mechanism by which one firing alert mutes notifications for others (e.g. a `RegionDown` alert inhibits every per-service alert from that region), preventing pager storms when one root cause fans out into many symptoms.	13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md
FinOps Foundation maturity (Inform / Optimize / Operate)	The three iterative phases of the FinOps Foundation framework, used in the guide as a maturity ladder (the Foundation's own maturity scale is Crawl / Walk / Run): Inform = accurate cost allocation + visibility; Optimize = right-sizing, spot, savings plans, idle removal; Operate = governance, budgets, automation, accountability. The v2 cost chapter ladders through all three.	13-grand-capstone-bookstore-platform/10-cost-opencost-per-tenant-finops.md
Showback vs chargeback	Two cost-allocation postures: showback shows each team its bill but doesn't transfer money (information only); chargeback actually moves budget between teams. Most platforms start with showback to build trust, graduate to chargeback once allocation is provably right.	13-grand-capstone-bookstore-platform/10-cost-opencost-per-tenant-finops.md
Backstage Software Catalog	Backstage's central registry of components / APIs / resources / systems / domains: each entity is a YAML file in Git, ingested by a `LocationEntityProvider` or `ScaffolderEntityProvider`; the v2 catalog is seeded from Argo CD `Application`s + Crossplane claims.	13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md
Backstage Scaffolder	The Backstage subsystem that runs templated "create a new X" flows: a user fills a form (service name, owner, repo, tier) and the Scaffolder runs steps (fetch template, render, push to Git, register `Component`, kick CI) — the v2 golden path for new microservices.	13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md
Backstage TechDocs (MkDocs integration)	Backstage's docs-as-code subsystem: each `Component`'s repo carries a `mkdocs.yml` + `docs/`; the TechDocs builder renders MkDocs into static HTML and Backstage serves it inline against the catalog entity.	13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md
Backstage plugin model	Backstage as a Node.js app with a typed plugin API: frontend plugins are React extensions registered into the app's routing/sidebar; backend plugins are Express-style routers; built-in plugins integrate Argo CD, Kubernetes, GitHub, Prometheus, PagerDuty, costs. The v2 platform composes the IDP from official plugins.	13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md
Runbook	The fixed-shape on-call artifact for one alert: page → check (the four things) → diagnose (the symptom tree) → mitigate (the smallest action that restores service) → postmortem (the follow-up); v2 ships a runbook per alert, all referenced from the alert annotation.	13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md
On-call rotation (primary / secondary)	The duty schedule: a primary engineer carries the pager for the week; a secondary backs them up (acks if primary doesn't in N minutes, joins long incidents); follow-the-sun rotations spread the load across timezones — v2 codifies the structure, not just the schedule.	13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md
DR drill / RTO / RPO	The rehearsed disaster-recovery exercise plus its two targets: RTO (recovery time objective) = how long until the service is back up; RPO (recovery point objective) = how much data loss is acceptable. v2 runs a monthly DR drill measuring both.	13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md
Chaos game-day	A scheduled, blast-radius-bounded chaos-engineering exercise: a hypothesis is declared (e.g. "killing region us-east-1 shifts traffic in <2 minutes"), Chaos Mesh experiments run, observability is watched, results feed back into the runbook. v2 schedules one quarterly.	13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md
Blameless postmortem	The post-incident document that explains what happened, in what timeline, with what impact, with what contributing factors, and with action items owned by a name and a date, without naming-and-blaming individuals; the only kind of postmortem an SRE org publishes. (Also see 15 ch.10 for the production-grade 48h-draft / 5-day-publish discipline.)	13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md
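
The webhook signature verification entry in this table describes an HMAC check plus a replay window. The sketch below is a minimal illustration under stated assumptions: it follows Stripe's documented scheme (a `Stripe-Signature` header of the form `t=<timestamp>,v1=<hex digest>`, with HMAC-SHA256 computed over `<timestamp>.<raw body>`), while `sign_payload`, `verify_webhook`, and the secret value are hypothetical names rather than code from the v2 `payments-worker`. Production code would normally rely on the official Stripe SDK's verification helper instead.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # the 5-minute replay window mentioned in the glossary


def sign_payload(secret: str, timestamp: int, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of '<timestamp>.<body>' keyed with the secret."""
    signed_payload = f"{timestamp}.".encode() + body
    return hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()


def verify_webhook(secret: str, header: str, body: bytes,
                   now: Optional[float] = None) -> bool:
    """Check a 't=...,v1=...' header against the raw request body."""
    pairs = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamps = [value for key, value in pairs if key.strip() == "t"]
    signatures = [value for key, value in pairs if key.strip() == "v1"]
    if not timestamps or not signatures:
        return False
    try:
        timestamp = int(timestamps[0])
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False  # too old (or too far in the future): possible replay
    expected = sign_payload(secret, timestamp, body).encode()
    # Constant-time comparison avoids leaking how many characters matched.
    return any(hmac.compare_digest(expected, sig.encode()) for sig in signatures)


if __name__ == "__main__":
    demo_secret = "whsec_example"  # hypothetical endpoint secret
    demo_body = b'{"id": "evt_example"}'
    ts = int(time.time())
    demo_header = f"t={ts},v1={sign_payload(demo_secret, ts, demo_body)}"
    print(verify_webhook(demo_secret, demo_header, demo_body))
```

The check must run against the raw request bytes; re-serializing parsed JSON before verifying would change the bytes and break the signature.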

Part 14 — EKS in Production: A-Z terms¶

Term	Definition	Covered in
`use_lockfile = true`	Terraform 1.10+ native S3 backend state-locking flag: instead of the legacy DynamoDB lock table, the lock is a sibling `*.tflock` object next to the state file in S3 — one fewer cloud resource to provision, and no IAM glue between the state bucket and a lock table.	14-eks-in-production-a-to-z/01-terraform-state-in-production.md
Bucket bootstrap pattern	The chicken-and-egg fix for Terraform-managing-its-own-state: a tiny `bootstrap-state.sh` (AWS CLI, not Terraform) creates the S3 bucket + KMS key + versioning + lifecycle once; thereafter the bucket holds its own state and Terraform takes over.	14-eks-in-production-a-to-z/01-terraform-state-in-production.md
`terraform state` commands	The four state-surgery subcommands every operator needs at least once: `list` (enumerate), `show` (inspect), `rm` (drop without destroying the cloud resource), and `mv` / `replace-provider` (relocate or rebind addresses) — used when a refactor or provider migration outpaces a plain `apply`.	14-eks-in-production-a-to-z/01-terraform-state-in-production.md
Terraform workspaces (vs Terragrunt vs separate roots)	Three ways to slice one Terraform tree into many environments: built-in workspaces (one root, many state files, shared variables) vs Terragrunt (a wrapper that DRY-templates separate roots with locked module versions) vs separate root modules (one directory per environment, fully duplicated, fully isolated); the chapter picks separate roots for blast-radius isolation.	14-eks-in-production-a-to-z/01-terraform-state-in-production.md
EKS standard support window	The 14-month period during which AWS patches an EKS minor version at the published k8s SLA (security CVEs, kube-apiserver/kubelet bugfixes) at the base $0.10/cluster-hour control-plane price; ends with the version's standard-support EOL, after which extended-support pricing kicks in automatically.	14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md
EKS extended support window	The 12-month grace period AWS offers after standard-support EOL: the cluster keeps running and receiving security patches but is billed at $0.60/cluster-hour, a $0.50/hour premium over standard support (~$365/month surcharge per cluster), giving teams a hard deadline to upgrade before the version is forcibly retired.	14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md
Blue-green cluster pattern	The N+1-version skip pattern: instead of in-place upgrading a single cluster through every minor (1.27 → 1.28 → 1.29 → 1.30), stand up a new cluster on the target version, validate workloads on it, shift traffic via DNS / load balancer, then tear down the old; the only safe path when you've fallen many versions behind.	14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md
`kubent` (kube-no-trouble)	An open-source CLI that scans a live cluster (or a directory of manifests) for API resources whose `apiVersion` is deprecated or removed in upcoming Kubernetes minors; the pre-upgrade gate that catches "this still works on 1.28 but will break on 1.30" before the in-place bump.	14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md
EKS managed addon	AWS's managed lifecycle for a cluster-critical addon (`vpc-cni`, `kube-proxy`, `coredns`, `aws-ebs-csi-driver`, `aws-efs-csi-driver`, etc.): AWS picks compatible versions per EKS minor, applies updates on request, and gates conflicts via the `resolve_conflicts_on_create` / `_on_update` policy — declared in Terraform as `aws_eks_addon`, not as a Helm release.	14-eks-in-production-a-to-z/03-eks-addon-management.md
`before_compute = true`	The Terraform AWS EKS module flag that forces the VPC-CNI addon to install before the first managed node group's nodes register; without it, nodes can become Ready before the CNI is installed, Pods get assigned to nodes with no network, and the cluster spends an hour in CrashLoopBackOff.	14-eks-in-production-a-to-z/03-eks-addon-management.md
`resolve_conflicts_on_create / _on_update`	EKS addon flags that decide what happens when an addon's config differs from AWS's defaults: `OVERWRITE` (AWS wins, replaces local edits) vs `PRESERVE` (local edits win); the difference between "kubectl edit ConfigMap aws-node survives the addon update" and "kubectl edit was silently overwritten on next reconcile".	14-eks-in-production-a-to-z/03-eks-addon-management.md
gp3 default StorageClass	The production-grade EBS volume type that should replace EKS's gp2 default the day a cluster lands: 20% cheaper at the same baseline, IOPS and throughput tunable independently of volume size, no burst-credit footgun; the chapter ships the Terraform / kubectl patch that swaps the default annotation.	14-eks-in-production-a-to-z/04-storage-classes-and-ebs.md
`WaitForFirstConsumer`	The StorageClass `volumeBindingMode` that defers PV provisioning until a Pod referencing the PVC is scheduled to a node; on EKS this keeps the EBS volume in the same AZ as the Pod, preventing the cross-AZ-mount failure mode of `Immediate` binding.	14-eks-in-production-a-to-z/04-storage-classes-and-ebs.md
`VolumeSnapshotContent` (cluster-scoped) vs `VolumeSnapshot` (namespaced)	The Kubernetes snapshot API's two halves: `VolumeSnapshot` is the namespaced user-facing request ("snapshot my PVC"), and `VolumeSnapshotContent` is the cluster-scoped representation of the actual cloud snapshot — exactly the PV/PVC split, one level up.	14-eks-in-production-a-to-z/04-storage-classes-and-ebs.md
AWS Budgets actions	The AWS Budgets feature that goes beyond email alerts: a breached budget can fire an SNS topic, attach a deny-IAM policy, stop EC2 instances, or trigger a Step Function — the only way to make a budget a hard guardrail instead of an after-the-fact alarm.	14-eks-in-production-a-to-z/06-cost-guardrails.md
infracost	An open-source CLI that turns a `terraform plan` JSON into a per-resource USD cost estimate against current cloud pricing, and a GitHub Action that posts a diff comment on every Terraform PR — the CI-side guardrail that catches a 50-NAT-gateway accidental fan-out before it merges.	14-eks-in-production-a-to-z/06-cost-guardrails.md
OIDC trust (GitHub Actions for AWS)	The OIDC federation pattern that lets a GitHub Actions workflow assume an AWS IAM role with no long-lived access keys: GitHub mints a short-lived OIDC token, AWS STS exchanges it for temporary credentials, and a trust policy on the role pins the GitHub org / repo / branch / environment that may assume it.	14-eks-in-production-a-to-z/07-infrastructure-cicd-and-drift.md
driftctl	An open-source drift-detection CLI that compares Terraform state against the actual cloud inventory and flags unmanaged + drifted + missing resources; the cluster-wide drift answer when `terraform plan -detailed-exitcode` only sees the resources in this state file.	14-eks-in-production-a-to-z/07-infrastructure-cicd-and-drift.md
Atlantis	An open-source Terraform CI server that runs `terraform plan` on every PR, posts the plan as a PR comment, and gates `apply` behind an `atlantis apply` PR comment + branch protection — the self-hosted alternative to Terraform Cloud / Spacelift for plan-on-PR + apply-on-merge.	14-eks-in-production-a-to-z/07-infrastructure-cicd-and-drift.md
VPC Gateway endpoint vs Interface endpoint	Two shapes of VPC endpoint: Gateway endpoints (S3, DynamoDB) are free, route through the VPC route table, skip NAT entirely, and do not use PrivateLink; Interface endpoints (ECR, STS, EC2, CloudWatch Logs, KMS, …) are PrivateLink-backed, cost $0.01/hr per AZ + $0.01/GB processed, and use private ENIs in your subnets.	14-eks-in-production-a-to-z/08-vpc-endpoints-and-egress.md
Graviton (AWS arm64)	AWS's ARM-based EC2 instance line (c7g / m7g / r7g, and successors); typically ~20% cheaper than the equivalent x86 instance at the same SLA, identical from a Kubernetes perspective — provided every container image ships an `arm64` variant in its manifest.	14-eks-in-production-a-to-z/09-arm-graviton-on-eks.md
Multi-arch container image (`docker buildx --platform`)	A single image tag whose OCI manifest is a list pointing at per-platform variants (`linux/amd64`, `linux/arm64/v8`); `docker buildx build --platform linux/amd64,linux/arm64` produces it. The prerequisite for running Graviton + x86 nodes in the same cluster.	14-eks-in-production-a-to-z/09-arm-graviton-on-eks.md
App-of-Apps (Argo CD)	A bootstrap pattern where a single root Argo CD `Application` reconciles a directory of other `Application` manifests, each of which reconciles a real workload; the GitOps version of "an array of arrays" that lets one Application govern dozens of children. Also see 14 ch.10 for the EKS bootstrap path.	07-delivery/04-gitops-argocd.md
Argo CD self-management	The GitOps loop where Argo CD's own manifests (Helm release values, projects, RBAC) live in Git and are reconciled by Argo CD itself; the second-stage payoff of the Terraform bootstrap — after `terraform apply` installs Argo CD once, the same Argo CD adopts itself and Terraform never touches it again.	14-eks-in-production-a-to-z/10-gitops-bootstrap-fresh-cluster.md
Route 53 latency-based routing	Route 53's routing policy that returns the IP of the lowest-latency healthy region for each resolver (measured continuously by AWS); the DNS-side primitive of cloud active-active, with TTL choosing how quickly clients shift on a regional failure.	14-eks-in-production-a-to-z/11-multi-region-active-active-cloud.md
AWS Global Accelerator	An AWS edge product that anycast-publishes two static IPs in the AWS global network and steers clients to the nearest healthy region via the AWS backbone; reduces failover time from DNS-TTL (60s) to ~30s and improves jitter, at $0.025/hr + transfer per accelerator.	14-eks-in-production-a-to-z/11-multi-region-active-active-cloud.md
CNPG `ReplicaCluster` (cloud reality)	The cloud-deployed shape of Part 13 ch.03's pattern: CloudNativePG `Cluster` in region A acts as primary, `ReplicaCluster`s in regions B + C stream from it over Transit Gateway peering, promotion is a controlled `spec.replica.enabled: false` toggle — measured here against real RTO / RPO numbers. Also see 13 ch.03 for the kind-local shape.	14-eks-in-production-a-to-z/11-multi-region-active-active-cloud.md
Cosign keyless signing	The Sigstore signing path that uses OIDC short-lived certs (via Fulcio) and a transparency log (Rekor) instead of a long-lived signing keypair: the workflow's OIDC identity is the cert subject, the cert is good for ~10 minutes, the Rekor entry is permanent; no key to rotate, no key to lose. Also see 15 ch.03.	14-eks-in-production-a-to-z/12-supply-chain-security.md
syft (SBOM generation)	An open-source CLI (Anchore) that scans a container image / filesystem / source tree and emits an SBOM in SPDX-JSON or CycloneDX-JSON; the SBOM is then bound to the image digest with `cosign attest` so admission can verify provenance + content together.	14-eks-in-production-a-to-z/12-supply-chain-security.md
grype (CVE scanner)	The companion to syft (Anchore): takes an SBOM or an image and emits a CVE report scoped to actually-installed package versions; used in CI to fail builds when a Critical/High CVE crosses a policy threshold.	14-eks-in-production-a-to-z/12-supply-chain-security.md
SLSA framework	The "Supply chain Levels for Software Artifacts" framework from Google / OpenSSF: graded levels (L1–L4 in the original v0.1 spec; the v1.0 Build track defines L0–L3) of increasing guarantee that a build's source, build process, and provenance are tamper-resistant; L3 is the practical target for production CI/CD (signed provenance + hermetic build + isolated runner). Also see 15 ch.03.	14-eks-in-production-a-to-z/12-supply-chain-security.md
ECR enhanced scanning	AWS ECR's premium scanning tier (Inspector-powered): continuous CVE scans of pushed images, OS + language-package coverage, results published to Inspector findings; $0.09/image/month, vs the free "basic" tier that scans once on push only.	14-eks-in-production-a-to-z/12-supply-chain-security.md
Kyverno `verifyImages`	The Kyverno `ClusterPolicy` rule type that gates admission on cosign signatures: matches an image (registry / repo / tag glob) against an expected OIDC `issuer` + `subject` regex, rejects on mismatch; the production gate that turns "we sign images" into "unsigned images cannot run".	14-eks-in-production-a-to-z/12-supply-chain-security.md
Falco (eBPF driver)	An open-source CNCF runtime-security tool: kernel-level system-call observer (modern eBPF driver replaces the old kernel module) + a YAML rules language (`falco_rules.yaml`) that fires on policy violations ("a shell spawned in a container", "writes to `/etc/`") with severity + tags.	14-eks-in-production-a-to-z/13-runtime-defense-and-container-security.md
Tetragon	Cilium-project runtime-security tool: pure-eBPF, kernel-attached `TracingPolicy` CRD that filters and (optionally) enforces on syscall events; lower overhead than Falco for high-volume rule sets, and can block in-kernel rather than only log.	14-eks-in-production-a-to-z/13-runtime-defense-and-container-security.md
GuardDuty for EKS (Audit + Runtime)	AWS's managed threat-detection for EKS clusters: Audit Log Monitoring ingests EKS control-plane audit logs and flags suspicious API patterns; Runtime Monitoring runs an in-cluster agent that observes process / file / network behaviour; both produce GuardDuty findings, and billing follows usage (audit-log event volume analysed and monitored vCPUs) rather than the number of findings.	14-eks-in-production-a-to-z/13-runtime-defense-and-container-security.md
Velero BSL / VSL / Schedule / Kopia	The four moving pieces of a Velero install: BSL (`BackupStorageLocation`, the object-store bucket for API-object dumps), VSL (`VolumeSnapshotLocation`, the cloud-snapshot configuration for PVs), Schedule (a CronJob-shaped `Schedule` CRD), and Kopia (the default content-addressable de-duplicating uploader). Also see 08 ch.02 for the foundational Velero concepts.	14-eks-in-production-a-to-z/14-backup-and-restore-velero.md
Cilium native routing	Cilium's no-overlay mode where Pod traffic is routed directly through the VPC routing table (one IP per Pod, real VPC reachability) rather than via VXLAN encapsulation; the EKS-flavoured Cilium install when the goal is wire-speed and VPC-Flow-Logs visibility.	14-eks-in-production-a-to-z/15-cilium-ebpf-on-eks.md
Hubble (Cilium observability)	The Cilium project's flow-level observability layer: every L3/L4 + L7 flow Cilium handles is exposed as a structured event (identity-aware, not just IP-aware) consumable by `hubble observe`, the Hubble UI, or Hubble's metric exporter; the visibility VPC Flow Logs cannot give you.	14-eks-in-production-a-to-z/15-cilium-ebpf-on-eks.md
Telepresence (personal-intercept)	An open-source dev-loop tool: redirects a single deployment's traffic from a real cluster Pod to a process running on the developer's laptop ("personal intercept"); the developer debugs locally while the request still hits real cluster dependencies.	14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md
Mirrord (mirror / steal modes)	A dev-loop tool from MetalBear: mirror mode copies traffic to a laptop process for read-only debugging (production-safe); steal mode redirects traffic for full request-response handling (interactive debugging); the more lightweight cousin of Telepresence.	14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md
Skaffold (sync mode)	A Google dev-loop CLI: watches source, rebuilds + redeploys on change; sync mode copies edited files directly into a running container (skipping the Docker build entirely) for static-asset / interpreted-language inner loops.	14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md
Tilt	A Tiltfile-driven dev-loop orchestrator (Starlark config); watches files, rebuilds + redeploys, and exposes a single dashboard with live logs, build status, and pod health across the whole micro-architecture; the multi-service complement to Skaffold's single-service focus.	14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md
Devcontainer	The open Devcontainer Specification (Microsoft / VS Code): a `devcontainer.json` declaring the IDE-in-a-container — base image, features, post-create commands, port forwards — so every developer gets the same toolchain regardless of laptop OS; also runs in GitHub Codespaces.	14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md
AWS Config conformance pack	An AWS Config feature that bundles a curated set of Config rules + remediation actions into a deployable YAML; the chapter ships an EKS-shaped pack covering encryption-at-rest, public-access prevention, IRSA-only workloads, and tag governance — the "audit baseline" you apply once per account.	14-eks-in-production-a-to-z/17-cross-region-dr-account-baseline-90-day-runbook.md
IAM Access Analyzer	An AWS-managed service that continuously analyses IAM policies + resource policies for over-privileged access and external trust-policy exposure; the production-account guardrail that flags "this S3 bucket is public" or "this role can be assumed by an unknown account" before a breach.	14-eks-in-production-a-to-z/17-cross-region-dr-account-baseline-90-day-runbook.md
90-day production-readiness runbook	The Part 14 capstone artefact: a structured 13-week onboarding plan for a team taking over an EKS production platform, with weekly checkpoints (state hygiene → cluster lifecycle → addons → storage → cost → CI/CD → networking → Graviton → GitOps → multi-region → supply chain → runtime defense → backup → eBPF → DX → capstone deliverable), each tied to a concrete artefact in `examples/bookstore-platform/`.	14-eks-in-production-a-to-z/17-cross-region-dr-account-baseline-90-day-runbook.md
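
The extended-support surcharge quoted in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 730-hour average month and a $0.50/cluster-hour premium on top of the $0.10 base rate, which is what the ~$365/month figure implies. The names `HOURS_PER_MONTH` and `monthly_cost` are hypothetical helpers, not part of any AWS tooling.

```python
# Assumption: an "average" month of 8760 h / 12 = 730 h.
HOURS_PER_MONTH = 730

# Rates taken from the glossary entries above (USD per cluster-hour).
STANDARD_RATE = 0.10        # standard-support control-plane price
EXTENDED_SURCHARGE = 0.50   # assumed premium during extended support


def monthly_cost(rate_per_hour: float, clusters: int = 1) -> float:
    """Return the monthly USD cost for `clusters` clusters at an hourly rate."""
    return round(rate_per_hour * HOURS_PER_MONTH * clusters, 2)


if __name__ == "__main__":
    print("standard, 1 cluster:  ", monthly_cost(STANDARD_RATE))
    print("surcharge, 1 cluster: ", monthly_cost(EXTENDED_SURCHARGE))
    print("surcharge, 3 clusters:", monthly_cost(EXTENDED_SURCHARGE, clusters=3))
```

Multiplying by the number of clusters that have slipped past standard-support EOL gives a rough monthly figure to bring to the upgrade-planning conversation; actual bills depend on region and on AWS's current price list.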

Part 15 — Day-to-Day Production Operations terms¶

Term	Definition	Covered in
PR-to-production lifecycle	The mental model Part 15 is built around: a change moves through eight stages (commit → PR → CI → review/merge → GitOps repo update → Argo CD reconcile → progressive rollout → SLO gate → done) and the load-bearing rule that every production change is a Git commit — nothing else.	15-day-to-day-production-ops/01-pr-to-production-lifecycle.md
GitHub Actions OIDC + `role-to-assume`	The application-side use of OIDC trust: a GitHub Actions job sets `permissions: id-token: write`, calls `aws-actions/configure-aws-credentials` with `role-to-assume: <ARN>`, and gets short-lived STS credentials — no static AWS keys in repository secrets. Also see 14 ch.07.	15-day-to-day-production-ops/02-application-cicd-pipelines.md
Branch-protection rules	GitHub repo settings that gate merges into a protected branch: require N approving reviews, require status checks, require linear history, dismiss stale reviews on push, require signed commits, disallow force-push; the merge-gate side of the CI discipline.	15-day-to-day-production-ops/02-application-cicd-pipelines.md
Required status checks	The branch-protection subset that names the CI workflow jobs (`lint`, `test`, `scan`, `build`, `sign`) that MUST pass before a PR can merge; the Git-side enforcement of "tests are not optional", paired with `required = true` on each check.	15-day-to-day-production-ops/02-application-cicd-pipelines.md
Cosign keyless via OIDC token	The CI-side application of keyless signing: a GitHub Actions job calls `cosign sign --yes <IMAGE>`, the workflow's OIDC token authenticates to Fulcio, the resulting cert lists the workflow's repo + branch + workflow-file path as subject; an admission policy can then accept only signatures from a specific workflow. Also see 14 ch.12.	15-day-to-day-production-ops/03-image-signing-and-provenance.md
SLSA provenance attestation	A signed JSON document conforming to the SLSA provenance schema (subject digest + builder identity + build invocation + materials), produced by `docker buildx build --provenance=true` or by `cosign attest --predicate provenance.json`; consumed at admission to verify "this image was built by this workflow on this commit". Also see 14 ch.12.	15-day-to-day-production-ops/03-image-signing-and-provenance.md
Multi-environment promotion (dev / staging / prod gates)	The dev → staging → prod pipeline as Git mechanics: three Kustomize overlays differ only in image-tag + secret-source + scale, promotion is a Git PR that bumps `images[].newTag` in the next environment's overlay, each environment gates on its own Argo CD sync + analysis run.	15-day-to-day-production-ops/04-multi-environment-promotion.md
Argo CD `ApplicationSet` (Cluster generator)	The ApplicationSet generator that fans one template into N `Application`s — one per registered Argo CD cluster — with cluster labels driving overlay path / branch / target namespace; the production primitive for "same app, three environments" or "same app, N regions". Also see 11 ch.06.	15-day-to-day-production-ops/04-multi-environment-promotion.md
Vault Kubernetes auth method	Vault's auth backend that trusts the cluster's ServiceAccount projected tokens: a Pod presents its projected token, Vault validates it through the cluster's TokenReview API, looks up the bound role, and issues a Vault token with attached policies — the production-grade alternative to AppRole or static tokens. Also see 11 ch.05.	15-day-to-day-production-ops/05-production-secrets-vault-eso.md
Vault dynamic database secrets	A Vault secrets engine (`database/`) that mints a short-lived Postgres username + password on each request, with TTL + max-TTL; the app gets a fresh credential per lease, Vault revokes on TTL expiry, and a leaked credential becomes worthless within minutes.	15-day-to-day-production-ops/05-production-secrets-vault-eso.md
External Secrets Operator (ESO) — production	The production deepening of ESO: a real HA Vault `ClusterSecretStore`, per-tenant `ExternalSecret` resources, `refreshInterval` tuned against Vault lease TTL, conflict policy + Helm-templated `target.template` so the resulting Kubernetes `Secret` carries app-shaped keys rather than raw Vault paths. Also see 11 ch.05.	15-day-to-day-production-ops/05-production-secrets-vault-eso.md
Secret rotation (lease TTL / refresh interval)	The two-clock rotation discipline: Vault `lease_ttl` decides how often the source credential changes; ESO `refreshInterval` decides how often the Kubernetes Secret re-fetches; the two clocks must satisfy `refreshInterval << lease_ttl` so Pods always see a valid credential.	15-day-to-day-production-ops/05-production-secrets-vault-eso.md
Argo Rollouts `AnalysisTemplate` (production SLO gate)	The CRD that defines reusable metric queries (Prometheus / Datadog / NewRelic / WebExpression) with success / failure thresholds; referenced by `Rollout.spec.strategy.canary.analysis` to gate promotion on real SLO metrics (success-rate, p99 latency, saturation) rather than wall-clock pauses. Also see 07 ch.05.	15-day-to-day-production-ops/06-progressive-delivery-in-production.md
Argo Rollouts canary vs blue-green (production)	The two production rollout strategies with different traffic semantics: canary shifts a percentage at a time (works for idempotent stateless services); blue-green flips 100% on success (the only safe choice for stateful workloads where in-flight transactions can't be split across versions).	15-day-to-day-production-ops/06-progressive-delivery-in-production.md
Argo Rollouts auto-rollback	The default `Rollout` behaviour when an `AnalysisRun` fails: the controller automatically aborts the rollout, shifts 100% of traffic back to the stable ReplicaSet, scales the new ReplicaSet to 0, and emits a `RolloutAborted` event — the production safety net that makes "deploy to prod" survivable.	15-day-to-day-production-ops/06-progressive-delivery-in-production.md
Rollback layer matrix (code / data / config)	The decision matrix every production platform needs: code rollback (Argo CD revision pin, Argo Rollouts abort, Helm rollback) when the new binary is bad; data rollback (Postgres PITR, S3 versioning, Velero restore) when data was corrupted; config rollback (`git revert` an Argo CD Application's targetRevision) when a manifest change caused the symptom — picking the wrong layer makes the outage worse.	15-day-to-day-production-ops/07-rollback-playbook.md
Postgres point-in-time recovery (PITR)	The Postgres recovery shape backed by continuous WAL archiving: every committed transaction's WAL segment ships to object storage, and a restore can roll forward to any `recovery_target_time`; CloudNativePG implements this as `Cluster.spec.backup` + `Cluster.spec.bootstrap.recovery.recoveryTarget.targetTime`.	15-day-to-day-production-ops/07-rollback-playbook.md
S3 versioning rollback	The simplest data-layer rollback: an S3 bucket with versioning enabled retains every object version (and delete-marker); restoring is `aws s3api copy-object` of the previous version on top of the current key, or a bulk replay via S3 Inventory + Batch Operations.	15-day-to-day-production-ops/07-rollback-playbook.md
Forward-compatible schema (rollback prerequisite)	The disciplinary requirement for safe code rollback: every database schema change ships in two phases — additive change first (new column / new table / nullable defaults) deployed and stabilized, then the code that uses it; rolling back the code leaves the schema valid because the old code never read the new column.	15-day-to-day-production-ops/07-rollback-playbook.md
Feature flag (vs config)	A run-time boolean / variant decision served by a flag service, distinguishing two cohorts in one binary — vs an environment config value, which is applied at deploy time and applies to all traffic; the discipline that decouples deploy (binary lands) from release (feature becomes visible).	15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md
OpenFeature	A CNCF-incubating vendor-neutral feature-flag SDK + spec: app code calls a single `client.GetBooleanValue("flag", default, evalCtx)` API; the actual provider (Flagsmith / LaunchDarkly / Unleash / GoFeatureFlag / in-memory) is configured by a Provider implementation injected at startup.	15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md
Flagsmith / LaunchDarkly / Unleash (feature-flag providers)	Three production options behind OpenFeature: Flagsmith (open-source, self-hosted, the chapter default), LaunchDarkly (managed SaaS; lower ops cost, data-residency surcharge), Unleash (open-source self-hosted alternative with a stronger client-side SDK story); pick on hosting + data-residency + budget.	15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md
Dark launch (deploy ≠ release)	The production shape where new code ships to production behind a flag with default-off; the binary is live but the feature is invisible, log-only, or restricted to internal traffic; the flag flip is the release event.	15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md
Kill-switch flag	A boolean flag whose only purpose is to instantly disable a code path in production without a deploy; flipping the kill switch from `true` → `false` changes behaviour within the flag service's TTL — typically seconds — vs a deploy + rollout that takes minutes.	15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md
Hotfix workflow + breakglass	The emergency lane when normal CI/CD is too slow for a P0: branch-protection bypass via repo admin, CI fast-path that keeps `scan` but skips slow integration suites, a breakglass IAM role with full admin + 1-hour TTL + every action audited to CloudTrail, and the post-incident cleanup that rotates credentials + drift-checks Terraform.	15-day-to-day-production-ops/09-hotfix-workflow-and-breakglass.md
Breakglass IAM role (time-limited admin)	An IAM role with full admin permissions but a 1-hour STS session TTL + an explicit assume-role trust policy requiring MFA + a CloudTrail alarm on every assume-event; assumed only during P0s, and the post-incident step revokes any still-active sessions for the role + reviews every action it took.	15-day-to-day-production-ops/09-hotfix-workflow-and-breakglass.md
Audit-log immutability (CloudTrail Stop/Delete denied)	An AWS Organizations service control policy (SCP at the org level, or an explicit deny in the production-account boundary) that forbids `cloudtrail:StopLogging`, `cloudtrail:DeleteTrail`, and `s3:DeleteObject` against the audit-log bucket — even for the breakglass admin role; an attacker with admin still cannot erase the trace.	15-day-to-day-production-ops/09-hotfix-workflow-and-breakglass.md
Incident severity matrix (P0 / P1 / P2 / P3)	The triage scale every production team needs: P0 = user-visible outage / data loss (page everyone, war room); P1 = degraded experience for many users; P2 = single-service or low-impact; P3 = noise / known-issue / scheduled-fix; each tier maps to a different page policy, escalation cadence, and postmortem requirement.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
Mean time to acknowledge (MTTA)	The on-call metric measuring the time between a page firing and a human responding (typically ack-via-PagerDuty); the leading indicator of "are alerts paging the right person?" — target is single-digit minutes for P0, low double-digits for P1.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
Mean time to mitigate (MTTM)	The on-call metric measuring time between page-ack and the customer-visible symptom being mitigated (not necessarily root-caused); the practical SLA for "how long was the user impacted" — target depends on tier but is typically 15 min P0, 1 hour P1.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
Mean time to detect (MTTD)	The on-call metric measuring time between the real start of an incident (in the logs / metrics) and the page firing; the leading indicator of "are we monitoring the right thing?" — low MTTD requires user-facing SLO alerts, not just CPU / disk thresholds.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
5 Whys analysis	The root-cause technique borrowed from Toyota: ask "why did that happen?" five times in sequence, each answer becoming the next question; the structured way to push past the proximate cause (the pod OOMed) to a contributing cause (the request payload grew 10x because a feature flag flipped) to a systemic cause (no canary on flag flips).	15-day-to-day-production-ops/10-incident-response-and-on-call.md
Postmortem deadline: 48h draft + 5-day publish	The deadline discipline that prevents postmortem rot: a draft (timeline + impact + action items) within 48h of the incident, a published + reviewed postmortem within 5 working days; missing either deadline triggers an escalation to the engineering manager rather than a slip.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
On-call handoff	The weekly handover ritual between primary on-call shifts: outgoing engineer walks incoming through (1) live incidents, (2) the page-volume last week, (3) the open-action-items dashboard, (4) any runbook gaps discovered; turns on-call from "luck of the draw" into a continuous improvement loop.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
Alertmanager inhibition rules (production)	Prometheus Alertmanager `inhibit_rules` that suppress noisy child alerts when a parent root-cause alert is firing (e.g. `RegionDown` inhibits every per-service alert in that region); the production-grade defence against pager storms during cluster-wide outages. Also see 13 ch.09.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
Runbook URL annotation (alert hygiene)	The mandatory `annotations.runbook_url` on every PrometheusRule alert: clicking it from PagerDuty / Slack opens the alert's specific runbook (symptoms → diagnosis → mitigation → owners). The hygiene rule: an alert without a runbook URL fails CI.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
War room (synchronous video bridge for P0)	The synchronous response artefact for P0 incidents: an always-open Zoom / Meet / Teams bridge in the on-call channel, the IC + SME + comms-owner join immediately, status is broadcast in chat to keep async stakeholders informed; the structure that turns three siloed Slack threads into one converging response.	15-day-to-day-production-ops/10-incident-response-and-on-call.md
Cost review cadence (weekly Monday)	The fixed weekly slot when the platform team reviews last week's cost (OpenCost per-tenant + AWS Cost Explorer + savings-plan utilisation) and decides on follow-up actions; without a fixed cadence cost reviews drift into "we'll look next quarter" and the bill grows.	15-day-to-day-production-ops/11-day-to-day-production-ops.md
Capacity review cadence (bi-weekly Friday)	The fixed bi-weekly slot for capacity decisions: nodepool sizes, Karpenter `disruption.budgets`, HPA min/max, PDBs, request-vs-limit drift; the forum where "we should bump the system NodePool" stops being a Slack message and becomes a tracked change.	15-day-to-day-production-ops/11-day-to-day-production-ops.md
90-day production-ownership runbook	The Part 15 capstone artefact: a structured 90-day plan for a team taking over a production Bookstore Platform v2 — weeks 1–4 (orient + on-call shadowing + the lifecycle), weeks 5–8 (own the change discipline: CI/CD + signing + rollback + flags + hotfix), weeks 9–12 (own production: incidents + cadence + the 90-day check-in), with explicit checkpoints + deliverables + readiness scorecard.	15-day-to-day-production-ops/12-capstone-first-90-days.md
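
The three on-call metrics defined in the table (MTTD, MTTA, MTTM) partition an incident's timeline into consecutive intervals, which a short sketch can make concrete. This is a minimal illustration, not the guide's tooling: the `Incident` record, its field names, and `on_call_metrics` are hypothetical, and the timestamps are invented sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the four timestamps the metrics need."""
    started: datetime    # real start, as later reconstructed from logs / metrics
    paged: datetime      # the page fired
    acked: datetime      # a human acknowledged the page
    mitigated: datetime  # customer-visible symptom mitigated


def _mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def on_call_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTD, MTTA and MTTM (in minutes) using the glossary definitions."""
    return {
        "mttd_min": _mean_minutes([i.paged - i.started for i in incidents]),
        "mtta_min": _mean_minutes([i.acked - i.paged for i in incidents]),
        "mttm_min": _mean_minutes([i.mitigated - i.acked for i in incidents]),
    }


# Invented sample data: two incidents starting at the same wall-clock time.
t0 = datetime(2024, 1, 1, 12, 0)
m = lambda minutes: t0 + timedelta(minutes=minutes)
sample = [
    Incident(started=m(0), paged=m(4), acked=m(7), mitigated=m(22)),
    Incident(started=m(0), paged=m(6), acked=m(11), mitigated=m(36)),
]

if __name__ == "__main__":
    print(on_call_metrics(sample))
```

Keeping the three intervals separate matters in review: a high MTTD points at monitoring gaps, a high MTTA at paging or rota problems, and a high MTTM at runbook or rollback gaps.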

See also: Appendix A — kubectl cheatsheet for the commands behind these terms, Appendix C — YAML & API conventions for the API/SSA/deprecation mechanics, and the official Kubernetes glossary: https://kubernetes.io/docs/reference/glossary/.