Appendix B: Glossary
Seeded with the original guide and expanded as new Parts shipped. Every significant term and acronym introduced across the guide (currently Parts 00–15, 115 chapters) appears here with a precise one-to-three-sentence definition and a link to the chapter that defines it or covers it most fully. Definitions are deliberately brief; the linked chapter has the full treatment and the "why".
Entries are sectioned by domain, roughly in the order the guide introduces them. Use your browser's find to locate a specific term; the headword is in bold.
Cluster architecture & the API model
| Term | Definition | Covered in |
|---|---|---|
| Kubernetes (k8s) | An open-source platform that runs containerized workloads across a cluster of machines, continuously reconciling declared desired state. | 00-foundations/01-why-kubernetes.md |
| Cluster | A set of worker nodes plus a control plane that together run and manage containerized workloads as one logical system. | 00-foundations/03-architecture-overview.md |
| Control plane | The components that make global decisions and detect/respond to events: kube-apiserver, etcd, scheduler, controller-manager (and cloud-controller-manager). | 00-foundations/04-control-plane-deep-dive.md |
| Node | A worker machine (VM or physical) that runs Pods; each runs a kubelet, a container runtime, and kube-proxy, and reports capacity/health to the control plane. | 00-foundations/05-node-components.md |
| kube-apiserver | The front door of the cluster: a REST API server that authenticates, authorizes, admits, and validates objects and is the only component that talks to etcd. | 00-foundations/04-control-plane-deep-dive.md |
| etcd | The consistent, distributed key-value store (Raft consensus) that holds the entire cluster state; the single source of truth, written only by the apiserver. | 00-foundations/04-control-plane-deep-dive.md |
| kube-scheduler | The control-plane component that watches for unscheduled Pods and binds each to a feasible node by filtering then scoring nodes. | 04-scheduling/01-scheduler-and-nodes.md |
| kube-controller-manager | A single binary running the built-in controllers (Deployment, ReplicaSet, Node, Job, EndpointSlice, …), each a reconciliation loop. | 00-foundations/04-control-plane-deep-dive.md |
| cloud-controller-manager | The control-plane component that integrates a cloud provider (LoadBalancer services, node lifecycle, routes); absent on bare local clusters. | 00-foundations/04-control-plane-deep-dive.md |
| kubelet | The node agent that watches the apiserver for Pods assigned to its node, drives the container runtime via CRI to match the PodSpec, and runs probes. | 00-foundations/05-node-components.md |
| kube-proxy | The node component that programs iptables or IPVS rules so Service virtual IPs load-balance to backing Pods. On clusters using a full-eBPF CNI (e.g. Cilium kube-proxy replacement), kube-proxy is replaced by the CNI and does not run. | 02-networking/02-services.md |
| Container runtime | The software that actually runs containers on a node (e.g. containerd), invoked by the kubelet through the CRI. | 00-foundations/05-node-components.md |
| CRI (Container Runtime Interface) | The gRPC API between the kubelet and the container runtime, decoupling Kubernetes from any specific runtime. | 00-foundations/05-node-components.md |
| containerd | A widely-used CRI-compatible container runtime; the runtime inside kind/k3d nodes in this guide. | 00-foundations/05-node-components.md |
| pause container (sandbox) | The tiny per-Pod container that holds the Pod's shared network/IPC namespaces so app containers can join them. | 00-foundations/05-node-components.md |
| Object | A persisted entity in the API with apiVersion, kind, metadata, spec (desired), and usually status (observed). | 00-foundations/06-declarative-api-model.md |
| GVK (Group/Version/Kind) | apiVersion + kind; names the type and routes a request to the controller and storage that own it. | 00-foundations/06-declarative-api-model.md |
| GVR (Group/Version/Resource) | The lowercase plural REST form of a GVK (e.g. apps/v1/deployments) used in API paths and RBAC rules. | appendix/C-yaml-and-api-conventions.md |
| spec vs status | The universal divide: humans/controllers write spec (desired); the owning controller/kubelet writes status (observed). | 00-foundations/06-declarative-api-model.md |
| Declarative model | You describe desired state in objects and controllers continuously make reality match — you assert the destination, not the steps. | 00-foundations/06-declarative-api-model.md |
| controller | A control loop that watches a resource's desired state and continuously acts to drive actual state toward it. | 00-foundations/06-declarative-api-model.md |
| reconciliation | The core operating principle: observe actual, diff against desired, act to close the gap — repeated forever. | 00-foundations/06-declarative-api-model.md |
| level-triggered | Reacting to the current state (re-derived every loop) rather than to individual events; why Kubernetes self-heals after missed events. | 00-foundations/06-declarative-api-model.md |
| resourceVersion | An opaque per-object value reflecting etcd's revision at last write; the basis of optimistic concurrency and watch resumption. | 00-foundations/06-declarative-api-model.md |
| Optimistic concurrency | Updates carry the resourceVersion they read; the apiserver commits only if it is still current, else returns Conflict (409). | 00-foundations/06-declarative-api-model.md |
| Watch | A streaming API that delivers object change events from a resourceVersion; how controllers and informers stay current. | 00-foundations/04-control-plane-deep-dive.md |
| kubectl apply (3-way merge) | Declarative update that merges your manifest, the live object, and the last-applied config so it changes only fields you manage. | 00-foundations/06-declarative-api-model.md |
| Server-Side Apply (SSA) | Apply where the apiserver tracks per-field ownership in managedFields and reports conflicts between managers. | 00-foundations/06-declarative-api-model.md |
| managedFields / field manager | The metadata.managedFields record of which manager owns each field; the mechanism that makes co-ownership (e.g. Git + HPA) safe. | appendix/C-yaml-and-api-conventions.md |
| kubectl | The official CLI that talks to kube-apiserver to create, inspect, update, and delete resources. | 00-foundations/07-local-cluster-setup.md |
| kubeconfig | The file (~/.kube/config or $KUBECONFIG) listing clusters, users, and contexts, with a current-context pointer. | 00-foundations/07-local-cluster-setup.md |
| Context | A named (cluster + user + default namespace) tuple in kubeconfig; switching context switches what kubectl targets. | 00-foundations/07-local-cluster-setup.md |
| kind / k3d | Tools that run a full Kubernetes cluster locally inside Docker containers (kind = "Kubernetes IN Docker"; k3d = k3s in Docker) — used for every hands-on. | 00-foundations/07-local-cluster-setup.md |
| OCI image | A standardized container image format (layers + config); what a registry stores and the kubelet pulls. | 00-foundations/02-containers-and-images.md |
| Image digest | The content-addressed sha256: identifier of an image; pinning by digest (not a mutable tag) makes deploys reproducible. | 00-foundations/02-containers-and-images.md |
| distroless image | A minimal image with no shell/package manager (e.g. gcr.io/distroless/static:nonroot); small attack surface; debug with kubectl debug, not exec sh. | 00-foundations/02-containers-and-images.md |
| Namespace | A virtual cluster scope for grouping/isolating resources and the unit for quotas, RBAC, and (with policy) network isolation. Introduced where the bookstore namespace is created (01-core-workloads/03); multi-tenant depth in 08-day-2-operations/04. | 01-core-workloads/03-resources-and-qos.md |
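
The controller, reconciliation, and level-triggered entries above describe one loop: observe actual state, diff it against desired state, act to close the gap. A minimal sketch of that loop follows, assuming a toy in-memory cluster; `ToyCluster` and `reconcile` are hypothetical names for illustration, not Kubernetes APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for real cluster state: just a list of Pod names."""
    pods: list[str] = field(default_factory=list)
    _next_id: int = 0

    def create_pod(self) -> None:
        self.pods.append(f"pod-{self._next_id}")
        self._next_id += 1

    def delete_pod(self) -> None:
        self.pods.pop()


def reconcile(desired_replicas: int, cluster: ToyCluster) -> int:
    """One pass of observe -> diff -> act. Returns the number of actions taken."""
    actual = len(cluster.pods)            # observe current state, not events
    diff = desired_replicas - actual      # diff against desired state
    for _ in range(abs(diff)):            # act to close the gap
        if diff > 0:
            cluster.create_pod()
        else:
            cluster.delete_pod()
    return abs(diff)
```

Because each pass re-derives the gap from current state, a Pod that disappears between passes is replaced on the next pass even if no deletion event was ever seen. That is the self-healing property the level-triggered entry refers to.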
Workloads
| Term | Definition | Covered in |
|---|---|---|
| Pod | The smallest deployable unit: one or more containers sharing a network namespace (one IP), IPC, and volumes, always co-scheduled on one node. | 01-core-workloads/01-pods.md |
| initContainer | A container that runs to completion before app containers start; used for setup/wait-for-dependency steps. | 01-core-workloads/01-pods.md |
| sidecar | A helper container co-located in a Pod (logging/proxy/sync). From v1.29 onward, a native sidecar is an initContainers[] entry with restartPolicy: Always; it starts before and runs alongside the app containers without blocking Pod startup or Job completion. | 01-core-workloads/01-pods.md |
| Adapter / Ambassador | Structural multi-container patterns: an adapter normalizes a container's output; an ambassador proxies its outbound connections. | 01-core-workloads/01-pods.md |
| liveness probe | A periodic check; if it fails, the kubelet restarts the container (recovers a hung process). | 01-core-workloads/02-health-and-lifecycle.md |
| readiness probe | A check that gates Service traffic; a not-Ready Pod is removed from its Service's EndpointSlice (no restart). | 01-core-workloads/02-health-and-lifecycle.md |
| startup probe | A probe that disables liveness/readiness until the app has started, for slow-starting containers. | 01-core-workloads/02-health-and-lifecycle.md |
| Pod lifecycle / phase | Pending → Running → Succeeded/Failed; plus container states and the Ready condition the kubelet writes. | 01-core-workloads/02-health-and-lifecycle.md |
| preStop hook | A container lifecycle hook run before SIGTERM on termination; commonly a short sleep to drain in-flight connections. | 01-core-workloads/02-health-and-lifecycle.md |
| Graceful termination | The shutdown sequence: removed from Endpoints, preStop, SIGTERM, then SIGKILL after terminationGracePeriodSeconds. | 01-core-workloads/02-health-and-lifecycle.md |
| Resource request | The amount of CPU/memory a container is guaranteed and scheduled against (reserved on a node). | 01-core-workloads/03-resources-and-qos.md |
| Resource limit | The maximum CPU/memory a container may use; exceeding memory → OOMKilled, exceeding CPU → throttled. | 01-core-workloads/03-resources-and-qos.md |
| QoS class | Guaranteed / Burstable / BestEffort, derived from requests vs limits; drives eviction order under node pressure. | 01-core-workloads/03-resources-and-qos.md |
| OOMKilled | A container terminated (exit 137) by the kernel OOM killer for exceeding its memory limit / node memory pressure. | 01-core-workloads/03-resources-and-qos.md |
| LimitRange | A namespace policy setting default/min/max requests and limits for containers that omit them. | 01-core-workloads/03-resources-and-qos.md |
| ResourceQuota | A namespace cap on aggregate resource consumption and object counts (CPU/memory, pods, PVCs, …). | 08-day-2-operations/04-multi-tenancy-and-namespaces.md |
| ReplicaSet | A controller ensuring a specified number of identical Pod replicas; normally owned by a Deployment. | 01-core-workloads/04-replicasets-and-deployments.md |
| Deployment | A workload resource managing stateless replicas via ReplicaSets, providing rolling updates, rollback, and revision history. | 01-core-workloads/04-replicasets-and-deployments.md |
| Rolling update | The default Deployment strategy: incrementally replace old Pods with new ones bounded by maxSurge/maxUnavailable. | 01-core-workloads/04-replicasets-and-deployments.md |
| Revision / rollback | A Deployment's recorded ReplicaSet history; kubectl rollout undo re-applies an earlier revision. | 01-core-workloads/04-replicasets-and-deployments.md |
| StatefulSet | A workload for stateful apps: stable per-Pod network identity and ordinal, ordered rollout, and a per-Pod PVC via volumeClaimTemplates. | 01-core-workloads/05-statefulsets.md |
| Headless Service | A Service with clusterIP: None that returns Pod IPs (and per-Pod DNS) directly; required by StatefulSets. | 01-core-workloads/05-statefulsets.md |
| volumeClaimTemplates | A StatefulSet field that provisions a dedicated, stable PVC per Pod ordinal. | 01-core-workloads/05-statefulsets.md |
| DaemonSet | Ensures a copy of a Pod runs on every (or a selected subset of) node — for node-level agents (log shippers, exporters, CNI). | 01-core-workloads/06-daemonsets.md |
| Job | A workload that runs Pods to successful completion (with completions/parallelism/backoff), then stops. | 01-core-workloads/07-jobs-and-cronjobs.md |
| CronJob | A controller that creates Jobs on a repeating cron schedule, with concurrency and history policies. | 01-core-workloads/07-jobs-and-cronjobs.md |
| ttlSecondsAfterFinished | A Job field that auto-deletes the Job (and its Pods) a set time after it finishes. | 01-core-workloads/07-jobs-and-cronjobs.md |
| Recreate strategy | A Deployment strategy that terminates all old Pods before creating new ones (brief downtime; needed when versions can't coexist). | 01-core-workloads/08-deployment-strategies.md |
| Blue-green deployment | Run two full environments and switch traffic atomically from old (blue) to new (green). | 01-core-workloads/08-deployment-strategies.md |
| Canary deployment | Shift a small fraction of traffic to a new version, observe, then promote or roll back. | 01-core-workloads/08-deployment-strategies.md |
| Singleton service / leader election | Ensuring exactly one active instance (e.g. a controller) via a Lease-based leader election. | 08-day-2-operations/05-operators-and-crds.md |
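
The QoS class entry says the class is derived from requests versus limits. The sketch below spells out that derivation for CPU and memory under simplifying assumptions: containers are plain dicts, quantities are compared as strings (so "1" and "1000m" would wrongly differ), and `qos_class` is a hypothetical helper rather than a Kubernetes API.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' cpu/memory requests and limits.

    Each container is a dict such as
    {"requests": {"cpu": "100m"}, "limits": {"cpu": "100m", "memory": "64Mi"}}.
    """
    resources = ("cpu", "memory")
    any_set = False          # does any container set any request or limit?
    all_guaranteed = True    # does every container have requests == limits?
    for container in containers:
        limits = container.get("limits", {})
        # An omitted request defaults to the limit when a limit is set.
        requests = {**limits, **container.get("requests", {})}
        for resource in resources:
            if resource in requests or resource in limits:
                any_set = True
            if resource not in limits or requests.get(resource) != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

A Pod is Guaranteed only when every container sets both CPU and memory limits with matching requests; one container without them moves the whole Pod to Burstable.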
Networking
| Term | Definition | Covered in |
|---|---|---|
| Kubernetes networking model | Every Pod gets its own IP and all Pods can reach each other without NAT; the contract a CNI plugin implements. | 02-networking/01-networking-model.md |
| CNI (Container Network Interface) | The plugin API that wires Pod network namespaces and assigns Pod IPs (Calico, Cilium, kindnet, …). | 02-networking/01-networking-model.md |
| Pod IP / Pod CIDR | The per-Pod IP and the cluster's Pod address range, allocated by the CNI/IPAM. | 02-networking/01-networking-model.md |
| Service | A stable virtual endpoint (name + ClusterIP) that load-balances to a label-selected, dynamic set of Pods. | 02-networking/02-services.md |
| ClusterIP | The default Service type: a stable in-cluster virtual IP, not reachable from outside the cluster. | 02-networking/02-services.md |
| NodePort | A Service type that also exposes the Service on a static port on every node. | 02-networking/02-services.md |
| LoadBalancer | A Service type that provisions an external cloud load balancer (no-op on bare local clusters). | 02-networking/02-services.md |
| ExternalName | A Service that maps a name to an external DNS CNAME, with no proxying. | 02-networking/03-dns-and-discovery.md |
| EndpointSlice | The scalable object listing a Service's ready backend Pod IPs/ports; replaced the legacy Endpoints object. | 02-networking/02-services.md |
| Endpoints | The legacy per-Service object listing backend addresses; superseded by EndpointSlice. | 02-networking/02-services.md |
| Headless Service (discovery) | clusterIP: None; DNS returns the Pod IPs directly for client-side load balancing / stable identities. | 02-networking/03-dns-and-discovery.md |
| CoreDNS | The cluster DNS server (a Deployment in kube-system) that resolves Service/Pod names to ClusterIPs/Pod IPs. | 02-networking/03-dns-and-discovery.md |
| Service FQDN | <SVC>.<NS>.svc.cluster.local; the fully-qualified name CoreDNS resolves for a Service. | 02-networking/03-dns-and-discovery.md |
| ndots / search domains | resolv.conf settings that cause short names to be tried with appended search domains (a classic latency footgun). | 02-networking/03-dns-and-discovery.md |
| Ingress | An API object defining HTTP/HTTPS host/path routing from outside the cluster to Services, realized by an ingress controller. | 02-networking/04-ingress.md |
| Ingress controller | The component (e.g. ingress-nginx) that watches Ingress objects and programs an actual L7 proxy. | 02-networking/04-ingress.md |
| IngressClass | Selects which ingress controller implements a given Ingress object. | 02-networking/04-ingress.md |
| TLS termination | Decrypting HTTPS at the edge (Ingress/Gateway) using a certificate from a TLS Secret. | 02-networking/04-ingress.md |
| Gateway API | The successor to Ingress: role-oriented CRDs (GatewayClass, Gateway, HTTPRoute) for richer, portable L4/L7 routing. (Also see 13 ch.07 for the v2 edge: Istio Gateway + Coraza WAF + per-tenant rate limiting via Envoy.) | 02-networking/05-gateway-api.md |
| GatewayClass / Gateway / HTTPRoute | Gateway API kinds: the implementation class, a configured listener/data-plane, and the route rules attached to it. | 02-networking/05-gateway-api.md |
| NetworkPolicy | A namespaced firewall: label-selected ingress/egress allow rules; a Pod selected by any policy defaults to deny for that direction. | 02-networking/06-network-policies.md |
| default-deny | A NetworkPolicy selecting all Pods with no rules, so only explicitly-allowed traffic flows (zero-trust baseline). | 02-networking/06-network-policies.md |
| Network segmentation | The pattern of partitioning Pod-to-Pod traffic with NetworkPolicies so a compromise can't move laterally. | 02-networking/06-network-policies.md |
| Service mesh | An L7 networking layer (Istio/Linkerd) adding mTLS, traffic shaping, and telemetry via sidecars/proxies; conceptual-only in this guide. | 02-networking/02-services.md |
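
The ndots / search domains entry is easier to see with the lookup order written out. This is a sketch of the usual resolver behaviour, assuming ndots of 5 and a Pod in the bookstore namespace; `dns_candidates` is a hypothetical helper, and real resolvers add details (timeouts, A/AAAA pairs) that are left out here.

```python
def dns_candidates(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the fully-qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]                        # trailing dot: already absolute
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded         # enough dots: try as-is first
    return expanded + [absolute]             # too few dots: search list first


# Assumed search list for a Pod in the "bookstore" namespace.
SEARCH = ["bookstore.svc.cluster.local", "svc.cluster.local", "cluster.local"]
```

With these settings, an external name such as api.example.com has only two dots, so it is tried against all three search domains before the absolute name. Those extra failed lookups are the latency cost; a trailing dot avoids them.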
Configuration & storage
| Term | Definition | Covered in |
|---|---|---|
| ConfigMap | An API object holding non-confidential key/value config, consumable as env vars, envFrom, or mounted files. | 03-config-and-storage/01-configmaps.md |
| envFrom | A field that injects all keys of a ConfigMap/Secret as environment variables into a container. | 03-config-and-storage/01-configmaps.md |
| Immutable ConfigMap/Secret | immutable: true freezes the data, improving performance and preventing accidental edits (replace, don't mutate). | 03-config-and-storage/01-configmaps.md |
| Secret | Like a ConfigMap but for sensitive data; values are base64-encoded (not encrypted by default) — RBAC and encryption-at-rest are separate. | 03-config-and-storage/02-secrets.md |
| Encryption at rest | Apiserver-level encryption of Secret data in etcd, configured via an EncryptionConfiguration (optionally a KMS provider). | 05-security/04-secrets-and-cluster-hardening.md |
| External Secrets Operator | An operator that syncs secrets from an external store (Vault/cloud SM) into Kubernetes Secrets. | 05-security/04-secrets-and-cluster-hardening.md |
| Sealed Secrets | A controller that decrypts a Git-safe SealedSecret into a real Secret in-cluster. | 05-security/04-secrets-and-cluster-hardening.md |
| SOPS / KSOPS | SOPS encrypts secret values in Git; KSOPS is the Kustomize plugin that decrypts a SOPS secretGenerator at build time. | 07-delivery/02-packaging-kustomize.md |
| Downward API | A mechanism exposing Pod/container metadata (name, namespace, labels, resource limits) to the container via env or files. | 03-config-and-storage/03-volumes.md |
| Volume | Storage mounted into a Pod's containers; lifetime and semantics depend on the volume type. | 03-config-and-storage/03-volumes.md |
| emptyDir | A scratch volume created empty when a Pod is assigned to a node and deleted with the Pod (the canonical writable path under a read-only root FS). | 03-config-and-storage/03-volumes.md |
| hostPath | A volume mounting a path from the node's filesystem; powerful and risky, and forbidden by PSA restricted/baseline. | 03-config-and-storage/03-volumes.md |
| projected volume | A volume combining several sources (ServiceAccount token, ConfigMap, Secret, downwardAPI) into one directory. | 03-config-and-storage/03-volumes.md |
| PersistentVolume (PV) | A cluster-scoped piece of provisioned storage with a lifecycle independent of any Pod. | 03-config-and-storage/04-persistent-storage.md |
| PersistentVolumeClaim (PVC) | A namespaced request for storage (size, access mode, StorageClass) that binds to a PV, giving a Pod durable storage. | 03-config-and-storage/04-persistent-storage.md |
| StorageClass | A named storage "tier" enabling dynamic PV provisioning via a CSI driver, with parameters and a reclaim/binding policy. | 03-config-and-storage/04-persistent-storage.md |
| CSI (Container Storage Interface) | The plugin API for storage drivers that provision/attach/mount volumes (cloud disks, local-path, …). | 03-config-and-storage/04-persistent-storage.md |
| Access mode | A PV/PVC capability: ReadWriteOnce (one node), ReadWriteOncePod, ReadOnlyMany, ReadWriteMany. | 03-config-and-storage/04-persistent-storage.md |
| WaitForFirstConsumer | A StorageClass volumeBindingMode that delays PV binding until a Pod is scheduled, so storage lands on the right topology. | 03-config-and-storage/04-persistent-storage.md |
| Reclaim policy | What happens to a PV when its PVC is deleted: Delete (free the disk) or Retain (keep for manual recovery). | 03-config-and-storage/04-persistent-storage.md |
| VolumeSnapshot / VolumeSnapshotClass | A point-in-time copy of a PVC via a CSI snapshotter, and the class/driver that creates it. | 03-config-and-storage/05-stateful-data-patterns.md |
| Stateful data patterns | Operational practices for data in Kubernetes: backups, migrations as Jobs, single-writer access, operators for databases. | 03-config-and-storage/05-stateful-data-patterns.md |
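| fsGroup / fsGroupChangePolicy | Pod securityContext fields: fsGroup sets the group ownership of a volume so a non-root process can write to it; fsGroupChangePolicy: OnRootMismatch skips a slow recursive chown when the volume root already has the right ownership. | 05-security/02-pod-security.md |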
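
The Secret entry notes that values are base64-encoded, not encrypted. The short sketch below makes the point concrete: decoding needs no key at all. The function names are hypothetical helpers built on Python's standard `base64` module.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears under a Secret's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key or password is involved."""
    return base64.b64decode(encoded).decode("utf-8")
```

Anyone who can read the Secret object can recover the plaintext this way, which is why RBAC and encryption at rest are treated as separate controls.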
Scheduling
| Term | Definition | Covered in |
|---|---|---|
| Scheduling (filter & score) | The scheduler's two phases: discard infeasible nodes (predicates), then rank the rest (priorities) and bind the best. | 04-scheduling/01-scheduler-and-nodes.md |
| nodeSelector | The simplest node constraint: schedule only onto nodes carrying the given labels. | 04-scheduling/02-affinity-taints-topology.md |
| Node affinity | Expressive node constraints (required/preferred) over node labels. | 04-scheduling/02-affinity-taints-topology.md |
| Pod affinity / anti-affinity | Co-locate (affinity) or spread apart (anti-affinity) Pods relative to other Pods by label and topology key. | 04-scheduling/02-affinity-taints-topology.md |
| Taint | A node mark (key=value:effect) that repels Pods unless they tolerate it. | 04-scheduling/02-affinity-taints-topology.md |
| Toleration | A Pod field allowing it to schedule onto nodes with a matching taint. | 04-scheduling/02-affinity-taints-topology.md |
| Taint effect | NoSchedule, PreferNoSchedule, or NoExecute (also evicts already-running, non-tolerating Pods). | 04-scheduling/02-affinity-taints-topology.md |
| Topology spread constraints | Rules that even out Pods across a topology domain (zone/node) within a maxSkew. | 04-scheduling/02-affinity-taints-topology.md |
| PriorityClass | A named priority value for Pods; higher-priority pending Pods can preempt lower-priority ones. | 04-scheduling/03-priority-and-preemption.md |
| Preemption | The scheduler evicting lower-priority Pods to make room for a pending higher-priority Pod that otherwise can't fit. | 04-scheduling/03-priority-and-preemption.md |
| Eviction | Removal of a running Pod — by the kubelet under node pressure, by preemption, or via the Eviction API (respects PDBs). | 06-production-readiness/05-reliability-and-disruptions.md |
| Binding | The act of assigning a Pod to a node (writing spec.nodeName), which the kubelet then actuates. | 04-scheduling/01-scheduler-and-nodes.md |
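
The filter-and-score entry can be illustrated with a toy scheduler. This is only a sketch under stated assumptions: it considers CPU and a nodeSelector, and scores by the fraction of CPU left free. The `Node` type and `schedule` function are hypothetical, and the real scheduler runs many more filter and score plugins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int     # total schedulable CPU, in millicores
    requested_cpu_m: int       # CPU already requested by Pods on the node
    labels: dict[str, str]


def schedule(pod_cpu_m: int, node_selector: dict[str, str],
             nodes: list[Node]) -> Optional[str]:
    """Return the chosen node name, or None if no node is feasible."""
    # Filter: keep nodes that match the selector and have room for the request.
    feasible = [
        n for n in nodes
        if all(n.labels.get(k) == v for k, v in node_selector.items())
        and n.allocatable_cpu_m - n.requested_cpu_m >= pod_cpu_m
    ]
    if not feasible:
        return None            # the Pod would stay Pending

    # Score: prefer the node left with the largest fraction of free CPU.
    def score(n: Node) -> float:
        return (n.allocatable_cpu_m - n.requested_cpu_m - pod_cpu_m) / n.allocatable_cpu_m

    return max(feasible, key=score).name
```

The returned name corresponds to the Binding entry: the real scheduler records its choice by setting spec.nodeName on the Pod.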
Security
| Term | Definition | Covered in |
|---|---|---|
| Authentication (authN) | Proving identity to the apiserver via a trusted authenticator (client cert, SA token, OIDC); Kubernetes stores no user records. | 05-security/01-authn-authz-rbac.md |
| Authorization (authZ) | Deciding if an authenticated identity may perform a request; authorizers (Node, RBAC, …) are OR'd. | 05-security/01-authn-authz-rbac.md |
| Admission control | Post-authZ plugins that mutate then validate a request before it is persisted (PSA, ResourceQuota, webhooks). | 05-security/01-authn-authz-rbac.md |
| RBAC (Role-Based Access Control) | Roles/ClusterRoles grant verbs on resources; RoleBindings/ClusterRoleBindings bind them to subjects. Purely additive (no deny). | 05-security/01-authn-authz-rbac.md |
| Role / ClusterRole | A namespaced (Role) or cluster-scoped/reusable (ClusterRole) set of permission rules. | 05-security/01-authn-authz-rbac.md |
| RoleBinding / ClusterRoleBinding | Binds a (Cluster)Role to subjects within one namespace, or cluster-wide, respectively. | 05-security/01-authn-authz-rbac.md |
| ServiceAccount (SA) | The in-cluster identity for workloads; every Pod runs as one (use a dedicated SA, never default). | 05-security/01-authn-authz-rbac.md |
| Bound / projected SA token | A short-lived, audience-scoped JWT bound to the Pod, issued via the TokenRequest API and auto-rotated (replaces legacy forever-tokens). | 05-security/01-authn-authz-rbac.md |
| automountServiceAccountToken: false | Stops the SA token from being mounted into a Pod that never calls the API (least privilege). | 05-security/01-authn-authz-rbac.md |
| system:masters | A super-group that bypasses RBAC entirely (unconditional cluster-admin); the kubeadm admin cert carries it, so treat it as break-glass. | 05-security/01-authn-authz-rbac.md |
| auth can-i / SubjectAccessReview | The API (and CLI) that asks the real authorizer whether an identity may do something; the authoritative audit tool. | 05-security/01-authn-authz-rbac.md |
| Impersonation (--as) | An RBAC-gated power (impersonate verb) to act as another user/group/SA, used to test policy. | 05-security/01-authn-authz-rbac.md |
| OIDC | OpenID Connect: how humans authenticate via an external identity provider; claims map to username/groups. (Also see 13 ch.04 for the Keycloak code+PKCE flow + JWKS + Istio JWT validation on the v2 platform; 10 ch.03 for IRSA's OIDC-federation form.) | 05-security/01-authn-authz-rbac.md |
| securityContext | Pod/container fields that drop privileges: runAsNonRoot, runAsUser, capabilities, readOnlyRootFilesystem, seccomp, … | 05-security/02-pod-security.md |
| Linux capabilities | Fine-grained slices of root's power; hardened Pods drop: ["ALL"] and add back only what's proven necessary. | 05-security/02-pod-security.md |
| allowPrivilegeEscalation: false | Sets no_new_privs so a child process can't gain more privilege than its parent (neutralizes setuid). | 05-security/02-pod-security.md |
| privileged container | A container with nearly all capabilities and device access (≈ root on the node); forbidden by baseline/restricted. | 05-security/02-pod-security.md |
| readOnlyRootFilesystem | Mounts the container root FS read-only (write only via explicit volumes); strong breakout mitigation, not a PSA requirement. | 05-security/02-pod-security.md |
| seccomp / RuntimeDefault | A syscall filter; restricted requires seccompProfile.type set to RuntimeDefault (or Localhost), never Unconfined. | 05-security/02-pod-security.md |
| AppArmor | A Linux Security Module confining a process to a profile; now a first-class field (appArmorProfile, added in v1.30 and GA in v1.31). | 05-security/02-pod-security.md |
| SELinux | The RHEL-family LSM labeling processes/files; configured via securityContext.seLinuxOptions. | 05-security/02-pod-security.md |
| Pod Security Admission (PSA) | The built-in, non-mutating validating admission controller enforcing a Pod Security Standard via namespace labels. | 05-security/02-pod-security.md |
| Pod Security Standards | The three fixed policy levels: privileged, baseline, restricted. | 05-security/02-pod-security.md |
| PSA modes / -version | enforce (reject), audit (log), warn (client warning); the -version label pins the ruleset to a Kubernetes minor. | 05-security/02-pod-security.md |
| PodSecurityPolicy (PSP) | The removed (v1.25) predecessor to PSA; any guide still recommending it is describing a dead API. | 05-security/02-pod-security.md |
| Supply chain security | Trusting what you run: signed images, SBOMs, vulnerability scanning, and admission policy gating untrusted artifacts. | 05-security/03-supply-chain.md |
| Trivy | An open-source scanner for image/filesystem/IaC vulnerabilities and misconfigurations, run in CI. | 05-security/03-supply-chain.md |
| Cosign | A Sigstore tool to sign and verify container images (and other OCI artifacts), enabling signature-based admission. | 05-security/03-supply-chain.md |
| SBOM (Software Bill of Materials) | A machine-readable inventory of an artifact's components/dependencies (SPDX/CycloneDX) for provenance and CVE triage. | 05-security/03-supply-chain.md |
| Kyverno | A Kubernetes-native policy engine (validate/mutate/generate) used as a validating admission webhook. | 05-security/03-supply-chain.md |
| Admission webhook (mutating/validating) | An external HTTP callback the apiserver invokes during admission to mutate or validate objects. | 05-security/01-authn-authz-rbac.md |
| ValidatingAdmissionPolicy | In-tree, CEL-based validating admission policy — a webhook-free alternative for many policy checks. | 05-security/03-supply-chain.md |
| Audit logging | The apiserver's structured record of every request (who/what/verdict); how RBAC/PSA decisions are reviewed and alarmed on. | 05-security/04-secrets-and-cluster-hardening.md |
| CIS Benchmark | A consensus security configuration baseline for Kubernetes used to harden the cluster. | 05-security/04-secrets-and-cluster-hardening.md |
| Cluster hardening | Reducing cluster attack surface: encryption at rest, audit, restricted RBAC, network policy, no system:masters sprawl. | 05-security/04-secrets-and-cluster-hardening.md |
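
The RBAC entry describes the model as purely additive. A minimal sketch of what that means for evaluation: a request is allowed if any single rule covers it, and no rule can take access away. The `Rule` type and `is_allowed` function are hypothetical; the sketch ignores resourceNames, subresources, non-resource URLs, and how bindings select rules for a subject.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    api_groups: tuple[str, ...]
    resources: tuple[str, ...]
    verbs: tuple[str, ...]


def _matches(allowed: tuple[str, ...], value: str) -> bool:
    """A field matches on an exact entry or on the "*" wildcard."""
    return "*" in allowed or value in allowed


def is_allowed(rules: list[Rule], api_group: str, resource: str, verb: str) -> bool:
    """Allow if any one rule covers the request; nothing can subtract access."""
    return any(
        _matches(rule.api_groups, api_group)
        and _matches(rule.resources, resource)
        and _matches(rule.verbs, verb)
        for rule in rules
    )
```

With no matching rule the answer is simply "not allowed by RBAC", which is why an empty rule set denies everything and why adding a rule can only widen access.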
Production readiness: observability, scaling, reliability
| Term | Definition | Covered in |
|---|---|---|
| Observability | The ability to understand system state from its outputs — the three signals: metrics, logs, traces. | 06-production-readiness/01-observability-metrics.md |
| Prometheus | A pull-based time-series metrics system that scrapes /metrics endpoints and stores samples for querying/alerting. | 06-production-readiness/01-observability-metrics.md |
| PromQL | Prometheus's query language for selecting and aggregating time series (rates, quantiles, alerts). | 06-production-readiness/01-observability-metrics.md |
| ServiceMonitor | A Prometheus Operator CRD declaring which Services/endpoints Prometheus should scrape. | 06-production-readiness/01-observability-metrics.md |
| PrometheusRule | A Prometheus Operator CRD defining recording and alerting rules. | 06-production-readiness/01-observability-metrics.md |
| Prometheus Operator | The operator that manages Prometheus/Alertmanager and consumes ServiceMonitor/PrometheusRule CRDs. | 06-production-readiness/01-observability-metrics.md |
| kube-prometheus-stack | The Helm chart bundling Prometheus Operator, Prometheus, Alertmanager, and Grafana. | 06-production-readiness/01-observability-metrics.md |
| Grafana | A dashboarding/visualization tool that queries Prometheus (and other sources). | 06-production-readiness/01-observability-metrics.md |
| metrics-server | A lightweight cluster aggregator of CPU/memory for kubectl top and the HPA's resource metrics. | 06-production-readiness/04-autoscaling.md |
| Logging architecture | Containers write to stdout/stderr; a node DaemonSet ships logs to a backend (no app-side log files). | 06-production-readiness/02-logging.md |
| Loki | A horizontally-scalable log aggregation system that indexes labels, queried with LogQL. | 06-production-readiness/02-logging.md |
| Tracing / distributed tracing | Following one request across services as a trace of timed spans, to find latency and failures. | 06-production-readiness/03-tracing.md |
| OpenTelemetry (OTel) | The vendor-neutral standard/SDKs and Collector for emitting and exporting traces/metrics/logs. | 06-production-readiness/03-tracing.md |
| Span / trace context | A span is one timed operation; trace context (e.g. W3C traceparent) is propagated so spans join into one trace. | 06-production-readiness/03-tracing.md |
| Autoscaling | Automatically adjusting capacity to demand: pod count (HPA/KEDA), pod size (VPA), or node count (Cluster Autoscaler). | 06-production-readiness/04-autoscaling.md |
| HPA (HorizontalPodAutoscaler) | Scales a workload's replica count on observed metrics (CPU/memory or custom); autoscaling/v2. | 06-production-readiness/04-autoscaling.md |
| VPA (VerticalPodAutoscaler) | Recommends/sets right-sized requests/limits for a workload's containers. | 06-production-readiness/04-autoscaling.md |
| KEDA | Event-driven autoscaling via a ScaledObject on external sources (queue depth, etc.). KEDA creates & manages an HPA for scaling above zero, and takes direct control for scale-to-zero (below the HPA minimum of 1). | 06-production-readiness/04-autoscaling.md |
| ScaledObject / TriggerAuthentication | KEDA CRDs: the scaling target+triggers, and the credentials a trigger uses. | 06-production-readiness/04-autoscaling.md |
| Cluster Autoscaler / Karpenter | Node-level autoscalers that add/remove nodes when Pods are unschedulable or nodes are underutilized. | 06-production-readiness/06-capacity-and-cost.md |
| PodDisruptionBudget (PDB) | A floor on available replicas during voluntary disruptions (drains/upgrades); the Eviction API respects it. | 06-production-readiness/05-reliability-and-disruptions.md |
| Voluntary vs involuntary disruption | Operator-initiated (drain, rollout — PDB applies) vs unavoidable (node crash, OOM — PDB does not apply). | 06-production-readiness/05-reliability-and-disruptions.md |
| SLI / SLO / error budget | A measured indicator, a target for it, and the allowed amount of failure before you stop shipping risky changes. | 06-production-readiness/05-reliability-and-disruptions.md |
| Capacity planning | Sizing requests/limits and node pools so workloads fit with headroom, balancing reliability against cost. | 06-production-readiness/06-capacity-and-cost.md |
| Cost allocation / OpenCost | Attributing cluster spend to namespaces/workloads via labels; OpenCost is the CNCF cost model. | 06-production-readiness/06-capacity-and-cost.md |
| Bin packing | Scheduling Pods densely onto fewer nodes (via requests) to reduce idle, traded against blast radius. | 06-production-readiness/06-capacity-and-cost.md |
| Spot / preemptible nodes | Cheap, reclaimable cloud capacity used for fault-tolerant workloads (paired with PDBs and disruption handling). | 06-production-readiness/06-capacity-and-cost.md |
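As a worked illustration of the HPA entry above, the replica calculation documented for autoscaling/v2 can be sketched in a few lines. This is a minimal model rather than the controller's code: the function name and the min/max defaults are assumptions made for the example, and the 10% tolerance mirrors the upstream default.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Model of the documented rule: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
print(desired_replicas(4, current_metric=63, target_metric=60))  # 4, inside tolerance
```

The second call shows why small metric wobble does not cause replica flapping: a ratio of 1.05 falls inside the tolerance band, so the count stays at 4.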
Delivery — packaging, CI/CD, GitOps
| Term | Definition | Covered in |
|---|---|---|
| Helm | A package manager for Kubernetes: a templated, versioned chart is rendered with values and installed as a tracked release. | 07-delivery/01-packaging-helm.md |
| Chart | A directory of templated manifests + values.yaml + Chart.yaml (metadata/version); inert until rendered. | 07-delivery/01-packaging-helm.md |
| Values / values.schema.json | A chart's documented tuning surface (precedence: --set > -f > defaults), optionally JSON-Schema validated. | 07-delivery/01-packaging-helm.md |
| Release / revision | A named install of a chart, stored as an in-cluster Secret; each helm upgrade is a new revision (helm rollback). | 07-delivery/01-packaging-helm.md |
| Helm hook | A chart object annotated to run at a lifecycle event (e.g. post-install, post-upgrade for the DB-migrate Job). | 07-delivery/01-packaging-helm.md |
| _helpers.tpl / named template | Reusable template snippets included across a chart so repeated YAML (labels, security context, DSN) can't drift. | 07-delivery/01-packaging-helm.md |
| Library vs application chart | type: application installs into a cluster; type: library ships only helpers for other charts to include. | 07-delivery/01-packaging-helm.md |
| Tiller | The removed Helm 2 in-cluster server (a cluster-admin backdoor); Helm 3 is client-only. | 07-delivery/01-packaging-helm.md |
| Kustomize | Template-free customization: a base of plain manifests plus typed, declarative overlays/patches; built into kubectl. | 07-delivery/02-packaging-kustomize.md |
| Base / overlay | The deployable app (base) and a small per-environment diff (overlay) that includes it and layers transformations. | 07-delivery/02-packaging-kustomize.md |
| Component (Kustomize) | A reusable, optional kind: Component mix-in an overlay opts into (the analog of a Helm value toggle). | 07-delivery/02-packaging-kustomize.md |
| Strategic-merge vs JSON6902 patch | A field-aware partial-object merge vs a precise RFC-6902 op/path/value; both under the unified patches:. | 07-delivery/02-packaging-kustomize.md |
| commonLabels footgun | Adding labels via commonLabels mutates immutable spec.selector and wedges upgrades; use labels: with includeSelectors: false. | 07-delivery/02-packaging-kustomize.md |
| configMap/secretGenerator | Kustomize generators that synthesize a ConfigMap/Secret and append a content-hash suffix to trigger rollouts on change. | 07-delivery/02-packaging-kustomize.md |
| CI/CD pipeline | Automated build → test → scan → image push → manifest update; the path from commit to a deployable artifact. | 07-delivery/03-cicd-pipeline.md |
| In-cluster build (Kaniko/BuildKit) | Building container images inside the cluster without a Docker daemon (a CI alternative, note-only in this guide). | 07-delivery/03-cicd-pipeline.md |
| GitOps | Git is the single source of truth; a controller continuously reconciles the cluster to the repo and auto-corrects drift. | 07-delivery/04-gitops-argocd.md |
| Argo CD | A GitOps controller that renders manifests/charts/kustomizations from Git and reconciles them into clusters. | 07-delivery/04-gitops-argocd.md |
| Argo CD Application | The CRD binding a Git source (repo/path/revision) to a destination (cluster/namespace) with a sync policy. | 07-delivery/04-gitops-argocd.md |
| Argo CD AppProject | A CRD constraining which repos/clusters/namespaces/kinds a group of Applications may use (multi-tenancy guardrails). | 07-delivery/04-gitops-argocd.md |
| App-of-Apps / ApplicationSet | Patterns to manage many Applications from one (App-of-Apps) or generate them from a generator (ApplicationSet). | 07-delivery/04-gitops-argocd.md |
| Sync wave / hook (Argo) | Ordering primitives that stage a sync (e.g. run the DB-migrate Job before the app), analogous to Helm hooks. | 07-delivery/04-gitops-argocd.md |
| Drift detection | A GitOps controller noticing live state diverging from Git and reporting/reverting it. | 07-delivery/04-gitops-argocd.md |
| Progressive delivery | Automated, metric-gated rollout (canary/blue-green) that promotes or rolls back based on analysis. | 07-delivery/05-progressive-delivery.md |
| Argo Rollouts / Rollout | A controller and CRD replacing Deployment to drive canary/blue-green steps with analysis gates. | 07-delivery/05-progressive-delivery.md |
| AnalysisTemplate / AnalysisRun | Argo Rollouts CRDs defining the success metrics queried during a rollout step and one execution of them. | 07-delivery/05-progressive-delivery.md |
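The Values entry above states the precedence --set > -f > defaults. A small sketch of how layered values could resolve under that rule follows; it assumes maps merge key by key while scalars and lists are replaced wholesale, and it ignores Helm details such as null-to-delete. The function names and sample values are hypothetical.

```python
def merge_values(base, override):
    """Return base with override layered on top; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists from the higher-precedence layer win outright.
            merged[key] = value
    return merged


def effective_values(chart_defaults, values_files, set_flags):
    """Apply layers from lowest to highest precedence: defaults, each -f file, then --set."""
    result = chart_defaults
    for layer in [*values_files, set_flags]:
        result = merge_values(result, layer)
    return result


defaults = {"replicaCount": 1, "image": {"repository": "bookstore", "tag": "1.0"}}
prod_file = {"replicaCount": 3, "image": {"tag": "1.1"}}
set_flags = {"image": {"tag": "1.2"}}
print(effective_values(defaults, [prod_file], set_flags))
# {'replicaCount': 3, 'image': {'repository': 'bookstore', 'tag': '1.2'}}
```

The untouched key (image.repository) survives from the chart defaults, which is the property that lets an environment file stay a small diff.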
Day-2 operations
| Term | Definition | Covered in |
|---|---|---|
| Cluster lifecycle | Provisioning, upgrading, and decommissioning clusters and nodes over time (managed service, kubeadm, or Cluster API). | 08-day-2-operations/01-cluster-lifecycle.md |
| kubeadm | The upstream tool that bootstraps a conformant control plane and joins nodes. | 08-day-2-operations/01-cluster-lifecycle.md |
| Cluster API (CAPI) | A Kubernetes-style declarative API for provisioning and managing clusters as objects. | 08-day-2-operations/01-cluster-lifecycle.md |
| Version skew | The supported version differences between components: kubectl within ±1 minor of the apiserver, and kubelets never newer than the apiserver (they may lag it by a few minors); a mismatch causes odd errors. | 08-day-2-operations/01-cluster-lifecycle.md |
| cordon / drain / uncordon | Mark a node unschedulable, evict its Pods respecting PDBs, then return it to rotation — the safe node-maintenance sequence. | 08-day-2-operations/01-cluster-lifecycle.md |
| Backup & disaster recovery (DR) | Capturing etcd and persistent data and the rehearsed procedure to restore service after loss. | 08-day-2-operations/02-backup-and-dr.md |
| etcd snapshot | A point-in-time backup of etcd (etcdctl snapshot save); backing up etcd is backing up cluster state. | 08-day-2-operations/02-backup-and-dr.md |
| Velero | A tool that backs up/restores cluster objects and PV data (snapshots) and supports migration/DR. | 08-day-2-operations/02-backup-and-dr.md |
| RPO / RTO | Recovery Point Objective (max tolerable data loss) and Recovery Time Objective (max tolerable downtime). | 08-day-2-operations/02-backup-and-dr.md |
| Troubleshooting method | The fixed pipeline: observe → isolate → hypothesize → test → fix → verify; describe → events → logs → kubectl debug. | 08-day-2-operations/03-troubleshooting-playbook.md |
| kubectl debug / ephemeral container | Inject a tooling container into a running Pod (shares its PID/network namespaces) — the correct way to debug distroless Pods. | 08-day-2-operations/03-troubleshooting-playbook.md |
| kubectl debug --profile | Shapes the debug container's security (restricted/general/sysadmin/netadmin) so PSA admits it (GA in v1.30). | 08-day-2-operations/03-troubleshooting-playbook.md |
| CrashLoopBackOff | A container that starts then exits repeatedly, with exponential backoff; diagnose with logs --previous. | 08-day-2-operations/03-troubleshooting-playbook.md |
| ImagePullBackOff | The kubelet cannot pull the image (bad ref/tag, missing pull secret, not loaded) and is backing off. | 08-day-2-operations/03-troubleshooting-playbook.md |
| CreateContainerConfigError | A referenced ConfigMap/Secret key is missing, so the container spec can't be materialized. | 08-day-2-operations/03-troubleshooting-playbook.md |
| Multi-tenancy | Sharing a cluster across teams/apps with isolation via namespaces, RBAC, quotas, and NetworkPolicies. | 08-day-2-operations/04-multi-tenancy-and-namespaces.md |
| CRD (CustomResourceDefinition) | An extension registering a new resource type with the apiserver so custom objects are served like built-ins. | 08-day-2-operations/05-operators-and-crds.md |
| Custom Resource (CR) | An instance of a CRD-defined kind, stored and watched like any object. | 08-day-2-operations/05-operators-and-crds.md |
| Operator | A custom controller encoding operational knowledge for an app (e.g. a database), reconciling its CRs. | 08-day-2-operations/05-operators-and-crds.md |
| Operator pattern | CRD (the API) + controller (the reconcile loop) = automated day-2 operations for stateful/complex software. | 08-day-2-operations/05-operators-and-crds.md |
| Reconcile loop (controller-runtime) | The operator's Reconcile(req) function: read the CR, observe reality, act to converge, requeue. | 08-day-2-operations/05-operators-and-crds.md |
| CloudNativePG (CNPG) | A PostgreSQL operator managing HA Postgres clusters (failover, backups) via its Cluster CRD. | 08-day-2-operations/05-operators-and-crds.md |
| finalizer | A metadata.finalizers entry that blocks deletion until a controller does cleanup, then removes the finalizer. | 08-day-2-operations/05-operators-and-crds.md |
| ownerReference / garbage collection | A child's link to its parent; deleting the parent cascades GC of children (e.g. ReplicaSet → Pods). | 08-day-2-operations/05-operators-and-crds.md |
| Structural schema / OpenAPI validation | The CRD's required OpenAPI v3 schema that the apiserver uses to validate and prune custom resources. | 08-day-2-operations/05-operators-and-crds.md |
| Conversion webhook | A webhook that converts a CRD's custom resources between served API versions. | 08-day-2-operations/05-operators-and-crds.md |
| API deprecation policy | The rule that GA APIs are supported for a defined window and removed only after deprecation; pin versions and migrate deliberately. | appendix/C-yaml-and-api-conventions.md |
| alpha / beta / stable (GA) | API maturity levels: alpha (off by default, may change), beta (on, may change), stable (long-term support). | appendix/C-yaml-and-api-conventions.md |
| app.kubernetes.io/* labels | The recommended common label set (name, instance, version, component, part-of, managed-by) for consistent selection/cost/ops. | 00-foundations/06-declarative-api-model.md |
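The finalizer and reconcile-loop entries above describe a small state machine, which the following toy model walks through. It is a plain-Python illustration over a dict-shaped object, not the controller-runtime API: the finalizer name, the `reconcile` signature, and the returned status strings are all hypothetical.

```python
FINALIZER = "example.com/cleanup"  # hypothetical finalizer name


def reconcile(obj, cleanup):
    """One reconcile pass: register the finalizer while live, clean up once deletion starts."""
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])

    if meta.get("deletionTimestamp") is None:
        # Live object: make sure our finalizer is present so deletion will wait for us.
        if FINALIZER not in finalizers:
            finalizers.append(FINALIZER)
        return "live"

    # Deletion requested: do external cleanup first, then release our hold.
    if FINALIZER in finalizers:
        cleanup(obj)
        finalizers.remove(FINALIZER)

    # The object can only be removed once no finalizers remain.
    return "deletable" if not finalizers else "blocked"


cleaned = []
db = {"metadata": {"name": "orders-db"}}
print(reconcile(db, cleaned.append))  # live
db["metadata"]["deletionTimestamp"] = "2024-01-01T00:00:00Z"
print(reconcile(db, cleaned.append))  # deletable
```

If another controller's finalizer were still listed, the second pass would return "blocked", which is the usual cause of objects stuck in Terminating.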
Part 10 — Cloud & Managed Kubernetes
| Term | Definition | Covered in |
|---|---|---|
| Managed Kubernetes | A cloud-vendor offering (EKS / GKE / AKS / DOKS / OKE / Linode LKE) where the control plane (apiserver + etcd + scheduler + controllers) is run, patched, scaled, backed up, and SLA'd by the provider; the customer owns the data plane (nodes, workloads, in-cluster security, networking wiring, IAM mapping). | 10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md |
| Shared responsibility model | The dividing line between provider and customer on a managed cluster: provider = control plane and its SLA; customer = nodes, Pods, RBAC, app SLOs, and any cloud IAM glue. | 10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md |
| EKS / GKE / AKS | The three large managed Kubernetes services: Amazon Elastic Kubernetes Service, Google Kubernetes Engine, and Azure Kubernetes Service. | 10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md |
| Control-plane SLA | The provider's uptime guarantee for the apiserver / etcd (e.g. 99.95% on a regional EKS/GKE/AKS); your application SLO is separate and your responsibility. | 10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md |
| Control-plane vs node-pool upgrade | Two separate upgrade operations on managed clusters: the provider upgrades the control plane (you trigger / it auto-upgrades), and you separately roll your node pools — each respects version skew. | 10-cloud-and-managed-kubernetes/01-managed-kubernetes-model.md |
| IRSA (IAM Roles for Service Accounts) | AWS's pod-identity mechanism: a ServiceAccount is annotated with an IAM role ARN; the SA's projected OIDC token is exchanged with STS for short-lived AWS credentials. Solves "no static AWS keys in Pods". | 10-cloud-and-managed-kubernetes/03-cloud-identity.md |
| EKS Pod Identity | AWS's newer alternative to IRSA: pod-identity association handled by an agent DaemonSet, no in-cluster OIDC role-arn annotations needed. | 10-cloud-and-managed-kubernetes/03-cloud-identity.md |
| Workload Identity (GCP) | GKE's pod-identity mechanism: a Kubernetes SA is bound to a Google Service Account; pods exchange their projected SA token for a GSA token via the GKE metadata server. | 10-cloud-and-managed-kubernetes/03-cloud-identity.md |
| Azure AD Workload Identity | Azure's pod-identity mechanism: a Kubernetes SA's federated credential is mapped to an Azure AD application/managed identity, exchanged via MSAL for an Azure token. | 10-cloud-and-managed-kubernetes/03-cloud-identity.md |
| OIDC issuer URL | A cluster-published URL that signs ServiceAccount JWTs and exposes a JWKS; cloud IAM is configured to trust that issuer to enable pod-identity federation. | 10-cloud-and-managed-kubernetes/03-cloud-identity.md |
| Pod identity federation | The general pattern: the cluster's OIDC-signed projected SA token is exchanged with a cloud STS for short-lived, scoped cloud credentials — no static keys ever stored. | 10-cloud-and-managed-kubernetes/03-cloud-identity.md |
| VPC CNI (AWS) | The default EKS CNI: every Pod gets a real VPC IP assigned via ENI secondary IPs; pod IP density is bounded by per-instance ENI/IP limits unless prefix delegation is enabled. | 10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md |
| Prefix delegation | A VPC CNI feature that assigns /28 prefixes (16 IPs each) to an ENI's address slots instead of single IPs, raising per-instance Pod density on EKS. | 10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md |
| GKE CNI / GKE Dataplane V2 | GKE's CNI options — the legacy "kubenet" and the Cilium-based Dataplane V2 (eBPF, kube-proxy replacement, network policies). | 10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md |
| Azure CNI Overlay | AKS's overlay CNI: Pods get an overlay IP from a per-node CIDR (not a VNet IP), removing the VNet IP-exhaustion footgun of Azure CNI "VNet" mode. | 10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md |
| Cloud LoadBalancer / Service type: LoadBalancer | A Service that provisions a real cloud LB (NLB/ALB/internal LB on AWS, GCLB on GCP, Azure LB) via the cloud-controller-manager; a no-op on bare local clusters. | 10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md |
| AWS Load Balancer Controller (LBC) | A controller that watches Ingress (and Service) objects and provisions ALBs / NLBs accordingly; the modern replacement for the older in-tree ELB integration. | 10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md |
| Cloud CSI driver | A vendor CSI implementation that provisions/attaches cloud block or file volumes from a StorageClass (EBS / PD / Azure Disk for block-RWO; EFS / Filestore / Azure Files for file-RWX). Installed as a managed add-on or pinned Helm. | 10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md |
| EBS-CSI / PD-CSI / Azure Disk CSI | The cloud block-storage CSI drivers (ebs.csi.aws.com, pd.csi.storage.gke.io, disk.csi.azure.com) — block, RWO, zonally bound, snapshot-capable. | 10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md |
| EFS / Filestore / Azure Files | The cloud file-storage offerings (NFS-like, RWX) used when multiple Pods on multiple nodes need to share a volume. | 10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md |
| Cloud snapshot / VolumeSnapshot on cloud CSI | A CSI-orchestrated point-in-time snapshot of a cloud disk (EBS / PD / Azure Disk snapshot) realized via the VolumeSnapshot API and the cloud snapshotter. | 10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md |
| Cloud secret store (AWS Secrets Manager / GCP Secret Manager / Azure Key Vault) | Provider secret services consumed in-cluster via ESO or the CSI Secrets Store driver; the source of truth lives in the cloud, not in etcd. | 10-cloud-and-managed-kubernetes/05-cloud-storage-and-data.md |
| Cloud-managed Prometheus (AMP / GMP) | AWS Managed Prometheus and Google Managed Service for Prometheus — provider-hosted Prometheus-compatible TSDBs you remote-write to from in-cluster scrapers. | 10-cloud-and-managed-kubernetes/04-cloud-networking-and-load-balancing.md |
| Cluster Autoscaler (CA) | The classic node-level autoscaler: tied to ASGs / MIGs / VMSS, scales up when a Pod is unschedulable, down when a node is underutilized — node groups are pre-defined. | 10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md |
| Karpenter | A node-level autoscaler that provisions right-sized EC2 instances just-in-time directly from EC2 fleet APIs (no ASG), consolidates nodes, mixes spot + on-demand, and binds Pod requirements (architecture / taints / topology) into the launch decision. | 10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md |
| NodePool / EC2NodeClass (Karpenter) | Karpenter CRDs: NodePool declares Pod-targeted constraints (instance types, zones, taints, limits, weight) and disruption policy; EC2NodeClass declares the AWS-side launch template (AMI, subnets, security groups, IRSA). | 10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md |
| Consolidation (Karpenter) | The continuous right-sizing loop: Karpenter periodically replaces under-used nodes with cheaper / smaller ones if all current Pods would still fit; the structural reason Karpenter often costs less than CA. | 10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md |
| Spot / preemptible / Spot VMs | Provider-cheap reclaimable instances (AWS Spot, GCP preemptible/Spot VMs, Azure Spot VMs); used for fault-tolerant workloads paired with PDBs and disruption handlers. | 10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md |
| Multi-AZ / multi-region | The two HA tiers: spreading nodes across availability zones in one region (cheap, common); replicating clusters across regions (expensive, only for the highest tier) — different blast radii and SLO costs. | 10-cloud-and-managed-kubernetes/06-node-autoscaling-cost-multicloud.md |
| Terraform / OpenTofu | Declarative HCL infra-as-code: cluster + node pools + VPC defined as resources; remote state (S3/GCS + lock) is the production discipline. The guide's reference IaC for managed clusters. | 10-cloud-and-managed-kubernetes/02-provisioning-and-iac.md |
| eksctl / gcloud container clusters create / az aks create | The provider CLIs that stand up a managed cluster + a managed node group in one command — fastest path, weakest provenance versus IaC. | 10-cloud-and-managed-kubernetes/02-provisioning-and-iac.md |
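The VPC CNI and prefix delegation entries above imply some simple arithmetic for Pod density. A hedged sketch follows, using the commonly cited EKS formula (ENIs × (IPs per ENI − 1) + 2); the function name and the optional cap are assumptions, and the instance figures in the example (3 ENIs, 10 addresses each) are illustrative values that should be checked against current EC2 documentation.

```python
def max_pods(enis, ips_per_eni, prefix_delegation=False, cap=None):
    """Estimate schedulable Pods per node for the AWS VPC CNI."""
    # One address on each ENI is the interface's own primary IP, so Pods cannot use it.
    slots = enis * (ips_per_eni - 1)
    # With prefix delegation each slot holds a /28 prefix (16 addresses) instead of one IP.
    pod_ips = slots * 16 if prefix_delegation else slots
    # The +2 is usually explained as covering host-network Pods that need no Pod IP.
    total = pod_ips + 2
    # An operator-chosen kubelet max-pods cap (if any) bounds the final figure.
    return min(total, cap) if cap is not None else total


print(max_pods(3, 10))                                   # 29
print(max_pods(3, 10, prefix_delegation=True))           # 434
print(max_pods(3, 10, prefix_delegation=True, cap=110))  # 110
```

The last line reflects that, once prefix delegation is on, the binding limit is usually a kubelet cap chosen for node stability rather than IP supply; 110 here is an assumed cap, not a value from the guide.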
Part 11 — Advanced Production Patterns
| Term | Definition | Covered in |
|---|---|---|
| Operator pattern | A CRD + a controller that encodes operational knowledge for a stateful or complex application; "an SRE for that app, in code". (Introduced in Part 08 ch.05; built in Part 11 ch.02.) | 11-advanced-production-patterns/02-operator-development.md |
| Kubebuilder | The upstream SIG scaffolding tool for Go operators built on controller-runtime; the guide's reference framework. | 11-advanced-production-patterns/02-operator-development.md |
| Operator SDK | An alternative scaffolding tool (now sharing the controller-runtime foundation with Kubebuilder); also supports Helm/Ansible-based operators. | 11-advanced-production-patterns/02-operator-development.md |
| controller-runtime | The Go library (manager, controller, client, cache, builder, finalizer helpers) underneath both Kubebuilder and Operator SDK. | 11-advanced-production-patterns/02-operator-development.md |
| Status conditions / status subresource | The standard pattern for reporting reconcile outcome on a CR: an array of typed conditions (type, status, reason, message, lastTransitionTime) written via a dedicated /status subresource so spec/status writes don't clobber each other. | 11-advanced-production-patterns/02-operator-development.md |
| envtest | The controller-runtime integration-test harness: spins up a real local etcd + kube-apiserver (no kubelet/scheduler) so reconcile logic can be tested against a real apiserver in CI. | 11-advanced-production-patterns/02-operator-development.md |
| MutatingWebhookConfiguration / ValidatingWebhookConfiguration | The cluster-scoped API objects that register an admission webhook with the apiserver: which resources/verbs trigger it, which Service it calls, and its CA bundle. | 11-advanced-production-patterns/01-admission-webhooks.md |
| failurePolicy / sideEffects / reinvocationPolicy | Admission-webhook safety knobs: failurePolicy: Fail rejects on outage, Ignore allows; sideEffects: None is the recommended honest default; reinvocationPolicy: IfNeeded allows a mutator to be re-invoked after later mutations. | 11-advanced-production-patterns/01-admission-webhooks.md |
| APF (API Priority and Fairness) | The apiserver's built-in self-protection: incoming requests are classified by a FlowSchema into a PriorityLevelConfiguration, with shuffle-sharded fair queues — protects the apiserver from a single hot client. | 11-advanced-production-patterns/03-api-priority-and-fairness.md |
| FlowSchema | The APF classifier: match by user / SA / group + verb / resource / namespace, with matchingPrecedence ordering and a distinguisherMethod (ByUser / ByNamespace) that picks the per-flow queue. | 11-advanced-production-patterns/03-api-priority-and-fairness.md |
| PriorityLevelConfiguration | The APF queue level: nominalConcurrencyShares (relative weight of in-flight slots; named assuredConcurrencyShares before flowcontrol v1beta3), queueLengthLimit, queues, handSize, and borrowingLimitPercent; defines the queueing/serving behavior for matched flows. | 11-advanced-production-patterns/03-api-priority-and-fairness.md |
| Service mesh | An infrastructure layer (Istio / Linkerd / Cilium service mesh) that gives every service-to-service call mTLS-by-default identity, L7 traffic management (canary / mirror / retry / timeout / outlier detection), and uniform telemetry. | 11-advanced-production-patterns/04-service-mesh.md |
| Istio | A mature service mesh; Gateway / VirtualService / DestinationRule + (newer) Gateway-API HTTPRoute, with two data-plane modes: sidecar (Envoy per Pod) and ambient. | 11-advanced-production-patterns/04-service-mesh.md |
| Ambient mode (Istio) | A sidecar-less Istio data plane: an L4 ztunnel DaemonSet on every node does mTLS; an optional L7 waypoint Deployment per namespace handles HTTP-level policy — only the Pods that need L7 pay the L7 cost. | 11-advanced-production-patterns/04-service-mesh.md |
| Waypoint proxy | The L7-policy proxy in Istio ambient: a Deployment per namespace (or per SA) that traffic for that scope routes through when L7 features are needed. | 11-advanced-production-patterns/04-service-mesh.md |
| Linkerd | A lighter-weight CNCF service mesh: Rust micro-proxy sidecar, mTLS by default, simpler control plane than Istio. | 11-advanced-production-patterns/04-service-mesh.md |
| SPIFFE / SPIRE | A cloud-native standard for workload identity: SPIFFE defines the SVID (an X.509 cert or JWT bound to a workload), SPIRE is the reference issuer; the substrate for mTLS between services across clusters/providers. | 11-advanced-production-patterns/04-service-mesh.md |
| mTLS (mutual TLS) | Both peers present and verify certificates; this is the service-mesh default. In Istio, a PeerAuthentication with mtls.mode: STRICT in the root namespace enforces it mesh-wide. | 11-advanced-production-patterns/04-service-mesh.md |
| External Secrets Operator (ESO) | A controller that pulls secrets from an external store (Vault / AWS Secrets Manager / GCP Secret Manager / Azure Key Vault / 1Password / …) into Kubernetes Secrets on a schedule; defined by SecretStore / ClusterSecretStore + ExternalSecret. | 11-advanced-production-patterns/05-secrets-at-scale.md |
| SecretStore / ClusterSecretStore | ESO CRDs: the connection + auth to an external secret provider (namespaced or cluster-scoped). | 11-advanced-production-patterns/05-secrets-at-scale.md |
| ExternalSecret | An ESO CRD that declares "produce a Secret named X with these keys mapped from this SecretStore"; refreshed on refreshInterval. | 11-advanced-production-patterns/05-secrets-at-scale.md |
| Vault (HashiCorp) | A widely-used external secret store with dynamic-secret backends (DB credentials minted on demand), Transit (encryption-as-a-service), PKI, and a Kubernetes auth method that exchanges a projected SA token for a Vault token. | 11-advanced-production-patterns/05-secrets-at-scale.md |
| Vault Agent Injector | A Vault mutating webhook that injects an init/sidecar Vault Agent into a Pod to render secrets to a file (instead of into a Kubernetes Secret); an alternative path to ESO. | 11-advanced-production-patterns/05-secrets-at-scale.md |
| CSI Secrets Store driver | A CSI driver (secrets-store.csi.x-k8s.io) that mounts secrets from external stores directly into a Pod as a volume (no Secret materialized), with optional secretObjects to sync into a real Secret. | 11-advanced-production-patterns/05-secrets-at-scale.md |
| Dynamic secret | A short-lived credential minted on request (e.g. a Vault DB-engine username/password valid for 1h); contrast with a long-lived static secret in etcd. | 11-advanced-production-patterns/05-secrets-at-scale.md |
| Multi-cluster / fleet | Running more than one cluster as a coordinated set, for blast-radius / region / hard-tenancy / regulatory reasons; topologies include per-env, per-region, per-tenant, and hub-and-spoke. | 11-advanced-production-patterns/06-multi-cluster-and-fleet.md |
| Argo CD ApplicationSet | A template + generator (list / cluster / git / matrix / merge / scm / pull-request) that produces N Applications from one declaration — the GitOps multi-cluster primitive. | 11-advanced-production-patterns/06-multi-cluster-and-fleet.md |
| Karmada | A CNCF multi-cluster orchestrator: workloads declared once at the Karmada control plane and propagated to member clusters via PropagationPolicy / OverridePolicy. | 11-advanced-production-patterns/06-multi-cluster-and-fleet.md |
| Cluster API (CAPI) | A Kubernetes-style declarative API for provisioning and managing whole clusters as objects, with provider implementations (CAPA / CAPG / CAPZ / CAPV / …). | 11-advanced-production-patterns/06-multi-cluster-and-fleet.md |
| Hub-and-spoke vs leader-and-followers | Two multi-cluster topologies: hub-and-spoke = one management cluster pushing to many workload clusters (Argo CD ApplicationSet, Karmada); leader-and-followers = symmetric clusters with one elected leader (Submariner, KubeFed-style). | 11-advanced-production-patterns/06-multi-cluster-and-fleet.md |
| Chaos engineering | Disciplined experimentation that defines a steady-state hypothesis, bounds the blast radius, injects a failure, observes, and learns; not random breakage. | 11-advanced-production-patterns/07-chaos-engineering.md |
| Chaos Mesh | A CNCF chaos-engineering platform with rich experiment CRDs: PodChaos (kill/failure/container-kill), NetworkChaos (latency/loss/partition), StressChaos (CPU/memory), IOChaos, TimeChaos, and a Workflow to chain them. (Also see 13 ch.12 for the quarterly chaos game-day discipline on the v2 platform.) | 11-advanced-production-patterns/07-chaos-engineering.md |
| Litmus | Another CNCF chaos-engineering platform (ChaosExperiment / ChaosEngine / ChaosResult); a sibling option to Chaos Mesh in the same space. | 11-advanced-production-patterns/07-chaos-engineering.md |
| Steady-state hypothesis | The pre-experiment statement of what "normal" looks like (latency / error rate / throughput); the experiment is judged by whether reality stays within it. | 11-advanced-production-patterns/07-chaos-engineering.md |
| Blast radius | The bounded scope of a chaos experiment (one Pod / one namespace / one zone) — the discipline that distinguishes engineering from breakage. | 11-advanced-production-patterns/07-chaos-engineering.md |
| HA control plane | The configuration where the apiserver and etcd are replicated (typically across 3 nodes / 3 zones) so a single-node failure does not lose the cluster; the deployment topology behind every managed cluster's control-plane SLA. | 11-advanced-production-patterns/08-ha-control-plane-and-etcd.md |
| Stacked vs external etcd | Two HA topologies: stacked = etcd colocated on the control-plane nodes (default kubeadm HA); external = a dedicated etcd cluster on its own VMs (more isolation, more nodes to operate). | 11-advanced-production-patterns/08-ha-control-plane-and-etcd.md |
| etcd raft quorum | The majority needed to commit a write to etcd: ⌈(N+1)/2⌉ — 3 nodes tolerate 1 loss, 5 tolerate 2; sizing the etcd cluster picks a fault tolerance vs latency point. | 11-advanced-production-patterns/08-ha-control-plane-and-etcd.md |
| etcd defragmentation | etcdctl defrag reclaims space inside an etcd member's MVCC store after key compaction; routine maintenance to keep DB size and latency bounded. | 11-advanced-production-patterns/08-ha-control-plane-and-etcd.md |
| Watch cache | The apiserver in-memory cache of watched-resource state that serves most LIST/WATCH from memory; defended by APF and tuned for control-plane scalability. | 11-advanced-production-patterns/09-performance-and-scalability.md |
| kube-proxy modes | iptables vs IPVS vs (CNI-replaced) eBPF: how Service VIPs become real packet steering on a node — different scalability and latency profiles. | 11-advanced-production-patterns/09-performance-and-scalability.md |
| eBPF / Cilium dataplane | An in-kernel programmable dataplane that replaces iptables-based kube-proxy and parts of CNI plumbing with eBPF programs; supports network policies, service routing, and observability with lower per-packet overhead. | 11-advanced-production-patterns/09-performance-and-scalability.md |
| Conntrack table | The kernel's connection-tracking table that iptables/IPVS Service routing depends on; an under-sized nf_conntrack_max is a classic source of dropped packets and connection timeouts under load. | 11-advanced-production-patterns/09-performance-and-scalability.md |
| Pod-startup latency | The end-to-end time from kubectl apply to Ready=true: schedule + image pull + container start + probes; tuned via image size, pull policy, warm caches, and topology. | 11-advanced-production-patterns/09-performance-and-scalability.md |
| Platform engineering | The discipline of building an Internal Developer Platform on top of Kubernetes so application teams get self-service + guardrails (a "paved road") without learning every primitive themselves. | 11-advanced-production-patterns/10-platform-engineering.md |
| Internal Developer Platform (IDP) | The packaged product platform teams ship to developers: a curated set of self-service abstractions backed by Kubernetes, observability, CI/CD, secrets, and policy. | 11-advanced-production-patterns/10-platform-engineering.md |
| Golden path / paved road | The opinionated default way to build, ship, and run a service on the platform — easy to follow, hard to escape; the productisation outcome of platform engineering. | 11-advanced-production-patterns/10-platform-engineering.md |
| Crossplane | A control-plane-for-infra: install Providers (AWS / GCP / Azure / Helm / Kubernetes), define a CompositeResourceDefinition (XRD) for a high-level "API" your platform offers, and back it with a Composition of provider resources; users create a claim and Crossplane reconciles cloud infra. (Also see 13 ch.02 for the BookstoreTenant Composition used to onboard a tenant in one kubectl apply.) | 11-advanced-production-patterns/10-platform-engineering.md |
| XRD / XR / claim (Crossplane) | CompositeResourceDefinition (the platform-team-authored API), XR (the cluster-scoped composite resource instance), and claim (the namespaced user-facing handle to it). | 11-advanced-production-patterns/10-platform-engineering.md |
| Composition (Crossplane) | The template that maps an XR to a set of provider resources; the unit of platform-engineer authorship. | 11-advanced-production-patterns/10-platform-engineering.md |
| Backstage | A Spotify-originated open-source developer portal: a software catalog + plugin framework + templates for scaffolding services; the canonical IDP UI. (Also see 13 ch.11 for Scaffolder + Software Catalog + TechDocs end-to-end on the v2 platform.) | 11-advanced-production-patterns/10-platform-engineering.md |
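
The quorum arithmetic in the "etcd raft quorum" entry above can be checked with a few lines of Python. This is only a sketch of the majority rule; the function names `quorum` and `fault_tolerance` are illustrative and are not part of etcd or etcdctl.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    # floor(N/2) + 1, which equals ceil((N+1)/2) for every positive N.
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a quorum still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)}")
```

Running it shows why odd sizes are the usual choice: a 4-member cluster needs a quorum of 3 and still tolerates only one failure, the same as a 3-member cluster, while adding one more voter to every write.
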
Part 12 — Kubernetes for Machine Learning
| Term | Definition | Covered in |
|---|---|---|
| GPU (in Kubernetes) | A device exposed to Pods via the device-plugin model as an extended resource (nvidia.com/gpu, amd.com/gpu); countable, not overcommittable, scheduled like a request but provided whole. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| Device plugin | A node-local gRPC plugin (DaemonSet) the kubelet talks to (ListAndWatch + Allocate) to discover, advertise, and assign devices like GPUs / TPUs / FPGAs as extended resources. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| NVIDIA GPU Operator | A meta-operator that installs and lifecycle-manages the full NVIDIA GPU stack on a cluster: driver, container toolkit, device plugin, DCGM exporter, MIG manager, and NFD/GFD labels. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| NFD (Node Feature Discovery) | A DaemonSet that introspects each node (CPU features, kernel, PCI devices, …) and labels it with feature.node.kubernetes.io/* so scheduling can target capability, not just instance type. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| GFD (GPU Feature Discovery) | NFD's GPU companion: applies nvidia.com/gpu.product, gpu.memory, MIG / driver-version labels — what training jobs actually pin against. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| DCGM (Data Center GPU Manager) | NVIDIA's GPU telemetry stack; in Kubernetes shipped as the dcgm-exporter DaemonSet that exposes per-GPU metrics (utilization, memory, throttling, ECC) for Prometheus. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| MIG (Multi-Instance GPU) | NVIDIA hardware partitioning (A100/H100) that splits one physical GPU into up to seven isolated instances; each appears to Kubernetes as a separate schedulable GPU with its own memory + SM slice. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| MPS (Multi-Process Service) | NVIDIA's software CUDA-context multiplexing: multiple processes share a GPU concurrently (lower isolation than MIG, less wasted GPU); device-plugin support is opt-in. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| Time-slicing (GPU) | The simplest GPU sharing — the device plugin advertises N "GPUs" per physical GPU and time-multiplexes them; no isolation, only for dev / notebook workloads. | 12-kubernetes-for-machine-learning/02-gpus-and-accelerators.md |
| Gang scheduling | All-or-nothing scheduling: either the whole set of N coordinated workers is admitted at once, or none — prevents the partial-placement deadlock that breaks distributed-ML jobs. | 12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md |
| JobSet | A kubernetes-sigs CRD (developed with the Batch working group) for a coordinated group of Jobs (replicated jobs, startup ordering, success/failure policies); the multi-node-training primitive that gang-scheduling layers sit on. | 12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md |
| Kueue | A Kubernetes batch-job queue manager: Workloads wait in LocalQueues that flow into ClusterQueues with quotas per ResourceFlavor; gates admission so big jobs run only when whole-job resources are available. | 12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md |
| ResourceFlavor / ClusterQueue / LocalQueue / Workload | Kueue CRDs: a flavor is a labeled / tainted node class (e.g. GPU A100 spot); a ClusterQueue sets quota over flavors and borrowing; a LocalQueue is the per-namespace entry point; a Workload is the queued job. | 12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md |
| Volcano | A batch / gang-scheduling system that grew out of kube-batch; an alternative scheduler (schedulerName: volcano) with PodGroup semantics, queues, and fair-share — popular in HPC/ML shops. | 12-kubernetes-for-machine-learning/03-batch-and-gang-scheduling.md |
| Kubeflow Training Operator | An operator providing per-framework training CRDs that fan out and coordinate workers: PyTorchJob, TFJob, MPIJob, PaddleJob, XGBoostJob. Standard for multi-worker training on Kubernetes. | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| PyTorchJob / TFJob / MPIJob / PaddleJob | Training-operator CRDs that declare master/worker replica counts, the framework's rendezvous wiring, and shared envs; the operator handles failure semantics + cleanup. | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| KubeRay | The Kubernetes operator for Ray: RayCluster (a head + worker Pods), RayJob (submit a job + cluster lifecycle), RayService (online serving). The canonical Ray-on-Kubernetes deployment. | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| RayCluster / RayJob / Ray Train | KubeRay objects + library: RayCluster is the long-lived head+workers; RayJob is a transient job + ephemeral cluster; Ray Train is Ray's distributed-training library (Torch/TF/XGBoost backends). | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| torchrun / Horovod / NCCL | Distributed-training launchers / libraries: torchrun is PyTorch's rendezvous + worker launcher; Horovod is a multi-framework allreduce library (superseded in many shops by native DDP/FSDP); NCCL is NVIDIA's GPU-to-GPU collective communication library (allreduce / allgather). | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| Allreduce / rendezvous | The two pillars of distributed training: rendezvous = workers discover each other and pick a master at startup; allreduce = the synchronous gradient-aggregation collective that all workers participate in each step. | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| cleanPodPolicy | Kubeflow Training Operator field controlling whether worker Pods are deleted after a job finishes (All / Running / None); useful for log retention vs scheduler turnover. | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| Elastic training | Training that tolerates workers joining / leaving mid-run (PyTorch Elastic / torch.distributed.elastic) — used on spot-heavy clusters. | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| Checkpointing | Periodically saving training state to a PV or object store so a job can resume after a worker / node loss; the "make distributed training cheap on spot" trick. | 12-kubernetes-for-machine-learning/04-distributed-training.md |
| JupyterHub on Kubernetes (z2jh) | Zero-to-JupyterHub: the official Helm-deployed JupyterHub stack — hub + configurable-http-proxy + per-user singleuser server Pods spawned by KubeSpawner. | 12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md |
| KubeSpawner | The JupyterHub spawner that creates per-user singleuser server Pods (+ PVC) from a profile selection; the bridge between hub auth and Kubernetes. | 12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md |
| singleuser server / profileList | The per-user notebook Pod, and the hub config that offers an image + resource menu (CPU notebook / small GPU / large GPU / R / Spark) the user picks at spawn time. | 12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md |
| Kubeflow Notebooks | The Kubeflow-native alternative to JupyterHub: Notebook CRD + controller that creates per-user notebook Pods inside Kubeflow's profile-scoped namespaces. | 12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md |
| Data gravity | The principle that compute should run where the data lives because moving large datasets dominates cost / latency / egress; drives co-location of training and storage. | 12-kubernetes-for-machine-learning/05-notebooks-and-interactive.md |
| KServe | A Kubernetes-native model-serving platform: InferenceService + ServingRuntime deliver low-latency inference with autoscaling (Knative-serverless or raw Deployment), canary/shadow, transformer + explainer + predictor pipeline. | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| InferenceService | The user-facing KServe CRD: declares model URI, framework (sklearn/tensorflow/pytorch/triton/…), traffic split, canary, and the runtime / predictor shape. | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| ServingRuntime | A KServe CRD that defines a reusable per-framework runtime (container image, supported model formats, resource defaults); separates the "how to serve" from "what to serve". | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| Knative Serving (KServe serverless mode) | KServe's default mode: built on Knative Serving's autoscaler — scale-to-zero, request-load-driven scaling, revision history. | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| RawDeployment mode (KServe) | KServe's alternative mode: plain Deployment + HPA (no Knative), used when scale-to-zero is unwanted or Knative is not installed. | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| Predictor / transformer / explainer | The three components of a KServe InferenceService, each running as its own Pods: the required predictor (the model), the optional transformer (pre/post-processing), and the optional explainer (interpretability). | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| Seldon Core | A sibling model-serving platform with a richer DAG/pipeline model (SeldonDeployment / Core v2 Model+Pipeline); referenced as an alternative to KServe. | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| NVIDIA Triton Inference Server | A high-performance multi-framework inference server (TensorRT / PyTorch / ONNX / Python backend, dynamic batching, ensembles) used as the ServingRuntime behind KServe / Seldon for GPU inference. | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| Model canary / A-B / shadow | Progressive delivery at the model layer: a small traffic fraction to the new model (canary), an explicit split between models (A-B), or a copy of traffic without using its response (shadow) — KServe's canaryTrafficPercent is the canonical knob. | 12-kubernetes-for-machine-learning/06-model-serving-and-inference.md |
| Argo Workflows | A CNCF Kubernetes-native workflow engine: Workflow (one run), WorkflowTemplate / ClusterWorkflowTemplate (reusable definitions), CronWorkflow (scheduled); DAG or Steps templates, parameter + artifact passing, retries, artifact stores (PVC / S3 / GCS / Azure). | 12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md |
| WorkflowTemplate / ClusterWorkflowTemplate | Argo Workflows CRDs holding reusable workflow definitions (namespaced or cluster-scoped) referenced from Workflows via workflowTemplateRef. | 12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md |
| CronWorkflow | A cron-scheduled Argo Workflow; the "nightly retrain at 02:00" / "hourly batch inference" primitive. | 12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md |
| Argo Events | The event-driven side of Argo: EventBus + EventSource + Sensor that trigger Workflows from S3 puts, Kafka, GitHub webhooks, schedules, etc. — turns pipelines into reactive systems. | 12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md |
| Kubeflow Pipelines (KFP) | The ML-pipeline DSL + backend in Kubeflow: pipelines defined in a Python DSL, compiled to a pipeline spec, executed by an orchestrator (the v2 backend uses Argo Workflows). KFP v1 vs v2 differ in SDK + IR / metadata. | 12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md |
| Tekton Pipelines | A Kubernetes-native CI/CD pipeline engine (Task / Pipeline / PipelineRun); a sibling option to Argo Workflows often used for CI rather than ML pipelines (mentioned for contrast). | 12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md |
| Katib | Kubeflow's hyperparameter-tuning operator: Experiment (the search), Suggestion (the proposed candidates from the algorithm), Trial (each candidate run). Supports grid / random / Bayesian / HyperBand / NAS. | 12-kubernetes-for-machine-learning/07-ml-pipelines-and-workflows.md |
| MLflow | An MLOps tracking + model-registry tool: mlflow.log_* records params / metrics / artifacts per run; the registry promotes models through stages (None / Staging / Production / Archived) — the "where do my trained models go" answer. (Also see 13 ch.08 for the full closed loop: MLflow Registry → KServe canary → Alibi-Detect drift → Argo Events retrain.) | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| Model registry | A versioned store of trained model artifacts + metadata + lineage + stage (e.g. MLflow Registry, KServe-integrated registries) — the source-of-truth a serving system pulls from. | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| ML lineage | The traceable graph of data version → code commit → training run → metrics → produced model → serving deployment; captured via MLflow + KFP metadata so an incident on a prod model can be traced back to its inputs. | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| Model drift / data drift | Distributions in production diverging from training: data drift = input feature distribution moves; model drift = predictive performance degrades. Triggers retraining. (Also see 13 ch.08 for the drift-detected → Argo Events Sensor → retrain Workflow loop.) | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| Alibi-Detect / Evidently | Open-source drift / outlier detection libraries; deployed in-cluster (often as a KServe component or a sidecar) to compute drift metrics over a production traffic window. | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| MLOps maturity (L0–L3) | A four-level maturity ladder in the style of Google's MLOps maturity levels (Google's own write-up defines levels 0–2): L0 = manual notebook; L1 = ML code in CI/CD; L2 = ML pipeline in CI/CD with automated training; L3 = full continuous training + monitoring + auto-retraining. | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| OpenCost / Kubecost | The CNCF cost-allocation tool (OpenCost = spec + open-source; Kubecost = commercial product around it) that attributes cluster spend (CPU / memory / GPU / PV / network) to namespaces / labels / workloads — drives per-tenant cost. (Also see 13 ch.10 for the per-tenant + per-cluster + per-region story plus showback-vs-chargeback + FinOps maturity.) | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| Per-tenant cost (ns = team = cost center) | The cost-allocation discipline: one namespace per team / project, labels team / cost-center on every workload, OpenCost rolls up the bill — the cleanest path to chargeback / showback. | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| Feature store (Feast / Tecton) | A serving + offline store of curated features (low-latency online lookups in serving + batch joins in training) — solves train/serve skew. Mentioned in Part 12 ch.08 "what we didn't build". | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| Data versioning (DVC / LakeFS / Pachyderm) | Tools that version datasets the way Git versions code, so a training run's inputs are reproducible; mentioned in Part 12 ch.08 "what we didn't build". | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
| Federated learning (Flower / FedML) | A training paradigm where models are trained on decentralized data without aggregating it (each client trains locally, only weights / gradients are shared); mentioned as a frontier in Part 12 ch.08. | 12-kubernetes-for-machine-learning/08-ml-platform-cost-and-mlops.md |
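
The all-or-nothing rule in the "Gang scheduling" entry above can be illustrated with a toy admission check. This is a minimal sketch under simplifying assumptions (GPUs are the only resource, nodes are visited in name order); the function `try_admit_gang` and the node names are hypothetical and do not reflect how Kueue or Volcano are implemented.

```python
from typing import Dict, Optional


def try_admit_gang(
    free_gpus: Dict[str, int], workers: int, gpus_per_worker: int
) -> Optional[Dict[str, int]]:
    """Place all workers or none.

    On success, returns {node: workers placed} and deducts the GPUs from
    `free_gpus`. On failure, returns None and leaves `free_gpus` untouched.
    """
    if workers < 1 or gpus_per_worker < 1:
        raise ValueError("workers and gpus_per_worker must be positive")
    remaining = workers
    placement: Dict[str, int] = {}
    # Plan against current capacity first; nothing is committed yet.
    for node, free in sorted(free_gpus.items()):
        if remaining == 0:
            break
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
    if remaining > 0:
        return None  # partial placement is never committed
    for node, count in placement.items():
        free_gpus[node] -= count * gpus_per_worker
    return placement


if __name__ == "__main__":
    cluster = {"node-a": 2, "node-b": 1}
    print(try_admit_gang(cluster, workers=4, gpus_per_worker=1))
    print(try_admit_gang(cluster, workers=3, gpus_per_worker=1))
    print(cluster)
```

The first call is refused without reserving anything, so the three free GPUs stay available for a job that fits. A scheduler that placed three of the four workers instead would hold those GPUs idle while the job waits for a fourth, which is the partial-placement deadlock the entry describes.
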
Part 13 — Grand Capstone: Bookstore Platform v2 terms
| Term | Definition | Covered in |
|---|---|---|
| Bookstore Platform v2 | The Part 13 production artifact: the v1 Bookstore re-shaped along seven dimensions (multi-tenant + multi-region active-active + Keycloak OIDC + IRSA + mesh JWT + CDC-driven search + Kafka outbox payments + edge WAF + closed ML loop + three-pillar OTel + OpenCost FinOps + Backstage IDP + day-2 runbook); the production reality v1 was the teaching artifact for. | 13-grand-capstone-bookstore-platform/01-bookstore-2-from-toy-to-platform.md |
| Tenant (v2 sense) | A bookstore owner on the v2 platform — concretely a namespace + a Kueue queue + per-tenant cloud resources (S3 bucket, RDS read-replica) + a Backstage catalog entry + a Crossplane claim — onboarded in one kubectl apply of a BookstoreTenant claim. | 13-grand-capstone-bookstore-platform/02-tenancy-and-crossplane-onboarding.md |
| BookstoreTenant Composition | The Crossplane Composition that backs the BookstoreTenant XR/claim: one declarative apply fans out into a Namespace, ResourceQuota, LocalQueue, Argo CD Application, S3 bucket, RDS read-replica, IRSA Role, Backstage Component, and observability dashboards. Builds on 11 ch.10. | 13-grand-capstone-bookstore-platform/02-tenancy-and-crossplane-onboarding.md |
| Active-active multi-region | A topology where N regions all serve user traffic concurrently (vs active-passive failover); each region runs the full stack, data replicates between them, DNS routes by latency, and a region failure causes only a brief DNS shift. | 13-grand-capstone-bookstore-platform/03-multi-region-active-active.md |
| CloudNativePG ReplicaCluster | The CloudNativePG CRD for cross-region streaming replication: one region runs the primary Cluster, others run ReplicaClusters that follow it via Postgres streaming replication; promotion is a controlled spec edit. | 13-grand-capstone-bookstore-platform/03-multi-region-active-active.md |
| Latency-based DNS / region affinity | DNS records (Route 53 latency policy / Cloud DNS load-balancing / Azure Traffic Manager performance routing) that resolve each client to the lowest-latency healthy region; the user-facing piece of active-active. | 13-grand-capstone-bookstore-platform/03-multi-region-active-active.md |
| Keycloak | An open-source OIDC + SAML identity provider with realms, clients, users, groups, federation (LDAP / social), and an admin console; the v2 platform's IdP for human auth. | 13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md |
| OIDC code+PKCE flow | The OAuth 2.0 / OIDC flow used by browser + mobile + SPA clients: client redirects to IdP with a PKCE code-verifier hash; IdP authenticates the user and returns an authorization code; client exchanges code + verifier for tokens — no client secret in the browser. | 13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md |
| JWKS (JSON Web Key Set) | The IdP's published set of public signing keys at a well-known URL (commonly /.well-known/jwks.json; the exact location is advertised as jwks_uri in the OIDC discovery document); verifiers (Istio, API gateways, services) fetch and cache JWKS to validate JWT signatures without sharing a secret with the IdP. | 13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md |
| RequestAuthentication (Istio) | The Istio CRD that declares which JWT issuers + JWKS endpoints the mesh accepts on traffic to a workload; requests carrying an invalid token get 401, while requests with no token pass through unless an AuthorizationPolicy requires one; verified claims become request attributes for AuthorizationPolicy. | 13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md |
| AuthorizationPolicy (Istio) | The Istio CRD that allows / denies / logs requests based on source identity (mTLS principal), JWT claims, method/path, and headers; the mesh's L7 authZ checked after RequestAuthentication validates the token. | 13-grand-capstone-bookstore-platform/04-real-auth-keycloak-irsa-istio-jwt.md |
| Meilisearch | An open-source full-text search engine with typo tolerance, faceting, and synonyms; deployed on Kubernetes as the v2 product-discovery backend, indexed from Postgres via Debezium CDC. | 13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md |
| Debezium | An open-source CDC platform that reads database transaction logs (Postgres logical replication / MySQL binlog / MongoDB oplog) and emits row-level change events to Kafka; the canonical Postgres → Kafka bridge for the outbox pattern. | 13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md |
| CDC (Change Data Capture) | The pattern of streaming row-level inserts / updates / deletes out of a database as events, typically from its WAL / binlog — moves data without dual-writes or polling, and turns the DB into the source of truth for downstream consumers (search index, analytics, caches). | 13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md |
| Strimzi | The Kubernetes operator for Apache Kafka: CRDs for Kafka (cluster), KafkaTopic, KafkaUser, KafkaConnect, KafkaConnector, KafkaBridge, KafkaMirrorMaker2; the canonical "Kafka on Kubernetes" install used in v2. | 13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md |
| KafkaConnect (Strimzi CRD) | A Strimzi CRD that runs Kafka Connect workers in a cluster; the runtime that hosts source / sink connectors (e.g. Debezium Postgres source, Meilisearch sink). | 13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md |
| KafkaConnector (Strimzi CRD) | The Strimzi CRD declaring one connector instance (class + config) inside a KafkaConnect cluster: e.g. io.debezium.connector.postgresql.PostgresConnector for CDC ingest, a Meilisearch sink for indexing. | 13-grand-capstone-bookstore-platform/05-search-and-product-discovery.md |
| Outbox pattern | A reliable cross-system event-emission pattern: the same DB transaction that mutates business state also inserts a row into an outbox table; a CDC process (Debezium) streams those rows to Kafka. Avoids the dual-write inconsistency between DB and message bus. | 13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md |
| Saga compensation | The compensating-transaction half of the saga pattern: when a multi-step distributed workflow fails partway, each completed step has a published "undo" step (refund a payment, release stock, cancel a shipment) that is run in reverse order; replaces 2PC across services. | 13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md |
| Stripe sandbox / webhook signature verification | Stripe's test-mode environment (sk_test_… keys, tok_visa etc.) and the obligatory verification step on inbound webhooks: every webhook carries a Stripe-Signature header holding a timestamp and an HMAC-SHA256, keyed with the endpoint's signing secret, over the timestamp and raw body; the v2 payments-worker rejects any webhook without a valid Stripe-Signature within a 5-minute window. | 13-grand-capstone-bookstore-platform/06-payments-and-event-sourcing.md |
| HTTPRoute (Gateway API) | The Gateway API CRD that attaches HTTP routing rules (path / header / method matches, weighted backends, filters, timeouts) to a Gateway listener; the L7 routing primitive at the v2 edge. | 13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md |
| Coraza | An open-source Go re-implementation of the ModSecurity rules engine; runs inside Envoy / Caddy / nginx as a WAF that loads OWASP CRS rules and produces audit logs in the ModSec format. The v2 edge plugs Coraza into Istio as a Wasm filter. | 13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md |
| WAF (Web Application Firewall) | An L7 inspection layer that blocks common attacks (SQLi, XSS, path traversal, scanner fingerprints) before requests reach the app; in v2 deployed as Coraza+OWASP CRS at the Istio edge with a per-request anomaly-score threshold. | 13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md |
| OWASP CRS (Core Rule Set) | The canonical open-source WAF rule set maintained by OWASP: signatures + anomaly-score rules for SQL injection, XSS, LFI/RFI, RCE, scanners, and protocol violations; consumed by ModSecurity and Coraza. | 13-grand-capstone-bookstore-platform/07-edge-gateway-waf-rate-limiting.md |
| MLflow Model Registry | MLflow's versioned model store layered on top of run-tracking: each registered model has versions in stages (None / Staging / Production / Archived) and webhook transitions; KServe InferenceServices pin against the registry URI rather than a raw artifact path. | 13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md |
| Alibi-Detect | An open-source drift / outlier / adversarial-example detection library (Seldon project); deployed in v2 as a sidecar that computes drift scores over a rolling production window and fires a Kafka event when a threshold is breached. | 13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md |
| Drift detection (data / model / concept) | Continuous monitoring of model inputs and outputs for distribution shift: data drift = input features move; model drift = prediction-quality metrics degrade against a holdout; concept drift = the relationship between inputs and the true target changes — each is detected differently and each demands a retrain. | 13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md |
| InferenceService canary (canaryTrafficPercent) | The KServe field on InferenceService.spec.predictor that routes a percentage of inference traffic to the new revision while the rest stays on the previous one; promoted by raising the percentage or rolled back by clearing the canary spec. | 13-grand-capstone-bookstore-platform/08-real-ml-loop-training-registry-serving-drift.md |
| OpenTelemetry Collector | The vendor-neutral OTel pipeline daemon: receivers (OTLP / Prometheus / Jaeger / etc.), processors (batch / attributes / tail-sampling), and exporters (OTLP / Prometheus remote-write / Tempo / Loki) — deployed in v2 as a DaemonSet (agent) + a Deployment (gateway). | 13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md |
| OTLP (OpenTelemetry Protocol) | The native wire protocol of OpenTelemetry (gRPC or HTTP/Protobuf, ports 4317 / 4318) that carries traces, metrics, and logs from SDKs and Collectors; the v2 standard for emit + collector-to-collector. | 13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md |
| Grafana Tempo | An open-source, horizontally scalable, object-storage-backed trace store (Grafana Labs) that ingests OTLP traces and serves them to Grafana via TraceQL; the v2 trace backend. | 13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md |
| Grafana Loki | An open-source log aggregation system (Grafana Labs) that indexes only labels (not log content) and stores compressed chunks in object storage; the v2 log backend, queried via LogQL. | 13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md |
| Grafana variable templating | Grafana's dashboard variables ($tenant, $region, $service, …) that turn one dashboard into N — a single panel definition rendered per tenant or per region by pivoting on the selected variable, so the v2 platform team builds each dashboard once, not N times. | 13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md |
| Alertmanager inhibition | Prometheus Alertmanager's rule that one firing alert suppresses notifications for others (e.g. a RegionDown alert inhibits every per-service alert from that region) — prevents pager storms when one root cause fans out into many symptoms. | 13-grand-capstone-bookstore-platform/09-observability-otel-tempo-loki-prometheus-grafana.md |
| FinOps Foundation maturity (Inform / Optimize / Operate) | The three-phase maturity model from the FinOps Foundation: Inform = accurate cost allocation + visibility; Optimize = right-sizing, spot, savings plans, idle removal; Operate = governance, budgets, automation, accountability. The v2 cost chapter ladders through all three. | 13-grand-capstone-bookstore-platform/10-cost-opencost-per-tenant-finops.md |
| Showback vs chargeback | Two cost-allocation postures: showback shows each team its bill but doesn't transfer money (information only); chargeback actually moves budget between teams. Most platforms start with showback to build trust, graduate to chargeback once allocation is provably right. | 13-grand-capstone-bookstore-platform/10-cost-opencost-per-tenant-finops.md |
| Backstage Software Catalog | Backstage's central registry of components / APIs / resources / systems / domains: each entity is a YAML file in Git, ingested by a LocationEntityProvider or ScaffolderEntityProvider; the v2 catalog is seeded from Argo CD Applications + Crossplane claims. | 13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md |
| Backstage Scaffolder | The Backstage subsystem that runs templated "create a new X" flows: a user fills a form (service name, owner, repo, tier) and the Scaffolder runs steps (fetch template, render, push to Git, register Component, kick CI) — the v2 golden path for new microservices. | 13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md |
| Backstage TechDocs (MkDocs integration) | Backstage's docs-as-code subsystem: each Component's repo carries a mkdocs.yml + docs/; the TechDocs builder renders MkDocs into static HTML and Backstage serves it inline against the catalog entity. | 13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md |
| Backstage plugin model | Backstage as a Node.js app with a typed plugin API: frontend plugins are React extensions registered into the app's routing/sidebar; backend plugins are Express-style routers; built-in plugins integrate Argo CD, Kubernetes, GitHub, Prometheus, PagerDuty, costs. The v2 platform composes the IDP from official plugins. | 13-grand-capstone-bookstore-platform/11-backstage-developer-portal-idp.md |
| Runbook | The fixed-shape on-call artifact for one alert: page → check (the four things) → diagnose (the symptom tree) → mitigate (the smallest action that restores serve) → postmortem (the follow-up); v2 ships a runbook per alert, all referenced from the alert annotation. | 13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md |
| On-call rotation (primary / secondary) | The duty schedule: a primary engineer carries the pager for the week; a secondary backs them up (acks if primary doesn't in N minutes, joins long incidents); follow-the-sun rotations spread the load across timezones — v2 codifies the structure, not just the schedule. | 13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md |
| DR drill / RTO / RPO | The rehearsed disaster-recovery exercise plus its two targets: RTO (recovery time objective) = how long until the service is back up; RPO (recovery point objective) = how much data loss is acceptable. v2 runs a monthly DR drill measuring both. | 13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md |
| Chaos game-day | A scheduled, blast-radius-bounded chaos-engineering exercise: a hypothesis is declared (e.g. "killing region us-east-1 shifts traffic in <2 minutes"), Chaos Mesh experiments run, observability is watched, results feed back into the runbook. v2 schedules one quarterly. | 13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md |
| Blameless postmortem | The post-incident document that explains what happened, in what timeline, with what impact, with what contributing factors, and with action items owned by a name and a date — without naming-and-blaming individuals; the only kind of postmortem an SRE org publishes. Also see 15 ch.10 for the production-grade 48h-draft / 5-day-publish discipline. | 13-grand-capstone-bookstore-platform/12-day-2-runbook-on-call-dr-chaos.md |
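
As a worked illustration of the DR drill / RTO / RPO entry above, the sketch below scores one drill against its two targets. The function names, the timestamps, and the 30-minute / 5-minute targets are hypothetical and chosen only for the example; the guide's own drill tooling may measure these differently.

```python
from datetime import datetime, timedelta


def drill_results(last_good_write: datetime,
                  outage_start: datetime,
                  service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (measured recovery time, measured data-loss window) for one drill."""
    recovery_time = service_restored - outage_start    # compared against the RTO
    data_loss_window = outage_start - last_good_write  # compared against the RPO
    return recovery_time, data_loss_window


def meets_targets(recovery_time: timedelta, data_loss_window: timedelta,
                  rto: timedelta, rpo: timedelta) -> bool:
    """A drill passes only if both measurements are within their objectives."""
    return recovery_time <= rto and data_loss_window <= rpo


# Hypothetical drill: last replicated write at 09:58, outage injected at 10:00,
# service answering again at 10:25.
rt, dl = drill_results(
    last_good_write=datetime(2025, 1, 15, 9, 58),
    outage_start=datetime(2025, 1, 15, 10, 0),
    service_restored=datetime(2025, 1, 15, 10, 25),
)
passed = meets_targets(rt, dl, rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
print(f"recovery time {rt}, data-loss window {dl}, passed={passed}")
```

Keeping the two measurements separate matters because a drill can meet its RTO while missing its RPO, or the reverse.
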
Part 14 — EKS in Production: A-Z terms
| Term | Definition | Covered in |
|---|---|---|
| use_lockfile = true | Terraform 1.10+ native S3 backend state-locking flag: instead of the legacy DynamoDB lock table, the lock is a sibling *.tflock object next to the state file in S3 — one fewer cloud resource to provision, and no IAM glue between the state bucket and a lock table. | 14-eks-in-production-a-to-z/01-terraform-state-in-production.md |
| Bucket bootstrap pattern | The chicken-and-egg fix for Terraform-managing-its-own-state: a tiny bootstrap-state.sh (AWS CLI, not Terraform) creates the S3 bucket + KMS key + versioning + lifecycle once; thereafter the bucket holds its own state and Terraform takes over. | 14-eks-in-production-a-to-z/01-terraform-state-in-production.md |
| terraform state commands | The four state-surgery subcommands every operator needs at least once: list (enumerate), show (inspect), rm (drop without destroying the cloud resource), and mv / replace-provider (relocate or rebind addresses) — used when a refactor or provider migration outpaces a plain apply. | 14-eks-in-production-a-to-z/01-terraform-state-in-production.md |
| Terraform workspaces (vs Terragrunt vs separate roots) | Three ways to slice one Terraform tree into many environments: built-in workspaces (one root, many state files, shared variables) vs Terragrunt (a wrapper that DRY-templates separate roots with locked module versions) vs separate root modules (one directory per environment, fully duplicated, fully isolated); the chapter picks separate roots for blast-radius isolation. | 14-eks-in-production-a-to-z/01-terraform-state-in-production.md |
| EKS standard support window | The 14-month period during which AWS patches an EKS minor version at the published k8s SLA (security CVEs, kube-apiserver/kubelet bugfixes) at the base $0.10/cluster-hour control-plane price; ends with the version's standard-support EOL, after which extended-support pricing kicks in automatically. | 14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md |
| EKS extended support window | The 12-month grace period AWS offers after standard-support EOL: the cluster keeps running and receiving security patches but is billed at $0.60/cluster-hour, a $0.50/cluster-hour surcharge over the standard rate (~$365/month per cluster), giving teams a hard deadline to upgrade before the version is forcibly retired. | 14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md |
| Blue-green cluster pattern | The N+1-version skip pattern: instead of in-place upgrading a single cluster through every minor (1.27 → 1.28 → 1.29 → 1.30), stand up a new cluster on the target version, validate workloads on it, shift traffic via DNS / load balancer, then tear down the old; the only safe path when you've fallen many versions behind. | 14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md |
| kubent (kube-no-trouble) | An open-source CLI that scans a live cluster (or a directory of manifests) for API resources whose apiVersion is deprecated or removed in upcoming Kubernetes minors; the pre-upgrade gate that catches "this still works on 1.28 but will break on 1.30" before the in-place bump. | 14-eks-in-production-a-to-z/02-eks-cluster-lifecycle.md |
| EKS managed addon | AWS's managed lifecycle for a cluster-critical addon (vpc-cni, kube-proxy, coredns, aws-ebs-csi-driver, aws-efs-csi-driver, etc.): AWS picks compatible versions per EKS minor, applies updates on request, and gates conflicts via the resolve_conflicts_on_create / _on_update policy — declared in Terraform as aws_eks_addon, not as a Helm release. | 14-eks-in-production-a-to-z/03-eks-addon-management.md |
| before_compute = true | The Terraform AWS EKS module flag that forces the VPC-CNI addon to install before the first managed node group's nodes register; without it, nodes can become Ready before the CNI is installed, Pods get assigned to nodes with no network, and the cluster spends an hour in CrashLoopBackOff. | 14-eks-in-production-a-to-z/03-eks-addon-management.md |
| resolve_conflicts_on_create / _on_update | EKS addon flags that decide what happens when an addon's config differs from AWS's defaults: OVERWRITE (AWS wins, replaces local edits) vs PRESERVE (local edits win); the difference between "kubectl edit daemonset aws-node survives the addon update" and "kubectl edit was silently overwritten on next reconcile". | 14-eks-in-production-a-to-z/03-eks-addon-management.md |
| gp3 default StorageClass | The production-grade EBS volume type that should replace EKS's gp2 default the day a cluster lands: 20% cheaper at the same baseline, IOPS and throughput tunable independently of volume size, no burst-credit footgun; the chapter ships the Terraform / kubectl patch that swaps the default annotation. | 14-eks-in-production-a-to-z/04-storage-classes-and-ebs.md |
| WaitForFirstConsumer | The StorageClass volumeBindingMode that defers PV provisioning until a Pod referencing the PVC is scheduled to a node; on EKS this keeps the EBS volume in the same AZ as the Pod, preventing the cross-AZ-mount failure mode of Immediate binding. | 14-eks-in-production-a-to-z/04-storage-classes-and-ebs.md |
| VolumeSnapshotContent (cluster-scoped) vs VolumeSnapshot (namespaced) | The Kubernetes snapshot API's two halves: VolumeSnapshot is the namespaced user-facing request ("snapshot my PVC"), and VolumeSnapshotContent is the cluster-scoped representation of the actual cloud snapshot — exactly the PV/PVC split, one level up. | 14-eks-in-production-a-to-z/04-storage-classes-and-ebs.md |
| AWS Budgets actions | The AWS Budgets feature that goes beyond email alerts: a breached budget can fire an SNS topic, attach a deny-IAM policy, stop EC2 instances, or trigger a Step Function — the only way to make a budget a hard guardrail instead of an after-the-fact alarm. | 14-eks-in-production-a-to-z/06-cost-guardrails.md |
| infracost | An open-source CLI that turns a terraform plan JSON into a per-resource USD cost estimate against current cloud pricing, and a GitHub Action that posts a diff comment on every Terraform PR — the CI-side guardrail that catches a 50-NAT-gateway accidental fan-out before it merges. | 14-eks-in-production-a-to-z/06-cost-guardrails.md |
| OIDC trust (GitHub Actions for AWS) | The OIDC federation pattern that lets a GitHub Actions workflow assume an AWS IAM role with no long-lived access keys: GitHub mints a short-lived OIDC token, AWS STS exchanges it for temporary credentials, and a trust policy on the role pins the GitHub org / repo / branch / environment that may assume it. | 14-eks-in-production-a-to-z/07-infrastructure-cicd-and-drift.md |
| driftctl | An open-source drift-detection CLI that compares Terraform state against the actual cloud inventory and flags unmanaged + drifted + missing resources; the cluster-wide drift answer when terraform plan -detailed-exitcode only sees the resources in this state file. | 14-eks-in-production-a-to-z/07-infrastructure-cicd-and-drift.md |
| Atlantis | An open-source Terraform CI server that runs terraform plan on every PR, posts the plan as a PR comment, and gates apply behind a /atlantis apply PR comment + branch protection — the self-hosted alternative to Terraform Cloud / Spacelift for plan-on-PR + apply-on-merge. | 14-eks-in-production-a-to-z/07-infrastructure-cicd-and-drift.md |
| VPC Gateway endpoint vs Interface endpoint | Two shapes of VPC endpoint: Gateway endpoints (S3, DynamoDB) are free, route through the VPC route table, and skip NAT entirely; Interface endpoints (ECR, STS, EC2, CloudWatch Logs, KMS, …), which are built on AWS PrivateLink, cost $0.01/hr per AZ + $0.01/GB processed and use private ENIs in your subnets. | 14-eks-in-production-a-to-z/08-vpc-endpoints-and-egress.md |
| Graviton (AWS arm64) | AWS's ARM-based EC2 instance line (c7g / m7g / r7g, and successors); typically ~20% cheaper than the equivalent x86 instance at the same SLA, identical from a Kubernetes perspective — provided every container image publishes an arm64 variant in its manifest. | 14-eks-in-production-a-to-z/09-arm-graviton-on-eks.md |
| Multi-arch container image (docker buildx --platform) | A single image tag whose OCI manifest is a list pointing at per-platform variants (linux/amd64, linux/arm64/v8); docker buildx build --platform linux/amd64,linux/arm64 produces it. The disciplinary requirement for Graviton + x86 in the same cluster. | 14-eks-in-production-a-to-z/09-arm-graviton-on-eks.md |
| App-of-Apps (Argo CD) | A bootstrap pattern where a single root Argo CD Application reconciles a directory of other Application manifests, each of which reconciles a real workload; the GitOps version of "an array of arrays" that lets one Application govern dozens of children. Also see 14 ch.10 for the EKS bootstrap path. | 07-delivery/04-gitops-argocd.md |
| Argo CD self-management | The GitOps loop where Argo CD's own manifests (Helm release values, projects, RBAC) live in Git and are reconciled by Argo CD itself; the second-stage payoff of the Terraform bootstrap — after terraform apply installs Argo CD once, the same Argo CD adopts itself and Terraform never touches it again. | 14-eks-in-production-a-to-z/10-gitops-bootstrap-fresh-cluster.md |
| Route 53 latency-based routing | The Route 53 routing policy that returns the record for the lowest-latency healthy region for each resolver (latency measured continuously by AWS); the DNS-side primitive of cloud active-active, with the TTL determining how quickly clients shift on a regional failure. | 14-eks-in-production-a-to-z/11-multi-region-active-active-cloud.md |
| AWS Global Accelerator | An AWS edge product that anycast-publishes two static IPs in the AWS global network and steers clients to the nearest healthy region via the AWS backbone; reduces failover time from DNS-TTL (60s) to ~30s and improves jitter, at $0.025/hr + transfer per accelerator. | 14-eks-in-production-a-to-z/11-multi-region-active-active-cloud.md |
| CNPG ReplicaCluster (cloud reality) | The cloud-deployed shape of Part 13 ch.03's pattern: CloudNativePG Cluster in region A acts as primary, ReplicaClusters in regions B + C stream from it over Transit Gateway peering, promotion is a controlled spec.replica.enabled: false toggle — measured here against real RTO / RPO numbers. Also see 13 ch.03 for the kind-local shape. | 14-eks-in-production-a-to-z/11-multi-region-active-active-cloud.md |
| Cosign keyless signing | The Sigstore signing path that uses OIDC short-lived certs (via Fulcio) and a transparency log (Rekor) instead of a long-lived signing keypair: the workflow's OIDC identity is the cert subject, the cert is good for ~10 minutes, the Rekor entry is permanent; no key to rotate, no key to lose. Also see 15 ch.03. | 14-eks-in-production-a-to-z/12-supply-chain-security.md |
| syft (SBOM generation) | An open-source CLI (Anchore) that scans a container image / filesystem / source tree and emits an SBOM in SPDX-JSON or CycloneDX-JSON; the SBOM is then bound to the image digest with cosign attest so admission can verify provenance + content together. | 14-eks-in-production-a-to-z/12-supply-chain-security.md |
| grype (CVE scanner) | The companion to syft (Anchore): takes an SBOM or an image and emits a CVE report scoped to actually-installed package versions; used in CI to fail builds when a Critical/High CVE crosses a policy threshold. | 14-eks-in-production-a-to-z/12-supply-chain-security.md |
| SLSA framework | The "Supply chain Levels for Software Artifacts" framework from Google / OpenSSF: four levels (L1–L4) in the original v0.1 spec (the v1.0 Build track defines L1–L3) of increasing guarantee that a build's source, build process, and provenance are unfalsifiable; L3 is the practical target for production CI/CD (signed provenance + hermetic build + isolated runner). Also see 15 ch.03. | 14-eks-in-production-a-to-z/12-supply-chain-security.md |
| ECR enhanced scanning | AWS ECR's premium scanning tier (Inspector-powered): continuous CVE scans of pushed images, OS + language-package coverage, results published to Inspector findings; $0.09/image/month, vs the free "basic" tier that scans once on push only. | 14-eks-in-production-a-to-z/12-supply-chain-security.md |
| Kyverno verifyImages | The Kyverno ClusterPolicy rule type that gates admission on cosign signatures: matches an image (registry / repo / tag glob) against an expected OIDC issuer + subject regex, rejects on mismatch; the production gate that turns "we sign images" into "unsigned images cannot run". | 14-eks-in-production-a-to-z/12-supply-chain-security.md |
| Falco (eBPF driver) | An open-source CNCF runtime-security tool: kernel-level system-call observer (modern eBPF driver replaces the old kernel module) + a YAML rules language (falco_rules.yaml) that fires on policy violations ("a shell spawned in a container", "writes to /etc/") with severity + tags. | 14-eks-in-production-a-to-z/13-runtime-defense-and-container-security.md |
| Tetragon | Cilium-project runtime-security tool: pure-eBPF, kernel-attached TracingPolicy CRD that filters and (optionally) enforces on syscall events; lower overhead than Falco for high-volume rule sets, and can block in-kernel rather than only log. | 14-eks-in-production-a-to-z/13-runtime-defense-and-container-security.md |
| GuardDuty for EKS (Audit + Runtime) | AWS's managed threat-detection for EKS clusters: Audit Log Monitoring ingests EKS control-plane audit logs and flags suspicious API patterns; Runtime Monitoring runs an in-cluster agent that observes process / file / network behaviour; both produce GuardDuty findings and are priced by usage (audit-log events analysed and monitored vCPUs) rather than per finding. | 14-eks-in-production-a-to-z/13-runtime-defense-and-container-security.md |
| Velero BSL / VSL / Schedule / Kopia | The four moving pieces of a Velero install: BSL (BackupStorageLocation, the object-store bucket for API-object dumps), VSL (VolumeSnapshotLocation, the cloud-snapshot configuration for PVs), Schedule (a CronJob-shaped Schedule CRD), and Kopia (the default content-addressable de-duplicating uploader). Also see 08 ch.02 for the foundational Velero concepts. | 14-eks-in-production-a-to-z/14-backup-and-restore-velero.md |
| Cilium native routing | Cilium's no-overlay mode where Pod traffic is routed directly through the VPC routing table (one IP per Pod, real VPC reachability) rather than via VXLAN encapsulation; the EKS-flavoured Cilium install when the goal is wire-speed and VPC-Flow-Logs visibility. | 14-eks-in-production-a-to-z/15-cilium-ebpf-on-eks.md |
| Hubble (Cilium observability) | The Cilium project's flow-level observability layer: every L3/L4 + L7 flow Cilium handles is exposed as a structured event (identity-aware, not just IP-aware) consumable by hubble observe, the Hubble UI, or Hubble's metric exporter; the visibility VPC Flow Logs cannot give you. | 14-eks-in-production-a-to-z/15-cilium-ebpf-on-eks.md |
| Telepresence (personal-intercept) | An open-source dev-loop tool: redirects a single deployment's traffic from a real cluster Pod to a process running on the developer's laptop ("personal intercept"); the developer debugs locally while the request still hits real cluster dependencies. | 14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md |
| Mirrord (mirror / steal modes) | A dev-loop tool from MetalBear: mirror mode copies traffic to a laptop process for read-only debugging (production-safe); steal mode redirects traffic for full request-response handling (interactive debugging); the more lightweight cousin of Telepresence. | 14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md |
| Skaffold (sync mode) | A Google dev-loop CLI: watches source, rebuilds + redeploys on change; sync mode copies edited files directly into a running container (skipping the Docker build entirely) for static-asset / interpreted-language inner loops. | 14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md |
| Tilt | A Tiltfile-driven dev-loop orchestrator (Starlark config); watches files, rebuilds + redeploys, and exposes a single dashboard with live logs, build status, and pod health across the whole micro-architecture; the multi-service complement to Skaffold's single-service focus. | 14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md |
| Devcontainer | The open Devcontainer Specification (Microsoft / VS Code): a devcontainer.json declaring the IDE-in-a-container — base image, features, post-create commands, port forwards — so every developer gets the same toolchain regardless of laptop OS; also runs in GitHub Codespaces. | 14-eks-in-production-a-to-z/16-developer-experience-for-k8s-teams.md |
| AWS Config conformance pack | An AWS Config feature that bundles a curated set of Config rules + remediation actions into a deployable YAML; the chapter ships an EKS-shaped pack covering encryption-at-rest, public-access prevention, IRSA-only workloads, and tag governance — the "audit baseline" you apply once per account. | 14-eks-in-production-a-to-z/17-cross-region-dr-account-baseline-90-day-runbook.md |
| IAM Access Analyzer | An AWS-managed service that continuously analyses IAM policies + resource policies for over-privileged access and external trust-policy exposure; the production-account guardrail that flags "this S3 bucket is public" or "this role can be assumed by an unknown account" before a breach. | 14-eks-in-production-a-to-z/17-cross-region-dr-account-baseline-90-day-runbook.md |
| 90-day production-readiness runbook | The Part 14 capstone artefact: a structured 13-week onboarding plan for a team taking over an EKS production platform, with weekly checkpoints (state hygiene → cluster lifecycle → addons → storage → cost → CI/CD → networking → Graviton → GitOps → multi-region → supply chain → runtime defense → backup → eBPF → DX → capstone deliverable), each tied to a concrete artefact in examples/bookstore-platform/. | 14-eks-in-production-a-to-z/17-cross-region-dr-account-baseline-90-day-runbook.md |
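
The arithmetic behind the EKS standard and extended support entries above can be checked with a few lines. This is a minimal sketch: it assumes a 730-hour average month and treats the $0.50 figure as a surcharge on top of the $0.10 base rate, which is the reading that matches the ~$365/month surcharge quoted in the table. The constant and function names are hypothetical.

```python
HOURS_PER_MONTH = 730          # assumption: average month length in hours
STANDARD_RATE = 0.10           # USD per cluster-hour, from the glossary entry
EXTENDED_SURCHARGE = 0.50      # USD per cluster-hour added in extended support


def monthly_control_plane_cost(clusters: int, in_extended_support: bool) -> float:
    """Estimated monthly EKS control-plane bill for a fleet of clusters."""
    rate = STANDARD_RATE + (EXTENDED_SURCHARGE if in_extended_support else 0.0)
    return round(clusters * rate * HOURS_PER_MONTH, 2)


def monthly_surcharge(clusters: int) -> float:
    """Extra monthly spend caused purely by staying on an extended-support version."""
    return round(clusters * EXTENDED_SURCHARGE * HOURS_PER_MONTH, 2)


if __name__ == "__main__":
    for n in (1, 3):
        print(n, "cluster(s):",
              "standard", monthly_control_plane_cost(n, False),
              "extended", monthly_control_plane_cost(n, True),
              "surcharge", monthly_surcharge(n))
```

Running the same numbers per fleet rather than per cluster is what usually makes the upgrade deadline concrete, since the surcharge scales linearly with cluster count.
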
Part 15 — Day-to-Day Production Operations terms
| Term | Definition | Covered in |
|---|---|---|
| PR-to-production lifecycle | The mental model Part 15 is built around: a change moves through eight stages (commit → PR → CI → review/merge → GitOps repo update → Argo CD reconcile → progressive rollout → SLO gate → done) and the load-bearing rule that every production change is a Git commit — nothing else. | 15-day-to-day-production-ops/01-pr-to-production-lifecycle.md |
| GitHub Actions OIDC + role-to-assume | The application-side use of OIDC trust: a GitHub Actions job sets permissions: id-token: write, calls aws-actions/configure-aws-credentials with role-to-assume: <ARN>, and gets short-lived STS credentials — no static AWS keys in repository secrets. Also see 14 ch.07. | 15-day-to-day-production-ops/02-application-cicd-pipelines.md |
| Branch-protection rules | GitHub repo settings that gate merges into a protected branch: require N approving reviews, require status checks, require linear history, dismiss stale reviews on push, require signed commits, disallow force-push; the merge-gate side of the CI discipline. | 15-day-to-day-production-ops/02-application-cicd-pipelines.md |
| Required status checks | The branch-protection subset that names the CI workflow jobs (lint, test, scan, build, sign) that MUST pass before a PR can merge; the Git-side enforcement of "tests are not optional", paired with required = true on each check. | 15-day-to-day-production-ops/02-application-cicd-pipelines.md |
| Cosign keyless via OIDC token | The CI-side application of keyless signing: a GitHub Actions job calls cosign sign --yes <IMAGE>, the workflow's OIDC token authenticates to Fulcio, the resulting cert lists the workflow's repo + branch + workflow-file path as subject; an admission policy can then accept only signatures from a specific workflow. Also see 14 ch.12. | 15-day-to-day-production-ops/03-image-signing-and-provenance.md |
| SLSA provenance attestation | A signed JSON document conforming to the SLSA provenance schema (subject digest + builder identity + build invocation + materials), produced by docker buildx build --provenance=true or by cosign attest --predicate provenance.json; consumed at admission to verify "this image was built by this workflow on this commit". Also see 14 ch.12. | 15-day-to-day-production-ops/03-image-signing-and-provenance.md |
| Multi-environment promotion (dev / staging / prod gates) | The dev → staging → prod pipeline as Git mechanics: three Kustomize overlays differ only in image-tag + secret-source + scale, promotion is a Git PR that bumps images[].newTag in the next environment's overlay, each environment gates on its own Argo CD sync + analysis run. | 15-day-to-day-production-ops/04-multi-environment-promotion.md |
| Argo CD ApplicationSet (Cluster generator) | The ApplicationSet generator that fans one template into N Applications — one per registered Argo CD cluster — with cluster labels driving overlay path / branch / target namespace; the production primitive for "same app, three environments" or "same app, N regions". Also see 11 ch.06. | 15-day-to-day-production-ops/04-multi-environment-promotion.md |
| Vault Kubernetes auth method | Vault's auth backend that trusts the cluster's ServiceAccount projected tokens: a Pod presents its projected token, Vault validates it through the cluster's TokenReview API, looks up the bound role, and issues a Vault token with attached policies — the production-grade alternative to AppRole or static tokens. Also see 11 ch.05. | 15-day-to-day-production-ops/05-production-secrets-vault-eso.md |
| Vault dynamic database secrets | A Vault secrets engine (database/) that mints a short-lived Postgres username + password on each request, with TTL + max-TTL; the app gets a fresh credential per lease, Vault revokes on TTL expiry, and a leaked credential becomes worthless within minutes. | 15-day-to-day-production-ops/05-production-secrets-vault-eso.md |
| External Secrets Operator (ESO) — production | The production deepening of ESO: a real HA Vault ClusterSecretStore, per-tenant ExternalSecret resources, refreshInterval tuned against Vault lease TTL, conflict policy + Helm-templated target.template so the resulting Kubernetes Secret carries app-shaped keys rather than raw Vault paths. Also see 11 ch.05. | 15-day-to-day-production-ops/05-production-secrets-vault-eso.md |
| Secret rotation (lease TTL / refresh interval) | The two-clock rotation discipline: Vault lease_ttl decides how often the source credential changes; ESO refreshInterval decides how often the Kubernetes Secret re-fetches; the two clocks must satisfy refreshInterval << lease_ttl so Pods always see a valid credential. | 15-day-to-day-production-ops/05-production-secrets-vault-eso.md |
| Argo Rollouts AnalysisTemplate (production SLO gate) | The CRD that defines reusable metric queries (Prometheus / Datadog / NewRelic / WebExpression) with success / failure thresholds; referenced by Rollout.spec.strategy.canary.analysis to gate promotion on real SLO metrics (success-rate, p99 latency, saturation) rather than wall-clock pauses. Also see 07 ch.05. | 15-day-to-day-production-ops/06-progressive-delivery-in-production.md |
| Argo Rollouts canary vs blue-green (production) | The two production rollout strategies with different traffic semantics: canary shifts a percentage at a time (works for idempotent stateless services); blue-green flips 100% on success (the only safe choice for stateful workloads where in-flight transactions can't be split across versions). | 15-day-to-day-production-ops/06-progressive-delivery-in-production.md |
| Argo Rollouts auto-rollback | The default Rollout behaviour when an AnalysisRun fails: the controller automatically aborts the rollout, shifts 100% of traffic back to the stable ReplicaSet, scales the new ReplicaSet to 0, and emits a RolloutAborted event — the production safety net that makes "deploy to prod" survivable. | 15-day-to-day-production-ops/06-progressive-delivery-in-production.md |
| Rollback layer matrix (code / data / config) | The decision matrix every production platform needs: code rollback (Argo CD revision pin, Argo Rollouts abort, Helm rollback) when the new binary is bad; data rollback (Postgres PITR, S3 versioning, Velero restore) when data was corrupted; config rollback (git revert an Argo CD Application's targetRevision) when a manifest change caused the symptom — picking the wrong layer makes the outage worse. | 15-day-to-day-production-ops/07-rollback-playbook.md |
| Postgres point-in-time recovery (PITR) | The Postgres recovery shape backed by continuous WAL archiving: every committed transaction's WAL segment ships to object storage, and a restore can roll forward to any recovery_target_time; CloudNativePG implements this as Cluster.spec.backup + Cluster.spec.bootstrap.recovery.recoveryTarget.targetTime. | 15-day-to-day-production-ops/07-rollback-playbook.md |
| S3 versioning rollback | The simplest data-layer rollback: an S3 bucket with versioning enabled retains every object version (and delete-marker); restoring is aws s3api copy-object of the previous version on top of the current key, or a bulk replay via S3 Inventory + Batch Operations. | 15-day-to-day-production-ops/07-rollback-playbook.md |
| Forward-compatible schema (rollback prerequisite) | The disciplinary requirement for safe code rollback: every database schema change ships in two phases — additive change first (new column / new table / nullable defaults) deployed and stabilized, then the code that uses it; rolling back the code leaves the schema valid because the old code never read the new column. | 15-day-to-day-production-ops/07-rollback-playbook.md |
| Feature flag (vs config) | A run-time boolean / variant decision served by a flag service, distinguishing two cohorts in one binary — vs an environment config value, which is applied at deploy time and applies to all traffic; the discipline that decouples deploy (binary lands) from release (feature becomes visible). | 15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md |
| OpenFeature | A CNCF-incubating vendor-neutral feature-flag SDK + spec: app code calls a single client.GetBooleanValue("flag", default, evalCtx) API; the actual provider (Flagsmith / LaunchDarkly / Unleash / GoFeatureFlag / in-memory) is configured by a Provider implementation injected at startup. | 15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md |
| Flagsmith / LaunchDarkly / Unleash (feature-flag providers) | Three production options behind OpenFeature: Flagsmith (open-source, self-hosted, the chapter default), LaunchDarkly (managed SaaS; lower ops cost, data-residency surcharge), Unleash (open-source self-hosted alternative with a stronger client-side SDK story); pick on hosting + data-residency + budget. | 15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md |
| Dark launch (deploy ≠ release) | The production shape where new code ships to production behind a flag with default-off; the binary is live but the feature is invisible, log-only, or restricted to internal traffic; the flag flip is the release event. | 15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md |
| Kill-switch flag | A boolean flag whose only purpose is to instantly disable a code path in production without a deploy; rolling the kill switch from true → false flips behaviour within the flag service's TTL — typically seconds — vs a deploy + rollout that takes minutes. | 15-day-to-day-production-ops/08-feature-flags-and-dark-launches.md |
| Hotfix workflow + breakglass | The emergency lane when normal CI/CD is too slow for a P0: branch-protection bypass via repo admin, CI fast-path that keeps scan but skips slow integration suites, a breakglass IAM role with full admin + 1-hour TTL + every action audited to CloudTrail, and the post-incident cleanup that rotates credentials + drift-checks Terraform. | 15-day-to-day-production-ops/09-hotfix-workflow-and-breakglass.md |
| Breakglass IAM role (time-limited admin) | An IAM role with full admin permissions but a 1-hour STS session TTL + an explicit assume-role trust policy requiring MFA + a CloudTrail alarm on every assume-event; assumed only during P0s, and the post-incident step rotates the role's keys + reviews every action it took. | 15-day-to-day-production-ops/09-hotfix-workflow-and-breakglass.md |
| Audit-log immutability (CloudTrail Stop/Delete denied) | A deny policy (an SCP at the org level, or an explicit deny in the production-account boundary) that forbids cloudtrail:StopLogging, cloudtrail:DeleteTrail, and s3:DeleteObject against the audit-log bucket — even for the breakglass admin role; an attacker with admin still cannot erase the trace. | 15-day-to-day-production-ops/09-hotfix-workflow-and-breakglass.md |
| Incident severity matrix (P0 / P1 / P2 / P3) | The triage scale every production team needs: P0 = user-visible outage / data loss (page everyone, war room); P1 = degraded experience for many users; P2 = single-service or low-impact; P3 = noise / known-issue / scheduled-fix; each tier maps to a different page policy, escalation cadence, and postmortem requirement. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| Mean time to acknowledge (MTTA) | The on-call metric measuring the time between a page firing and a human responding (typically ack-via-PagerDuty); the leading indicator of "are alerts paging the right person?" — target is single-digit minutes for P0, low double-digits for P1. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| Mean time to mitigate (MTTM) | The on-call metric measuring time between page-ack and the customer-visible symptom being mitigated (not necessarily root-caused); the practical SLA for "how long was the user impacted" — target depends on tier but is typically 15 min P0, 1 hour P1. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| Mean time to detect (MTTD) | The on-call metric measuring time between the real start of an incident (in the logs / metrics) and the page firing; the leading indicator of "are we monitoring the right thing?" — low MTTD requires user-facing SLO alerts, not just CPU / disk thresholds. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| 5 Whys analysis | The root-cause technique borrowed from Toyota: ask "why did that happen?" five times in sequence, each answer becoming the next question; the structured way to push past the proximate cause (the pod OOMed) to a contributing cause (the request payload grew 10x because a feature flag flipped) to a systemic cause (no canary on flag flips). | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| Postmortem deadline: 48h draft + 5-day publish | The deadline discipline that prevents postmortem rot: a draft (timeline + impact + action items) within 48h of the incident, a published + reviewed postmortem within 5 working days; missing either deadline triggers an escalation to the engineering manager rather than a slip. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| On-call handoff | The weekly handover ritual between primary on-call shifts: outgoing engineer walks incoming through (1) live incidents, (2) the page-volume last week, (3) the open-action-items dashboard, (4) any runbook gaps discovered; turns on-call from "luck of the draw" into a continuous improvement loop. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| Alertmanager inhibition rules (production) | Prometheus Alertmanager inhibit_rules that suppress noisy child alerts when a parent root-cause alert is firing (e.g. RegionDown inhibits every per-service alert in that region); the production-grade defence against pager storms during cluster-wide outages. Also see 13 ch.09. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| Runbook URL annotation (alert hygiene) | The mandatory annotations.runbook_url on every PrometheusRule alert: clicking it from PagerDuty / Slack opens the alert's specific runbook (symptoms → diagnosis → mitigation → owners). The hygiene rule: an alert without a runbook URL fails CI. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| War room (synchronous video bridge for P0) | The synchronous response artefact for P0 incidents: an always-open Zoom / Meet / Teams bridge in the on-call channel; the IC, SME, and comms-owner join immediately, and status is broadcast in chat to keep async stakeholders informed; the structure that turns three siloed Slack threads into one converging response. | 15-day-to-day-production-ops/10-incident-response-and-on-call.md |
| Cost review cadence (weekly Monday) | The fixed weekly slot when the platform team reviews last week's cost (OpenCost per-tenant + AWS Cost Explorer + savings-plan utilisation) and decides on follow-up actions; without a fixed cadence cost reviews drift into "we'll look next quarter" and the bill grows. | 15-day-to-day-production-ops/11-day-to-day-production-ops.md |
| Capacity review cadence (bi-weekly Friday) | The fixed bi-weekly slot for capacity decisions: nodepool sizes, Karpenter disruption.budgets, HPA min/max, PDBs, request-vs-limit drift; the forum where "we should bump the system NodePool" stops being a Slack message and becomes a tracked change. | 15-day-to-day-production-ops/11-day-to-day-production-ops.md |
| 90-day production-ownership runbook | The Part 15 capstone artefact: a structured 90-day plan for a team taking over a production Bookstore Platform v2 — weeks 1–4 (orient + on-call shadowing + the lifecycle), weeks 5–8 (own the change discipline: CI/CD + signing + rollback + flags + hotfix), weeks 9–12 (own production: incidents + cadence + the 90-day check-in), with explicit checkpoints + deliverables + readiness scorecard. | 15-day-to-day-production-ops/12-capstone-first-90-days.md |
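
The three incident metrics defined in the table above (MTTD, MTTA, MTTM) differ only in which pair of timestamps they subtract. The following is a minimal sketch of that arithmetic, not the guide's own tooling: the `Incident` record, its field names, and the `incident_metrics` helper are hypothetical, and a real pipeline would pull these timestamps from the paging tool and the postmortem timeline.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; field names are illustrative only.
    started: datetime       # real start of the incident, per logs / metrics
    paged: datetime         # the page fired
    acknowledged: datetime  # a human acknowledged the page
    mitigated: datetime     # customer-visible symptom mitigated


def mean_minutes(deltas):
    """Average an iterable of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents):
    """Return MTTD, MTTA, and MTTM in minutes for a non-empty list of incidents."""
    return {
        # MTTD: real start of the incident -> page firing
        "mttd": mean_minutes(i.paged - i.started for i in incidents),
        # MTTA: page firing -> human acknowledgement
        "mtta": mean_minutes(i.acknowledged - i.paged for i in incidents),
        # MTTM: acknowledgement -> symptom mitigated (not necessarily root-caused)
        "mttm": mean_minutes(i.mitigated - i.acknowledged for i in incidents),
    }


# Made-up sample data for illustration.
sample = [
    Incident(
        started=datetime(2024, 1, 8, 10, 0),
        paged=datetime(2024, 1, 8, 10, 4),
        acknowledged=datetime(2024, 1, 8, 10, 7),
        mitigated=datetime(2024, 1, 8, 10, 19),
    ),
    Incident(
        started=datetime(2024, 1, 9, 14, 0),
        paged=datetime(2024, 1, 9, 14, 10),
        acknowledged=datetime(2024, 1, 9, 14, 15),
        mitigated=datetime(2024, 1, 9, 14, 33),
    ),
]

print(incident_metrics(sample))  # {'mttd': 7.0, 'mtta': 4.0, 'mttm': 15.0}
```

Averaging across all severities hides the per-tier targets quoted above, so in practice the list would likely be filtered by severity (P0, P1, and so on) before calling the helper.
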
See also: Appendix A — kubectl cheatsheet for the commands behind these terms, Appendix C — YAML & API conventions for the API/SSA/deprecation mechanics, and the official Kubernetes glossary: https://kubernetes.io/docs/reference/glossary/.