14.07 — Infrastructure CI/CD + drift detection¶
Plan-on-PR, apply-on-merge via GitHub Actions with OIDC trust (no long-lived AWS access keys) is the production CI/CD shape every EKS Terraform tree should reach within its first month; the nightly
terraform plan -detailed-exitcodedrift check + the "someone changed it in the console" reconciliation runbook are what keep the state in code matching the state in cloud.
Estimated time: ~30 min read · ~90 min hands-on Prerequisites: Part 14 ch.01 — remote state with locking required for CI · Part 14 ch.02 — upgrade flow that CI must drive · Part 12 ch.01 — plan-on-PR pattern from app delivery
You'll know after this: • configure GitHub Actions with OIDC trust to AWS (no long-lived access keys) · • implement the plan-on-PR, apply-on-merge shape every EKS Terraform tree should reach within its first month · • run nightly terraform plan -detailed-exitcode drift checks and act on exit code 2 · • follow the reconciliation runbook when "someone changed it in the console" · • debug a stuck CI apply by inspecting the S3 lock and recovering safely
Why this exists¶
The bookstore-platform tree at
../examples/bookstore-platform/terraform/
gets applied two ways during development: make plan then make up
on a laptop, or the same flow via a CI workflow against a remote
state backend. Both work for "I am alone and learning"; neither
works once a team is involved. The laptop apply has at least five
durable failures:
- No audit trail. Who applied what when? The state file changes record the resource changes, not the human who initiated the apply.
- Concurrent apply collisions. Two engineers apply simultaneously;
one of them wins, the other gets stale plans. The S3 backend's
use_lockfile = true(Part 14 ch.01) prevents corruption but does not arbitrate intent. - No review. A
terraform planis the most consequential document a platform engineer produces — the actual list of cloud changes about to occur — and a laptop apply hides it from reviewers entirely. - AWS credentials on the laptop. Long-lived IAM access keys in
~/.aws/credentialson N engineers' machines is the canonical blast-radius problem: any one of those machines lost or compromised compromises the account. - No drift detection. Whatever the team applied two months ago is what's in the .tf; whatever someone hand-edited in the console last Tuesday is what's in AWS. They drift apart silently until the next apply blows up.
Phase 14-R shipped the production shape:
.github/workflows/terraform.yml
runs terraform plan on every PR (with the plan output posted as a
PR comment) and terraform apply on every merge to main. The
workflow uses GitHub OIDC trust
to assume an IAM role — no AWS access keys are ever stored in
GitHub Secrets. Drift detection ships under
drift-check/:
a nightly workflow runs terraform plan -detailed-exitcode and opens
a deduplicated GitHub issue on any non-zero diff.
This chapter walks the why and the wiring: why OIDC instead of long-lived keys, the five-pattern landscape (laptop / Jenkins / GitHub Actions / Atlantis / Spacelift), where each wins, what a "console emergency" looks like, and how to reconcile drift without making it worse.
Part 07 ch.03 introduced application-level CI/CD (build, test, push, deploy a microservice). Part 14 ch.01 (Phase 14a) established the remote-state backend that infrastructure CI/CD depends on. This chapter is the infrastructure CI/CD layer: how Terraform itself flows through PR → review → merge → apply, and how drift is policed.
In production: Long-lived AWS access keys in any CI system are the number-one preventable security incident. The fix is not rotation (still leaks; rotation is post-hoc). The fix is OIDC trust: GitHub/GitLab/CircleCI mint a short-lived (10-15 minute) JWT per workflow run; STS exchanges it for credentials that exist only for that run. No secrets to rotate, no keys to leak. The bookstore-platform's
.github/workflows/terraform.ymluses this from line 111-116 (theaws-actions/configure-aws- credentials@v4step).
Mental model¶
Infrastructure CI/CD is the application of the same merge-driven discipline that already governs application code to the Terraform tree: plan-on-PR (human review), apply-on-merge (automation), with OIDC for credentials (no long-lived secrets) and nightly drift detection (closing the console-edit loophole). The five-pattern landscape — laptop, Jenkins, GitHub Actions, Atlantis, Spacelift — maps to team size + control-plane preference.
The five-pattern landscape:
- Pattern 1 — Laptop apply. One human, one tree, one laptop.
Fine for the learning phase; fine for the "I am the platform"
one-person team. Fails on every dimension once two humans share
the tree. Use only for: bootstrap (Part 14 ch.01's
bootstrap- state.sh), demos, throwaway experiments. - Pattern 2 — Shared Jenkins / GitLab CI. Jenkins-server-on-EC2 with Terraform jobs. Was the dominant 2015-2020 pattern; still in many shops. Issues: Jenkins itself needs auth, scaling, and upgrade — it becomes another piece of infrastructure to maintain. AWS credentials typically live as Jenkins credentials (better than laptops, worse than OIDC). Use for: teams already running Jenkins for app CI/CD where adding Terraform jobs is incremental.
- Pattern 3 — GitHub Actions / GitLab CI with OIDC. Workflows in the same repo as the Terraform; runs on managed runners (or self-hosted GitHub-Actions runners on EC2 if regulatory). Credentials via OIDC (no static secrets). The default recommendation for greenfield; the Bookstore Platform ships this. Issues: GitHub Actions UI is not optimized for Terraform plan output (the plan-output-as-PR-comment trick mitigates).
- Pattern 4 — Atlantis. Self-hosted Terraform-aware webhook
server. The flow is comment-driven (
atlantis plan,atlantis applyposted in the PR). The killer feature: per-project workflow configuration viaatlantis.yaml— different terraform versions, different roles, different require-approval policies per subdirectory. Use for: monorepos with N Terraform projects each needing different governance. - Pattern 5 — Spacelift / Env0 / Terraform Cloud. Managed Terraform-as-a-Service. Policy-as-code, drift detection, run history, state management — all included. Most polished but also most expensive ($120-300/user/month at production scale). Use for: orgs that want to outsource Terraform infrastructure entirely; orgs with strict compliance that need the audit features the SaaS vendors provide.
The selection algorithm is roughly:
| Team size | Pick | Why |
|---|---|---|
| 1 dev | Laptop + remote state | Pattern overhead exceeds value at scale 1. |
| 2-5 devs | GitHub Actions + OIDC | Free, native to GitHub, OIDC closes the secret-leak risk. |
| 5-20 devs, one tree | GitHub Actions + Atlantis (optional) | Atlantis adds per-PR comment-driven discipline. |
| 5-20 devs, many trees | Atlantis or Spacelift | Per-project workflow becomes the value. |
| 20+ devs, regulated | Spacelift / Terraform Cloud | Audit + policy-as-code at scale. |
OIDC trust — the cryptographic exchange. A workflow runs in
GitHub Actions; GitHub Actions mints a short-lived JWT carrying
claims {repository: <ORG>/<REPO>, ref: refs/heads/main, ...}.
The workflow calls sts:AssumeRoleWithWebIdentity against an IAM
role; the role's trust policy validates the JWT's signature
(against GitHub's published JWKS endpoint) and checks the claims
against StringEquals conditions. If the claims match, STS
returns temporary credentials (default 1 hour). The credentials
exist only inside the workflow's environment; no human ever sees
them; nothing is stored.
Drift detection — the three patterns:
| Pattern | Tool | When |
|---|---|---|
Scheduled terraform plan -detailed-exitcode |
GitHub Actions cron / Jenkins | The simplest and most universal; the Bookstore Platform's drift-check/nightly-drift.yml ships this. |
driftctl |
OSS CLI, ex-CloudSkiff | Stronger because it scans the cloud for resources Terraform doesn't manage at all — catches the "someone created a resource by hand" case. |
| AWS Config + custom rules | AWS Config service | Cloud-native; useful when the org already uses Config for compliance; aligns drift with the compliance posture. |
Each catches a different drift class: plan -detailed-exitcode
catches attribute drift on managed resources; driftctl catches
unmanaged resources; AWS Config catches compliance violations
on either. A mature platform runs all three.
The trap to keep in view: applying through a drift fix without
understanding the cause makes things worse. The DRIFT.md
runbook's Step 3a-3d (Phase 14-R) covers this; the canonical
mistake is: drift detected → engineer panics → terraform apply
without reading → Terraform "fixes" the drift by reverting a
hand-edit that was actually intentional (the SG rule someone
opened to fix an outage). Three minutes of investigation before
apply prevents three hours of incident recovery.
Diagrams¶
Diagram A — the PR-to-apply flow with OIDC (Mermaid)¶
sequenceDiagram
participant Dev as Developer
participant GH as GitHub
participant Run as GitHub Actions runner
participant OIDC as token.actions.githubusercontent.com
participant STS as AWS STS
participant TF as terraform
participant AWS as AWS API
Dev->>GH: Open PR with .tf changes
GH->>Run: Trigger plan workflow
Run->>OIDC: Request OIDC token
(aud: sts.amazonaws.com)
OIDC-->>Run: Short-lived JWT
(claims: repo, ref, sha)
Run->>STS: AssumeRoleWithWebIdentity(jwt)
STS->>STS: Verify JWT signature
+ check trust policy claims
STS-->>Run: Temporary credentials
(60 min default)
Run->>TF: terraform plan
TF->>AWS: read-only API calls
AWS-->>TF: state
TF-->>Run: plan output
Run->>GH: Post plan as PR comment
+ upload tfplan artifact
Dev->>GH: Review plan, approve PR
GH->>Run: Trigger apply on merge
Run->>OIDC: New OIDC token
OIDC-->>Run: Short-lived JWT
Run->>STS: AssumeRoleWithWebIdentity(jwt)
STS-->>Run: Temporary credentials
Run->>TF: terraform apply tfplan
TF->>AWS: write API calls
AWS-->>TF: applied state
TF-->>Run: apply output
Diagram B — drift sources + detection (ASCII)¶
DRIFT SOURCE EXAMPLES CAUGHT BY
────────────────────────────── ───────────────────────────────────────── ──────────────────────────────
Console edit (human) SG rule opened to "fix prod" plan -detailed-exitcode (next run)
CloudTrail (who + when)
AWS auto-update EKS addon auto-upgrade plan -detailed-exitcode
(lifecycle ignore_changes is the fix)
Tool side-effect Karpenter desired_capacity on ASG plan -detailed-exitcode
(lifecycle ignore_changes)
Unmanaged resource IAM user created by an old script driftctl scan
(terraform won't see it)
Compliance violation S3 bucket made public via console AWS Config custom rule
(alerts immediately)
Out-of-band Terraform apply Someone applies from a different branch State lock + nightly plan
────────────────────────────── ───────────────────────────────────────── ──────────────────────────────
Order of defense:
1. Lock state (Part 14 ch.01) — prevents simultaneous out-of-band applies.
2. plan -detailed-exitcode nightly — catches attribute drift in 24h.
3. driftctl weekly — catches unmanaged-resource drift in 7d.
4. AWS Config rules continuously — catches compliance drift immediately.
Hands-on with the Bookstore Platform¶
0. Prerequisites¶
- A GitHub repository containing the bookstore-platform tree (or a fork).
- AWS account access; you can create an IAM identity provider + role.
- Repository admin permissions to set repository secrets + variables.
1. Create the OIDC identity provider in AWS¶
# One-time per AWS account. Creates the IAM identity provider that
# trusts GitHub's OIDC issuer.
aws iam create-open-id-connect-provider \
--url https://token.actions.githubusercontent.com \
--client-id-list sts.amazonaws.com \
--thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1
# Verify it landed.
aws iam list-open-id-connect-providers
The thumbprint is the SHA-1 of GitHub's TLS CA certificate. AWS used to require it; modern AWS regions skip the validation if the issuer URL is known. The value above is the documented one (rotates infrequently); if AWS errors out, fetch the current thumbprint from the AWS GitHub OIDC docs.
2. Create the IAM role + trust policy¶
# Trust policy — only allows the role to be assumed from the main
# branch of your specific repo.
cat <<EOF > trust-policy.json
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::<AWS_ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
"token.actions.githubusercontent.com:sub": "repo:<GITHUB_ORG>/<GITHUB_REPO>:ref:refs/heads/main"
}
}
}]
}
EOF
# Create the role.
aws iam create-role \
--role-name github-actions-terraform \
--assume-role-policy-document file://trust-policy.json
# Attach AdministratorAccess (broad — narrow for production).
aws iam attach-role-policy \
--role-name github-actions-terraform \
--policy-arn arn:aws:iam::aws:policy/AdministratorAccess
# Output the role ARN.
aws iam get-role --role-name github-actions-terraform \
--query 'Role.Arn' --output text
The trust-policy condition above pins to one branch (main) of
one repo. Production teams that deploy from multiple environments
use ForAnyValue:StringEquals with a list of exact refs — see the
header comments in
.github/workflows/terraform.yml
for the canonical pattern. Avoid the broad StringLike
"repo:ORG/REPO:*" pattern in any role with cluster-admin reach — it
lets any branch (including PR branches that external contributors
can open) assume the role.
3. Configure the GitHub repo¶
# Set the secret (the role ARN you just printed).
gh secret set AWS_ROLE_ARN -b "arn:aws:iam::<AWS_ACCOUNT_ID>:role/github-actions-terraform"
# Set the variable (region).
gh variable set AWS_REGION -b "us-east-1"
The workflow at
.github/workflows/terraform.yml
already references secrets.AWS_ROLE_ARN and vars.AWS_REGION. Once
both are set, the workflow runs.
4. Trigger the plan job¶
# Make a trivial change to the Terraform tree.
cd examples/bookstore-platform/terraform
git checkout -b test/cicd-wiring
# Bump a tag, for example, in locals.tf.
# (use your editor; this is just to produce a non-trivial diff)
git commit -am "test: trigger plan workflow"
git push origin test/cicd-wiring
# Open a PR via gh:
gh pr create --title "test cicd wiring" --body "smoke test"
In the GitHub Actions tab, the plan job runs. Inside the job:
Configure AWS credentials via OIDCexchanges the GitHub OIDC token for STS credentials.terraform planruns against the actual cloud.- The plan is uploaded as a workflow artifact.
- (Optional, with an extra step like
infracost comment) the plan output is posted as a PR comment.
The PR's "Checks" tab shows the plan green or red. A reviewer reads the plan before approving — that is the entire point.
5. Merge the PR, watch the apply¶
After approval and merge:
# The apply job runs on push to main. Watch it.
gh run watch
The apply job:
- Configures AWS credentials via OIDC (same dance as plan).
- Downloads the plan artifact from the merge commit's predecessor PR.
- Runs
terraform apply tfplanagainst that exact reviewed plan.
The key safety invariant: what merges is what was reviewed. The
plan binary uploaded during the PR is the same plan applied at
merge time; no re-planning, no opportunity for the cloud to have
drifted between review and apply. (Read the apply job's caveat
comments in the workflow file — squash-merge breaks the artifact-
lookup-by-PR-number trick, in which case the workflow falls back
to re-planning, which is the worse-but-functional option.)
6. Configure the nightly drift check¶
The drift-check workflow at
drift-check/nightly-drift.yml
is shipped under the drift-check/ subdirectory because it's a
template — copy it to your repo's .github/workflows/ directory:
# In YOUR repo (not the bookstore-platform sample):
cp examples/bookstore-platform/terraform/drift-check/nightly-drift.yml \
.github/workflows/
# Adjust the WORKING_DIR env to point at your tree.
# Commit + push.
The workflow runs at 03:00 UTC daily (configurable cron); the trigger logic:
terraform plan -detailed-exitcodereturns:0— no drift; the workflow exits cleanly.1— plan errored (auth, syntax, network) — different problem; workflow fails so on-call notices.2— drift detected; the workflow proceeds.
On exit code 2, the workflow uses actions/github-script@v7 to
either:
- Open a new GitHub issue labeled
drift-detected, OR - Append a dated comment to an existing open issue with that label (deduplication — keeps the issue list signal-to-noise high).
The issue body includes the truncated plan output (60 KB cap to fit
in a GitHub issue body) and a link to
drift-check/DRIFT.md
— the runbook for what to do next.
7. Simulate drift and walk the runbook¶
# In the AWS console: tag a resource that Terraform manages with a
# tag Terraform doesn't know about. Easiest target: the cluster
# Security Group, in EC2 -> Security Groups.
# Add tag: Drift = test
# (or any tag the Terraform doesn't set)
# Next morning, the nightly-drift workflow runs and creates an issue:
# Title: "Drift detected on 2026-05-22"
# Body: includes the plan output showing the tag drift.
# Investigate per DRIFT.md Step 1-2.
cd examples/bookstore-platform/terraform
terraform plan -detailed-exitcode
# Exit code: 2 (drift)
# Look at the diff. The tag change is in the plan output.
# Step 2: CloudTrail — who added the tag?
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=<SECURITY_GROUP_ID> \
--max-results 5 --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)"
# Step 3a: the change is good (it's a benign tag). Roll forward.
# Edit vpc.tf to add the tag to the security_group_tags block.
# terraform plan now shows zero diff.
# Commit + open PR + merge through CI as normal.
The runbook's full Step 1-4 (the four flavors of reconciliation) is in DRIFT.md; this section is the smoke test that the workflow + the runbook + the team are wired together.
8. (Optional) Add an Atlantis variant¶
The bookstore-platform ships an
atlantis.yaml.example
showing the same drift pattern via Atlantis. The two workflows are
mutually exclusive in practice — pick one. The reasons to add
Atlantis:
- Multiple Terraform projects in one monorepo, each needing different governance (different terraform versions, different roles, different approval policies).
- Comment-driven workflow (
atlantis plan,atlantis applyposted as PR comments) feels more natural to the team than the GitHub-Actions-status-check pattern. - You already run Atlantis for application Terraform; adding the bookstore tree is incremental.
If GitHub Actions + OIDC is meeting your needs, don't add Atlantis. The extra moving piece (you now run Atlantis itself — it's a webhook server on a Kubernetes Deployment, probably on the cluster it manages) earns its keep only at scale.
9. (Optional) Run driftctl weekly¶
driftctl catches a class plan -detailed-exitcode misses:
resources in the cloud that Terraform doesn't manage at all. The
classic example: an IAM user created by an old console script, an
S3 bucket spun up "temporarily" two years ago. Terraform's plan
ignores anything not in its state; driftctl scans the cloud and
reports.
# Install driftctl
brew install driftctl
# Or: curl -L https://github.com/snyk/driftctl/releases/latest/download/driftctl_darwin_amd64 -o driftctl && chmod +x driftctl
# Scan against the bookstore-platform state.
driftctl scan \
--from tfstate+s3://<TFSTATE_BUCKET_NAME>/bookstore-platform/terraform.tfstate \
--output 'json://drift.json' \
--quiet
# Result: a JSON report with three categories — managed/in-sync,
# managed/drifted, unmanaged. The "unmanaged" list is the audit.
jq '.summary' drift.json
# {
# "total_resources": 124,
# "total_changed": 0,
# "total_unmanaged": 7, # <-- what driftctl finds, plan misses
# "total_missing": 0
# }
A weekly CI job that runs driftctl scan and posts the unmanaged
list to a Slack channel is the durable defense. The drift-check/
subdirectory ships a stub for this; uncomment and configure for
your repo.
How it works under the hood¶
OIDC — the cryptographic exchange in detail. Three trust steps:
-
GitHub mints a JWT. The token's payload carries claims:
The JWT is signed with GitHub's private key; the public key is published at{ "iss": "https://token.actions.githubusercontent.com", "aud": "sts.amazonaws.com", "sub": "repo:<GITHUB_ORG>/<GITHUB_REPO>:ref:refs/heads/main", "repository": "<GITHUB_ORG>/<GITHUB_REPO>", "ref": "refs/heads/main", "sha": "abc123...", "actor": "<USERNAME>", "iat": 1684567890, "exp": 1684568790 }https://token.actions.githubusercontent.com/.well-known/jwks.json. -
STS validates the JWT. The IAM role's trust policy declares GitHub's OIDC provider as a trusted federation source. STS:
- Fetches GitHub's JWKS endpoint.
- Verifies the JWT signature against the JWKS public keys.
- Checks the JWT's
audclaim against the role's allowed audience. - Checks the JWT's
subclaim against the role'sStringEqualscondition. -
If all checks pass, calls
AssumeRoleWithWebIdentityinternally. -
STS returns short-lived credentials. Default lifetime: 1 hour. The credentials are temporary —
AccessKeyId+SecretAccessKey SessionToken. The workflow uses them via the standard AWS SDK env-var mechanism (theconfigure-aws-credentialsaction setsAWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,AWS_SESSION_TOKEN, plusAWS_DEFAULT_REGION).
The "no static secret" win: at no point does an access key live outside the workflow's in-memory environment. The runner is ephemeral; the credentials expire in an hour; if a workflow is compromised, the damage is bounded to that workflow's session.
The plan-artifact handoff between jobs. The PR's plan job
serializes the binary plan (terraform plan -out=tfplan) and
uploads it to GitHub's artifact storage (actions/upload-artifact@v4).
On merge, the apply job downloads the artifact (actions/download-
artifact@v4) and applies it directly: terraform apply -input=false
-auto-approve tfplan. This achieves exact-plan-as-reviewed
semantics: the cloud state that the reviewer saw cannot drift
between review and apply.
The catch (called out in the workflow's apply-job comments):
actions/download-artifact needs to know which workflow run
produced the artifact. The simple "match by name and most recent"
fails for squash-merged PRs (the merge commit isn't directly
linked to the PR). Three options:
- Use merge commits (not squash) — preserves the PR number on the merge SHA.
- Use the pull_request
closed: merged: trueevent instead ofpush: branches: [main]— exposes the PR number on the apply trigger. - Fall back to re-plan on apply — the workflow's current behavior if the artifact isn't found. Functional but loses the exact-plan-as-reviewed guarantee.
terraform plan -detailed-exitcode semantics. Three exit codes:
0— no changes (the plan would be a no-op).1— error (auth failure, syntax error, network issue).2— changes present (Terraform would make changes).
The drift-check workflow uses exit code 2 as the drift signal. Note that "would make changes" is the union of:
- Changes you intended (someone pushed a tf change that's not applied yet — usually this means the apply job failed previously).
- Drift from the cloud-side (someone edited via console).
The drift workflow doesn't distinguish; both cases warrant investigation. Reading the actual plan output (in the issue body) clarifies which.
Atlantis vs GitHub Actions — the architectural difference.
Atlantis is a stateful webhook server — it maintains its own
workspace per project per branch, lock-contention coordination,
and atlantis.yaml-driven workflow definitions. GitHub Actions is
stateless — each workflow run starts fresh, builds its working
directory from the checkout. Atlantis's stateful approach offers
better incremental performance (cached .terraform/ directory)
and richer per-project workflow customization; GitHub Actions's
statelessness is simpler and free.
For teams running 1-5 Terraform projects: GitHub Actions wins on ops simplicity. For 20+ projects: Atlantis's per-project workflow customization pays for itself.
Spacelift / Terraform Cloud — the SaaS variant. Both are "Terraform as a managed service": you push code, they run the plan, they offer a UI to review and approve, they apply on approval. The appeal: drift detection, run history, policy-as-code (Sentinel / OPA), state encryption, audit logs — all included. The cost: $10-50 per user/month for Terraform Cloud's "Standard" tier; $30-100+ for Spacelift's commercial tiers. At 20+ engineers, the per-user fee becomes material; at 5 engineers, it's noise compared to engineering time saved.
Production notes¶
In production: Pin the trust-policy
subclaim to exact branch refs, neverStringLikewildcards on a role with cluster-admin reach. The header comments in.github/workflows/terraform.ymlspell out theForAnyValue:StringEqualspattern for multi- environment trust. AStringLike "repo:ORG/REPO:*"lets any branch — including a PR branch a contributor opens against your repo — assume the role. The blast radius is the role's IAM permissions; for an Administrator-level role, the radius is the entire account.In production: Use environments + required reviewers for the apply step. The workflow's apply job has a commented-out
environment: productionline. Uncomment it, configure required reviewers in the repo's environment settings, and an apply requires explicit human approval after merge. The default apply-on-merge is fine for sandboxes; production-blast-radius applies need the second human eye.In production: The squash-merge / artifact-lookup interaction bites. The workflow's apply step tries to download the plan artifact via PR number; squash-merge doesn't preserve the PR number on the merge SHA. The fix is either (a) use merge commits not squash, or (b) re-plan on apply (the current fallback, functional but loses the exact-plan-as-reviewed property). The work continues — open question, the current code uses (b).
In production: Don't put long-lived credentials in GitHub Secrets, period. The number of leaks where a
AWS_ACCESS_KEY_IDwas logged viaechoin a workflow step, posted to the workflow log (which is sometimes public for OSS repos), and exfiltrated in minutes is not small. OIDC removes this entire class of incident. If your repo doesn't support OIDC (older GitHub Enterprise, weird forks), use the lowest-permission service-account-pattern: a per-environment IAM user with resource-tag-scoped policies, rotated weekly via a cron.In production: Drift detection runs against a production Terraform state, not a dev one. A staging-cluster drift is usually fine to ignore for a day; a production-cluster drift isn't. Configure the drift workflow to fail loudly (Slack + page + on-call ack) only when drift is on the production tree; the dev workflow can file the issue without paging.
In production: The
lifecycle.ignore_changesknob is a double-edged sword. It silences known false-positive drift (EKS auto-updating addon versions, Karpenter manipulating ASG capacity), but it also silences real drift on the same attributes. Thedrift-check/DRIFT.mdrunbook's "Common ignore_changes patterns" section lists the canonical ones; resist adding more without a real drift example justifying it. An over-appliedignore_changesis how teams discover their CMK rotation was hand-disabled three months ago.In production: State backups before destructive applies. The S3 backend's versioning (Part 14 ch.01) is automatic, but for a high-risk apply (a migration that touches the EKS cluster resource itself, for example), explicitly snapshot:
terraform state pull > tfstate.<DATE>.jsonbefore running the apply. The cost is 30 seconds; the recovery time from a botched state file is measured in hours-to-days.
Quick Reference¶
# OIDC provider + role setup (one-time per account).
aws iam create-open-id-connect-provider \
--url https://token.actions.githubusercontent.com \
--client-id-list sts.amazonaws.com \
--thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1
aws iam create-role --role-name github-actions-terraform \
--assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name github-actions-terraform \
--policy-arn arn:aws:iam::aws:policy/AdministratorAccess
# Repo secrets + variables.
gh secret set AWS_ROLE_ARN -b "arn:aws:iam::<AWS_ACCOUNT_ID>:role/github-actions-terraform"
gh variable set AWS_REGION -b "us-east-1"
# Trigger the drift check manually.
gh workflow run terraform-drift-check
# Run terraform plan locally with the drift exit codes.
terraform plan -detailed-exitcode
# Exit 0=no diff, 1=error, 2=drift detected.
# driftctl scan against remote state.
driftctl scan \
--from tfstate+s3://<TFSTATE_BUCKET_NAME>/bookstore-platform/terraform.tfstate
Minimal skeleton — the GitHub Actions plan-on-PR job:
# .github/workflows/terraform.yml — excerpt
name: terraform
on:
pull_request:
paths: ['examples/bookstore-platform/terraform/**']
permissions:
id-token: write # required for OIDC
contents: read
jobs:
plan:
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with: { terraform_version: '<TF_VERSION>' }
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ${{ vars.AWS_REGION }}
audience: sts.amazonaws.com
- run: terraform init -input=false
- run: terraform plan -input=false -no-color -out=tfplan
- uses: actions/upload-artifact@v4
with: { name: tfplan, path: tfplan, retention-days: 7 }
CI/CD-readiness checklist (production-shape when all six are yes):
- AWS OIDC identity provider created; trust policy pinned to
exact branch refs; no
StringLikewildcards on admin roles. - Plan-on-PR workflow posts plan output as a PR comment; reviewers can read it before approving.
- Apply-on-merge workflow uses the reviewed plan artifact (not re-plan), with a required-reviewer environment gate.
- Nightly drift workflow runs
terraform plan -detailed-exitcodeand opens/dedups a GitHub issue on exit code 2. - Weekly
driftctlscan catches unmanaged-resource drift. - DRIFT.md runbook linked from the drift issue body; on-call has practiced the four reconciliation flavors.
Test your understanding¶
Try each before opening the answer drawer. The act of trying is the exercise; the answer is the check.
-
Why does the chapter argue OIDC trust replaces (rather than improves) long-lived AWS access keys in CI?
Show answer
Long-lived access keys are a credential that exists *outside the run* — stored in GitHub Secrets, copied to dev laptops, sometimes screenshot accidentally. Rotation is post-hoc: by the time you rotate, the leaked key might already have been used. OIDC trust mints a short-lived JWT *per workflow run*, exchanged for STS credentials that live ~1 hour and only inside the runner's environment. There's nothing static to rotate, nothing to leak. The IAM role's trust policy uses `StringEquals` against the GitHub `sub` claim (repo + branch), so even if a fork tries to assume the role, the trust policy rejects it. -
A nightly drift check fires: 2 resources have drifted. The issue body shows that someone widened the EKS API server CIDR in the AWS console at 14:30 yesterday. What are the four reconciliation flavors the chapter teaches and which do you pick here?
Show answer
The four flavors: (1) **Revert in cloud** — undo the console change to match code; (2) **Codify in Terraform** — accept the change and PR it into `.tf`; (3) **Document the exception** — accept the divergence with explicit reasoning (rare); (4) **Both/neither** — case-by-case. For this incident: ask why it happened. If it was an emergency response to a P0 (the operator widened CIDR to debug from a new IP), codify it in a PR with the runbook reference. If it was someone exploring the console without context, revert immediately and rotate the IAM role's permissions to remove console-write where it's not needed. -
You inherit a Terraform tree where
applyruns from a Jenkins server that holds AWS access keys in Jenkins credentials. The org wants to migrate to GitHub Actions + OIDC. What's the safe migration order?
Show answer
(1) Create the OIDC identity provider + trust policy in AWS, pinned to specific branches; (2) Add the new GitHub Actions plan workflow running alongside Jenkins (plan only, no apply); (3) Validate the plan output matches Jenkins's plan for a few PRs; (4) Cutover apply: disable Jenkins apply, enable GHA apply on merge; (5) Wait a week of stable applies; (6) Delete the IAM user holding Jenkins access keys + delete the keys themselves. Skipping step 5 is how teams discover edge cases (a workflow that needs a permission the Jenkins user had but the OIDC role doesn't); keeping Jenkins's keys around "just in case" defeats the migration's security gain. -
Hands-on extension — manually create an S3 bucket in the AWS console outside Terraform, then run
terraform plan -detailed-exitcodevsdriftctl scan. Compare what each finds.
What you should see
`terraform plan -detailed-exitcode` returns 0 (no drift) — it only checks resources Terraform *manages*; an unmanaged bucket is invisible. `driftctl scan` returns a list including the unmanaged bucket because it scans the cloud for everything in scope and compares to state. This is why the chapter recommends both: plan catches attribute drift on managed resources, driftctl catches unmanaged-resource drift. Either alone leaves a class of "someone made it in the console" hidden.
Further reading¶
- GitHub OIDC for AWS — official docs https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services; the canonical guide to the trust-policy pattern this chapter implements.
aws-actions/configure-aws-credentialshttps://github.com/aws-actions/configure-aws-credentials; the GitHub Action that performs the STS exchange; supports IAM roles via OIDC, SAML, and the legacy static-key path.- Atlantis docs https://www.runatlantis.io/docs/; the per-project workflow + the comment-driven plan/apply pattern; the canonical alternative to Pattern 3 for teams with N Terraform projects.
- driftctl
https://github.com/snyk/driftctl; the unmanaged-resource scanner;
the complement to
terraform plan -detailed-exitcodethat catches drift the plan can't see. - HashiCorp — Terraform Cloud https://developer.hashicorp.com/terraform/cloud-docs; the SaaS variant of Pattern 5, including the run-history + policy-as-code features.
- Spacelift docs https://docs.spacelift.io/; the commercial SaaS variant covering drift detection, OPA policy enforcement, and per-stack workflow.
- AWS — IAM roles for service accounts vs OIDC for CI https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_oidc.html; the underlying IAM federation mechanism that OIDC builds on.