14.05 — Logging & metrics cost discipline¶

CloudWatch Logs charges $0.50/GB ingested plus $0.03/GB-month stored, and a single chatty controller emitting debug logs into a log group with no retention cap is the canonical "the cluster cost more than the application this month" incident; the discipline is retention caps, cardinality control, Loki-vs-CloudWatch trade-offs, and the three cost levers (sample, scrub, retain) every platform needs to know about before the bill arrives.

Estimated time: ~30 min read · ~60 min hands-on Prerequisites: Part 14 ch.02 — EKS control-plane log streams enabled on the cluster · Part 10 ch.05 — CloudWatch + Loki wiring you now must cost-tune · Part 11 ch.04 — generic observability cost framing

You'll know after this: • understand CloudWatch Logs pricing ($0.50/GB ingest + $0.03/GB-month store) and where the runaway-cost incident lives · • set retention caps on every log group, including EKS control-plane streams · • choose between CloudWatch vs Loki for app logs based on access pattern · • apply the three cost levers (sample, scrub, retain) before the bill arrives · • debug metrics-cardinality blowups (label explosions, Prometheus series spikes) in production

Why this exists¶

The bookstore-platform tree at ../examples/bookstore-platform/terraform/ turns on five EKS control-plane log streams in eks.tf (api, audit, authenticator, controllerManager, scheduler) and the AWS Load Balancer Controller + Karpenter Helm releases each create their own application log groups. On a quiet day this generates a few hundred megabytes; on a busy day with a misbehaving controller in a hot reconcile loop, a single log group can ingest 5-10 GB in 24 hours.

Run that for a year with no retention cap and the math is unforgiving:

5 GB/day ingestion at $0.50/GB = $2.50/day = $913/year in ingestion charges per log group.
Stored forever at $0.03/GB-month with 5 GB/day accumulation = 1.8 TB after one year = $54/month and rising in storage charges per log group.

Multiply by the ~10 log groups an EKS-deployed platform typically creates (control plane + LB controller + Karpenter + CoreDNS + each managed addon + application namespaces shipped to CloudWatch via Fluent Bit) and "CloudWatch ate my budget" stops being a meme. It is the most common AWS bill-shock the bookstore platform owners hit between cluster-up and the first finance review.

Phase 14-R closed the obvious gap: eks.tf line 51 sets cloudwatch_log_group_retention_in_days = var.cloudwatch_log_group_retention_in_days (default 30 days) on the EKS module, which propagates retention to /aws/eks/<CLUSTER>/cluster. That single variable converts the unbounded-growth scenario to a bounded one: at 30-day retention, the storage line stops growing after a month and stabilizes at 5 GB/day x 30 days x $0.03/GB-month = $4.50/month per log group — real money but predictable.

But retention is only the first lever. The other two — sampling (drop high-volume low-value logs at the source) and scrubbing (strip PII before it crosses an ingestion boundary) — are what turn a $1000/month observability bill into a $200/month one. This chapter is the lever-by-lever playbook against the bookstore-platform tree.

Part 06 ch.02 introduced Prometheus + Grafana + Loki as the in-cluster observability stack; the lesson there was architecture (the three pillars, recording rules, label discipline). This chapter is the EKS-specific cost overlay: which logs are worth keeping, which metrics are worth scraping, and where the bill explodes if you don't watch the cardinality.

In production: Every log group an EKS cluster creates must have a retention cap set within the first 24 hours of the cluster's life. A log group with no retention_in_days set on the AWS side defaults to Never expire — and AWS has no "delete after a year" policy on top of that. A team that creates 10 unbounded log groups in January discovers in December that they've paid $600 in storage for logs nobody has read since February. The bookstore-platform tree's cloudwatch_log_group_retention_in_days = 30 is the opinionated default; tighten to 7 for dev clusters, loosen to 90 or 365 for compliance audits.

Mental model¶

Observability cost = (ingestion volume x ingestion price) + (retained volume x storage price) + (metrics cardinality x storage price). Three levers reduce each term: sample at the source, scrub at the boundary, cap retention at the sink. Choose the right tool per workload class — CloudWatch for the control plane, Loki/Mimir for the application plane, Prometheus for metrics with disciplined cardinality.

The four dimensions of observability cost on EKS:

CloudWatch Logs — the EKS default, the bill default. AWS-managed, no operational burden, but priced for managed convenience: $0.50/GB ingested + $0.03/GB-month stored. Every EKS control-plane log stream (api, audit, etc.) lands in CloudWatch by default. The audit log alone can hit several GB/day on a cluster with chatty operators. Use for: control plane, compliance retention, occasional debug. Don't use for: every Pod's stdout (too expensive at scale).
Loki — the in-cluster cost-efficient alternative. Self-hosted, S3-backed object store, index-on-label (not on log content), significantly cheaper at scale (single-digit cents per GB stored on S3 vs $0.03/GB-month CloudWatch storage), but operationally heavier (you run the Loki Helm release; you manage compaction; you set up Grafana dashboards). Break-even is around 100 GB/month of ingested logs — below that, CloudWatch's managed convenience wins; above that, Loki pays for itself within a quarter.
Prometheus + cardinality. Metrics are cheap per scrape, but cardinality (the number of unique label combinations) is the cost trap. A metric http_requests_total with three labels (5 methods x 100 paths x 10 status codes) has 5,000 series; adding a user_id label with 10,000 unique users makes it 50 million series — a single chart-level metric that fills a Prometheus instance's TSDB in minutes. The discipline: low-cardinality labels on every metric (request method, response status, route template — not user IDs, request IDs, full URL paths).
The three cost levers. In rough order of leverage:
Sample at the source. Drop debug logs in production; emit INFO and above. For metrics, scrape at 30s not 5s for non-SLO metrics. For traces, sample 1% of traffic + 100% of errors. Reduces ingestion 10-100x.
Scrub at the boundary. Strip PII, drop high-cardinality fields, redact secrets. A Fluent Bit filter that drops the authorization header out of every request log costs nothing and cuts the compliance-audit-scope drastically.
Cap retention at the sink. 30 days for operational, 90 for compliance, 7 for dev. The retention cap is the cheapest lever (one Terraform variable) and the highest dollar-impact for retained storage; it does not reduce ingestion cost.

CloudWatch vs Loki — the break-even chart. A simple decision chart for the application logging pipeline:

Volume	Pick	Why
< 10 GB/month	CloudWatch	Managed; the ops time saved exceeds the cost differential.
10-100 GB/month	Either	Decision is operational maturity, not cost.
> 100 GB/month	Loki on S3	Cost differential exceeds Loki operational overhead.
Compliance-bound logs (audit)	CloudWatch	The 7-year retention story is easier; AWS handles the lifecycle.
Bursty application stdout	Loki	The cost-spike is bounded by your S3 storage class.

The trap to keep in view: the audit log goes to CloudWatch forever unless you cap it. The EKS audit log stream captures every API request — kubectl get pods from a developer, every controller's reconcile-loop list-watch, every Argo CD sync. A busy cluster can emit 1-2 GB/day of audit. At default unbounded retention, that's 30 GB after a month, 365 GB after a year, accumulating forever. The 30-day cap turns this into a manageable line item; extend to 90 or 365 only when the compliance team has signed off on the bill.

Diagrams¶

Diagram A — the cost flow from controller stdout to AWS bill (Mermaid)¶

flowchart LR
    subgraph cluster["EKS cluster"]
      cp["Control plane
api / audit / authenticator
controllerManager / scheduler"]
      app["Application Pods
stdout/stderr"]
      ctrl["Controllers
(LB controller / Karpenter)
own log groups"]
      fb["Fluent Bit DaemonSet
sample + scrub at source"]
    end

    subgraph cw["CloudWatch Logs"]
      cwcp["/aws/eks//cluster
retention 30d"]
      cwctrl["/aws/lb-controller
/aws/karpenter
retention 30d"]
      cwapp["/aws/eks//app
retention 7d"]
    end

    subgraph loki["Loki (in-cluster)"]
      ls["S3 backend
$0.023/GB-month
lifecycle to IA after 30d"]
    end

    subgraph bill["AWS bill"]
      ing["Ingestion
$0.50/GB"]
      sto["Storage
$0.03/GB-month"]
    end

    cp -->|"AWS-managed pipe"| cwcp
    ctrl -->|"awslogs driver"| cwctrl
    app --> fb
    fb -->|"low-volume / compliance"| cwapp
    fb -->|"high-volume / app"| ls

    cwcp --> ing
    cwctrl --> ing
    cwapp --> ing
    cwcp --> sto
    cwctrl --> sto
    cwapp --> sto
    ls -.->|"S3, not CloudWatch"| bill

    note1["Retention cap is set by
cloudwatch_log_group_retention_in_days
in eks.tf (default 30)."]
    cwcp -.-> note1

Diagram B — cost-lever leverage table (ASCII)¶

LEVER             COST IMPACT       LATENCY TO SAVINGS    OPS OVERHEAD     EXAMPLE
────────────────  ────────────────  ────────────────────  ───────────────  ──────────────────────────────
Retention cap     20-90% of stored  Stabilises after the  1 var, 1 apply   30d cap on /aws/eks/.../cluster
                  data after cap    cap window elapses                     turns infinite -> 30 day footprint
Sampling          90-99% of         Immediate             Fluent Bit cfg   Drop debug logs in prod;
                  ingested vol                            or app logger    100% errors + 1% info
Scrubbing         5-20% (small      Immediate             Fluent Bit cfg   Drop authorization header;
                  unless PII)                                              redact account numbers
CloudWatch -> Loki 70-90% of       Migration              Heavy initial    Apps to Loki; audit stays in
                  vol > 100GB/mo    project (1-2 weeks)   then steady      CloudWatch for compliance
Cardinality drop  100x metrics     Immediate              App-side label   Remove user_id label from
                  storage                                  scrubbing       http_requests_total
────────────────  ────────────────  ────────────────────  ───────────────  ──────────────────────────────
Order of attack:
  1. Cap retention (easiest, biggest stored-cost win)
  2. Drop debug logs in production (90% of stdout volume)
  3. Drop high-cardinality labels in metrics (Prometheus survives, you save TBs)
  4. Migrate apps to Loki if you're > 100 GB/month

Hands-on with the Bookstore Platform¶

0. Prerequisites¶

The bookstore-platform tree applied; cluster reachable via kubectl.
AWS CLI v2 configured with permissions to read CloudWatch + Cost Explorer.
A baseline: at least 48 hours of cluster activity so the log groups have some volume to inspect.

1. Inspect the log groups the cluster created¶

CLUSTER="$(terraform output -raw cluster_name)"
REGION="$(terraform output -raw region)"

# Every log group whose name starts with /aws/eks/<CLUSTER>.
aws logs describe-log-groups \
  --region "$REGION" \
  --log-group-name-prefix "/aws/eks/$CLUSTER" \
  --query 'logGroups[].{name:logGroupName,retention:retentionInDays,bytes:storedBytes}' \
  --output table

Sample output:

-------------------------------------------------------------------------
|                         DescribeLogGroups                             |
+-----------------------------------------+-----------+-----------------+
|                  name                   | retention |     bytes       |
+-----------------------------------------+-----------+-----------------+
| /aws/eks/bookstore-platform/cluster     |   30      |   2147483648    |
+-----------------------------------------+-----------+-----------------+

Confirm retention = 30. If retention = null, the log group is never-expire — the cluster was deployed without the Phase 14-R retention variable set; fix it immediately:

aws logs put-retention-policy \
  --region "$REGION" \
  --log-group-name "/aws/eks/$CLUSTER/cluster" \
  --retention-in-days 30

2. Discover the unbounded log groups Terraform didn't create¶

# AWS Load Balancer Controller, Karpenter, addon log groups — auto-created
# by the controllers, NOT managed by Terraform. They often default to
# never-expire.
aws logs describe-log-groups \
  --region "$REGION" \
  --query 'logGroups[?retentionInDays==null].logGroupName' \
  --output table

Any log group in this list is bleeding money. Cap them:

# Bulk cap every unmanaged log group to 30 days.
aws logs describe-log-groups \
  --region "$REGION" \
  --query 'logGroups[?retentionInDays==null].logGroupName' \
  --output text | \
while read -r LG; do
  echo "Capping $LG to 30 days..."
  aws logs put-retention-policy \
    --region "$REGION" \
    --log-group-name "$LG" \
    --retention-in-days 30
done

Production note: the cap-script is a one-shot. The durable fix is either a Terraform resource per log group (verbose but explicit) or a aws_cloudwatch_log_group Terraform resource with a wildcard + for_each that the addon discovers. Most teams pick the script-on-a- cron approach (run weekly via the drift-check workflow in Part 14 ch.07, Phase 14b).

3. Measure ingestion volume per log group¶

# CloudWatch Metrics — IncomingBytes is the ingestion-cost metric.
# Sum over the last 7 days, grouped by log group.
END_TS="$(date -u +%Y-%m-%dT%H:%M:%S)"
START_TS="$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)"

aws cloudwatch get-metric-statistics \
  --region "$REGION" \
  --namespace AWS/Logs \
  --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value="/aws/eks/$CLUSTER/cluster" \
  --start-time "$START_TS" \
  --end-time "$END_TS" \
  --period 86400 \
  --statistics Sum \
  --query 'Datapoints[].{day:Timestamp,gb:Sum}' \
  --output table

Each Sum is bytes ingested that day; divide by 1024^3 for GB. Multiply by $0.50/GB to estimate the daily ingestion line on the bill. A healthy bookstore-platform cluster runs at ~200-500 MB/day per log group; sustained values above 2 GB/day indicate a controller in a hot reconcile loop and demand investigation.

4. Find the chatty controller¶

# Find the noisiest log stream in the audit log group.
aws logs describe-log-streams \
  --region "$REGION" \
  --log-group-name "/aws/eks/$CLUSTER/cluster" \
  --order-by LastEventTime \
  --descending \
  --max-items 10 \
  --query 'logStreams[].{name:logStreamName,size:storedBytes}' \
  --output table

# Tail the audit log for one minute to spot the chatty caller.
aws logs tail "/aws/eks/$CLUSTER/cluster" \
  --region "$REGION" \
  --since 1m \
  --filter-pattern '{ $.user.username = "*" }' \
  --format short | \
  jq -r 'select(.user) | .user.username' | \
  sort | uniq -c | sort -rn | head

Sample output:

   8421 system:serviceaccount:karpenter:karpenter
   3104 system:serviceaccount:kube-system:aws-node
    421 system:serviceaccount:argocd:argocd-application-controller
     12 arn:aws:iam::<ACCOUNT_ID>:user/<USER>

A normal Karpenter list-watch is high-volume but bounded; a Karpenter emitting 100k+ audit events per hour is in a reconcile loop and the fix is usually a misconfigured NodePool or a CRD version mismatch (see Part 10 ch.06's Karpenter debug runbook).

5. Set a CloudWatch alarm on ingestion volume¶

# Alarm when daily ingestion exceeds 5 GB on the audit log.
aws cloudwatch put-metric-alarm \
  --region "$REGION" \
  --alarm-name "${CLUSTER}-cluster-log-volume-high" \
  --alarm-description "EKS cluster log ingestion above 5 GB/day" \
  --metric-name IncomingBytes \
  --namespace AWS/Logs \
  --statistic Sum \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 5368709120 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=LogGroupName,Value="/aws/eks/$CLUSTER/cluster" \
  --alarm-actions "$(terraform output -raw budget_alarm_sns_topic_arn 2>/dev/null || echo '')"

This alarm reuses the SNS topic that Phase 14-R's cost-budgets.tf provisions when enable_budget_alarm = true. If you haven't enabled the budget alarm yet, the topic doesn't exist; either enable it (Part 14 ch.06 walks through that) or create an SNS topic explicitly for log-volume alerts.

6. Migrate application logs to Loki¶

The application namespaces in the bookstore-platform tree currently ship to CloudWatch via the EKS-default Fluent Bit DaemonSet. For non-compliance-bound app logs, route them to Loki instead:

# Install Loki (pinned) for the cost-efficient ingestion path.
LOKI_VERSION="6.18.0"

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --version "$LOKI_VERSION" \
  -n logging --create-namespace --wait \
  --set 'loki.auth_enabled=false' \
  --set 'loki.storage.type=s3' \
  --set "loki.storage.s3.bucketnames=<LOKI_BUCKET_NAME>" \
  --set "loki.storage.s3.region=$REGION" \
  --set 'singleBinary.replicas=1' \
  --set 'minio.enabled=false'

# Update Fluent Bit to route the apps namespaces to Loki, control plane
# remains on CloudWatch. The config map (excerpt):
cat <<EOF
[OUTPUT]
    Name        loki
    Match       kube.var.log.containers.*_bookstore-platform-*_*.log
    host        loki-gateway.logging.svc.cluster.local
    port        80
    labels      job=fluent-bit,cluster=$CLUSTER

[OUTPUT]
    Name        cloudwatch_logs
    Match       kube.var.log.containers.*_kube-system_*.log
    region      $REGION
    log_group_name /aws/eks/$CLUSTER/system
    log_stream_prefix fluentbit-
    auto_create_group true
EOF

The split (apps to Loki, system to CloudWatch) is the canonical cost-driven shape: apps generate 90%+ of the volume and have soft retention requirements; system logs are low-volume but audit-bound.

7. Audit the metrics cardinality before it explodes¶

# In a Prometheus-equipped cluster (Part 06 ch.01), find the top-N
# highest-cardinality metrics.
kubectl -n prometheus-system port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &

# topk(10, count by (__name__)({__name__=~".+"}))
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))' | \
  jq -r '.data.result[] | "\(.value[1]) series  \(.metric.__name__)"' | sort -rn | head

Sample output:

 1842000 series  http_requests_total
  421000 series  container_network_receive_bytes_total
  315000 series  apiserver_request_duration_seconds_bucket
   52000 series  kube_pod_status_phase

A metric with 1.8 million series is almost certainly carrying a high-cardinality label (user_id, request_id, full URL path). Find the offending label:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(http_requests_total) by (__name__, route)' | \
  jq '.data.result | length'

A route label with 50,000+ unique values is the smoking gun — the application is exporting one series per exact URL path instead of per route template (/users/123 vs /users/{id}). The fix lives in the app's metrics exporter; Prometheus cannot fix this at scrape time.

How it works under the hood¶

CloudWatch Logs pricing — the per-byte math. AWS charges in two dimensions:

Ingestion — $0.50/GB at GA prices (us-east-1; other regions are the same or +5-10%). Calculated on the raw JSON event size at ingest time. Compression at storage is on top; you do not benefit from your application emitting compact log lines except for the ingestion-byte-count.
Storage — $0.03/GB-month on the post-compression footprint. CloudWatch compresses on store (~7-10x typical). A 30 GB ingestion in a month leaves ~3-4 GB on disk; storage cost is on the 3-4 GB.

The cost formula for a single log group, simplified:

month_cost = ingest_GB * 0.50         # one-shot ingestion
           + stored_GB * 0.03         # per month, for the retention window

For the EKS audit log at 2 GB/day raw, 30-day retention:

ingest = 60 GB/month * 0.50 = $30/month
stored = ~8 GB on disk * 0.03 = $0.24/month
total  = ~$30.24/month

The ingestion line dominates; storage is round-off. The leverage is in reducing ingestion volume (sampling, scrubbing); retention caps only constrain the stored term.

The "Never expire" default explained. CloudWatch log groups are created with retentionInDays = null unless explicitly set. AWS does not impose a global retention; you pay storage forever. This is by design (some workloads need 7-year retention), but it is also the single most common cost trap. The retentionInDays API field is the one knob; setting it to a number (1, 7, 30, 90, 365, etc.) schedules a background job that deletes events older than the cap. The deletion is asynchronous (AWS sometimes takes 24-72 hours after the cap elapses to actually delete), but the billing aligns with the cap immediately.

Loki's cost shape — S3 economics. Loki stores log "chunks" (compressed batches of N MB of logs from M streams) and an index that maps {labels} to chunk references. Both go to S3 by default:

S3 Standard storage — $0.023/GB-month for the first 50 TB.
S3 Standard-IA — $0.0125/GB-month for objects rarely accessed (lifecycle to IA after 30 days saves ~50% on aging data).
S3 Intelligent-Tiering — automatic; recommended for unknown access patterns.

A Loki deployment ingesting 100 GB/month and retaining 30 days costs approximately:

S3 standard storage: 100 GB * 0.023 = $2.30/month
PUT requests: ~1M * $0.005/1K = $5/month
GET requests on query: ~100k * $0.0004/1K = $0.04/month
Compute (Loki itself in-cluster): 0.5 CPU + 1 GB RAM = ~$15/month
Total: ~$22/month

vs CloudWatch at the same volume: 100 * 0.50 = $50/month ingestion alone, plus storage. Loki breaks even at ~20 GB/month ingestion and saves significant money above 100 GB.

Prometheus cardinality — why one label can break everything. Prometheus's TSDB stores one series per unique label-combination. Each series carries: a 16-byte hash, a label-name+value-pair-set, and a chunk-pointer array. The fixed-cost-per-series is roughly 3 KB in RAM + several hundred bytes on disk; 1 million series therefore costs ~3 GB RAM and ~500 MB disk per day. Crossing 10 million series typically OOMs the Prometheus Pod.

The cardinality killers in order:

user_id, customer_id, account_id — unbounded; never on a per-request metric.
request_id, trace_id — unbounded; never on any metric (these belong on logs/traces).
Full URL paths — unbounded if the app exports /users/123 not /users/{id}.
IP addresses — bounded but huge (billions of v4 addresses).
Free-form error_message labels — unbounded.

The discipline lives in the application's metrics exporter, not in Prometheus. Use route templates not exact paths; use error_code classifications not message strings; never put user/request IDs on metrics.

Sampling vs scrubbing vs retention — the three levers' mechanics.

Sampling at the source (the app emits less): the application configures its logger to drop DEBUG in production; an OpenTelemetry SDK samples at 1% via the TracesSampler config. Cost reduction multiplicative on every line that doesn't get emitted.
Sampling at the pipe (Fluent Bit drops): a [FILTER] block that does Regex match-and-exclude on common-uninteresting log shapes. Less common — usually the source is easier to control.
Scrubbing (the same volume, but smaller per line): Fluent Bit modify filter strips PII fields; the line still ships but is 10-50% smaller. Tiny dollar impact unless lines were enormous; large compliance-scope impact (no PII in CloudWatch keeps you out of GDPR/HIPAA evaluation for that log group).
Retention at the sink: the log/storage lifecycle policy. Doesn't reduce ingestion (which is paid before retention applies); reduces stored cost only.

Production notes¶

In production: Cap retention on every log group within 24 hours of the cluster's first apply, including the AWS-managed ones the EKS service auto-creates (LB controller, addon logs). The aws logs describe-log-groups --query 'logGroups[?retentionInDays==null]' query is the canonical drift check — schedule it weekly via the Phase 14b ch.07 drift workflow and alert on any never-expire group.

In production: Drop debug logs in production at the application level, not at the Fluent Bit level. The Bookstore Go services use structured logging with a LOG_LEVEL=info env in production manifests; the Helm chart's values-prod.yaml enforces it. A developer setting LOG_LEVEL=debug to chase a bug in production burns money for every second they forget to revert; a CI policy that rejects PRs setting debug in production manifests is the durable defense.

In production: Watch out for the "trace-id label on metrics" antipattern. New developers, accustomed to logs-and-traces having a trace ID, often add trace_id as a metric label "for correlation." This is the cardinality bomb. Prometheus's TSDB will not fit a per-request label; the metric goes to several million series within hours and the Pod OOMs. The fix is the code review — trace_id belongs on logs (Fluent Bit sends it to Loki + CloudWatch) and traces (Tempo), never on metrics.

In production: The Karpenter audit-log volume problem. Karpenter aggressively list-watches Node, NodeClaim, Pod resources; every list-watch is an audit event. A cluster with 200 nodes can easily emit 5-10 GB/day of audit just from Karpenter's reconcile traffic. The mitigation is an audit-policy that excludes specific system service accounts from audit logging — file an exclusion for system:serviceaccount:karpenter:karpenter on get/list/watch verbs against nodes/nodeclaims/pods. EKS exposes this via the cluster.audit_policy field on newer cluster versions; check what your version supports.

In production: S3 lifecycle on Loki chunks is non-optional. A Loki bucket without transition to Standard-IA after 30 days + Glacier Deep Archive after 90 days + delete after 365 days retains every chunk at standard storage indefinitely. The Helm chart documentation calls this out; the practice often slips. The examples/bookstore-platform/terraform/loki.tf (if you've added Loki via Terraform) should include an aws_s3_bucket_lifecycle_configuration resource explicitly.

In production: Cardinality alerting beats post-hoc forensics. Prometheus's own prometheus_tsdb_head_series metric exposes the total series count; alarm on rapid growth (rate(...[1h]) > 10000) and you catch a new label rollout before the TSDB fills. The bookstore-platform's examples/bookstore-platform/cost/opencost-tenant-aggregations.yaml includes a PrometheusCardinalityHigh alert as a template — copy it to your cluster's PrometheusRule.

In production: Compliance retention is a different problem from operational retention. Audit logs that an auditor asks for in year 7 are a separate stream from the operational logs you read this week. The pattern: 30-day operational retention in CloudWatch, plus a once-daily S3 export of the audit log group to a separate, cheap, write-once bucket with 7-year retention. AWS exposes this via the CloudWatch Logs aws logs create-export-task API or via subscription filters to Kinesis Firehose to S3.

Quick Reference¶

# Cap retention on every unmanaged log group (drift script).
aws logs describe-log-groups \
  --region "$REGION" \
  --query 'logGroups[?retentionInDays==null].logGroupName' \
  --output text | \
while read -r LG; do
  aws logs put-retention-policy --region "$REGION" \
    --log-group-name "$LG" --retention-in-days 30
done

# Daily ingestion volume per log group (the cost-driver metric).
aws cloudwatch get-metric-statistics --region "$REGION" \
  --namespace AWS/Logs --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value="/aws/eks/$CLUSTER/cluster" \
  --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 86400 --statistics Sum

# Top-cardinality metrics in Prometheus.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))'

# Alarm on log-volume spike.
aws cloudwatch put-metric-alarm --alarm-name "${CLUSTER}-log-volume-high" \
  --metric-name IncomingBytes --namespace AWS/Logs --statistic Sum \
  --period 86400 --evaluation-periods 1 --threshold 5368709120 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=LogGroupName,Value="/aws/eks/$CLUSTER/cluster"

Minimal skeleton — Fluent Bit OUTPUT config that splits apps to Loki + system to CloudWatch:

# fluent-bit-config.yaml — excerpt
apiVersion: v1
kind: ConfigMap
metadata:
  name: <NAME>
  namespace: logging
data:
  fluent-bit.conf: |
    [OUTPUT]
        Name        loki
        Match       kube.var.log.containers.*_bookstore-platform-*_*.log
        host        loki-gateway.logging.svc.cluster.local
        labels      cluster=<CLUSTER_NAME>,job=fluent-bit

    [OUTPUT]
        Name        cloudwatch_logs
        Match       kube.var.log.containers.*_kube-system_*.log
        region      <REGION>
        log_group_name /aws/eks/<CLUSTER>/system
        log_retention_days 30
        auto_create_group true

Cost-discipline checklist (the bill is bounded when all six are yes):

Every log group has retentionInDays set (no never-expire).
The control-plane audit log is < 5 GB/day (alarm if exceeded).
Application logs > 100 GB/month route to Loki, not CloudWatch.
No metric carries user_id, request_id, or trace_id labels.
Production app log level is INFO (never DEBUG outside dev).
Compliance-bound logs export daily to a separate 7-year bucket.

Test your understanding¶

Try each before opening the answer drawer. The act of trying is the exercise; the answer is the check.

What are the three cost levers (sample, scrub, retain) and in what order do they have the most leverage?

Show answer
**Sample at the source** has the highest ingestion-cost leverage — dropping DEBUG logs in prod or scraping at 30s instead of 5s reduces ingestion 10-100x. **Scrub at the boundary** strips PII / high-cardinality fields before they cross an ingestion boundary; bounded cost saving but huge compliance-scope reduction. **Cap retention at the sink** is the cheapest lever (one Terraform variable) and the highest dollar-impact for *storage* costs but doesn't reduce ingestion. Order matters: sample first (kills volume), scrub second (kills compliance scope), retain third (caps the accumulator).
Finance flags a CloudWatch bill of $400/month for one cluster. You inherit it, run aws logs describe-log-groups, and see 12 log groups with retentionInDays: null. What do you do first, and what's the trap with rushing the fix?

Show answer
The unbounded log groups have been accumulating since cluster creation — storage cost dominates ingestion at this point. The fix is to set `retention_in_days` (via Terraform if managed, `aws logs put-retention-policy` if not) for each group; AWS will then begin expiring logs older than the cap on the next cycle. The trap: if any of those logs are compliance-bound (audit logs for SOC2/HIPAA), you can't just slap 7-day retention on them — coordinate with the compliance team, set the audit log to 90/365 days, and route a daily export to a cheaper S3 lifecycle-managed bucket. The bill drops within days; the compliance posture stays intact.
A developer adds user_id as a label on http_requests_total so the Grafana dashboard can break down latency by user. Two weeks later, Prometheus is OOMKilling. Why?

Show answer
Cardinality explosion. Prometheus stores one series per unique label-combination; adding a `user_id` label with 10,000 distinct users multiplies the existing series count by 10,000. A metric that was 5,000 series becomes 50,000,000 series — TSDB head series counts in the millions cause the WAL to grow unbounded and the head-block memory blows the Pod's `requests.memory`. The fix is removing the `user_id` label (use traces / exemplars / logs for per-user investigation), or moving the high-cardinality dimension to a sampled trace path. The chapter's discipline: never put user_id, request_id, trace_id, full URL paths on a Prometheus label.
Hands-on extension — at midnight, set the audit log stream's retention to Never expire (just to see), then run a kubectl get pods --all-namespaces -w for 30 minutes. Check ingestion the next day.

What you should see
The `watch` is a continuous list-watch against the apiserver; every event lands in the audit log. 30 minutes of watching with a busy cluster easily produces 50-100 MB of audit-log volume — at $0.50/GB ingested that's $0.025-$0.05 from one operator's terminal. Multiply by every Argo CD controller's reconcile-loop list-watch, every Karpenter scheduler watch, every Falco rule watcher running on the cluster 24/7, and you understand why the audit log alone can hit several GB/day. Re-cap retention to 30 days when done.

14.05 — Logging & metrics cost discipline¶

Why this exists¶

Mental model¶

Diagrams¶

Diagram A — the cost flow from controller stdout to AWS bill (Mermaid)¶

Diagram B — cost-lever leverage table (ASCII)¶

Hands-on with the Bookstore Platform¶

0. Prerequisites¶

1. Inspect the log groups the cluster created¶

2. Discover the unbounded log groups Terraform didn't create¶

3. Measure ingestion volume per log group¶

4. Find the chatty controller¶

5. Set a CloudWatch alarm on ingestion volume¶

6. Migrate application logs to Loki¶

7. Audit the metrics cardinality before it explodes¶

How it works under the hood¶

Production notes¶

Quick Reference¶

Test your understanding¶

Further reading¶