Module 4: CI/CD Pipelines & Release Engineering: Case Studies
These case studies focus on delivery safety: build once, promote, measure, deploy progressively, roll back, and secure the pipeline itself.
Case Study 1: DORA Metrics Reveal A Delivery Bottleneck
Scenario: A team says delivery is healthy because deployments are frequent, but incident records show a high change failure rate and slow recovery.
Source anchor: Google Cloud's Four Keys project, which uses the four DORA metrics to measure delivery performance.
Module concepts: deployment frequency, lead time for changes, change failure rate, time to restore (MTTR).
Wrong Approach
Optimize only deployment frequency.
Better Approach
Measure the four together:
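deployment frequency: how often changes are deployed to production
lead time: time from commit to running in production
change failure rate: share of deployments that cause a production failure
time to restore: how long it takes to recover from a production failure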
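A minimal sketch of computing the four measures together from deployment records. The `Deployment` record shape and the `four_keys` helper are hypothetical, and restore time is approximated here as the gap between the failing deploy and recovery, which may differ from how your incident tooling measures it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # first commit in the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def four_keys(deploys: list[Deployment], window_days: int) -> dict:
    """Summarize all four metrics for deployments inside one time window."""
    if not deploys:
        raise ValueError("no deployments in window")
    failures = [d for d in deploys if d.failed]
    # Assumption: restore time is measured from the failing deploy to recovery.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Usage: two deployments in a one-day window, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=7)),
]
report = four_keys(sample, window_days=1)
```

Reporting the four values in one structure makes it harder to celebrate a high `deploys_per_day` while `change_failure_rate` quietly climbs.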
Tradeoff Table
| Metric | If ignored |
|---|---|
| deployment frequency | batches grow |
| lead time | slow feedback |
| change failure rate | speed hides instability |
| restore time | incidents last too long |
Required Artifact
Create a delivery-health dashboard and one improvement experiment.
Case Study 2: Build Once, Promote Everywhere
Scenario: CI builds one artifact for staging and another for production, so the staging tests passed on a different binary than the one deployed to production.
Source anchor: The Twelve-Factor App and modern release guidance emphasize strict separation of build, release, and run. See Twelve-Factor: Build, release, run.
Module concepts: immutable artifact, promotion, provenance, environment config.
Wrong Approach
Rebuild per environment.
Better Approach
Build once:
commit -> build image -> sign/tag digest -> deploy digest to staging -> promote same digest to prod
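The flow above can be sketched as a small model in which promotion only ever copies a digest that an earlier environment already ran. The `Release` class, the environment names, and the example digest are hypothetical; a real pipeline would resolve the digest from its registry.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    """Tracks which image digest each environment runs (hypothetical model)."""
    digest: str                                   # produced once by the CI build
    deployed: dict[str, str] = field(default_factory=dict)

    def deploy(self, env: str, digest: str) -> None:
        # Refuse anything that is not the digest built for this release.
        if digest != self.digest:
            raise ValueError(f"{env}: digest {digest} does not match the built artifact")
        self.deployed[env] = digest

    def promote(self, source: str, target: str) -> None:
        # Promotion reuses the digest already running in the source environment.
        if source not in self.deployed:
            raise ValueError(f"cannot promote: nothing deployed to {source}")
        self.deploy(target, self.deployed[source])


# Usage: build once, deploy to staging, then promote the same digest to prod.
release = Release(digest="sha256:" + "ab" * 32)  # placeholder digest
release.deploy("staging", release.digest)
release.promote("staging", "prod")
```

Because `promote` takes environment names rather than a digest, there is no code path that lets production receive an artifact staging never ran.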
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| rebuild per env | easy variables | no artifact equivalence |
| promote same artifact | test/prod parity | config discipline |
| image digest | immutability | tooling required |
| mutable tag (latest) | convenience | audit risk |
Required Artifact
Write a promotion pipeline with artifact ID, environments, approvals, and rollback target.
Case Study 3: Canary Without Rollback Criteria
Scenario: A canary deploy sends 5% traffic to a new version. It stays live despite elevated checkout errors because nobody defined abort thresholds.
Source anchor: Kubernetes Deployment docs and progressive delivery guidance support controlled rollouts; use Google SRE monitoring concepts to choose symptoms. See Kubernetes Deployments and Google SRE monitoring.
Module concepts: canary, metrics, rollback, deployment marker, SLO.
Wrong Approach
Treat "canary" as meaning small traffic, without defining what makes that traffic safe.
Better Approach
Define gates:
advance if:
- p95 latency within 10% of baseline
- error rate below threshold
- checkout success unchanged
rollback if:
- SLO burn exceeds threshold
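A minimal sketch of these gates as a single decision function. The `Snapshot` shape, the default thresholds, and the extra "hold" outcome (neither advance nor roll back) are assumptions for illustration; real thresholds should come from your SLOs.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics for one version over the same time window (hypothetical shape)."""
    p95_latency_ms: float
    error_rate: float        # fraction of requests that failed
    checkout_success: float  # fraction of checkouts that succeeded


def canary_decision(
    baseline: Snapshot,
    canary: Snapshot,
    slo_burn_rate: float,
    *,
    max_error_rate: float = 0.01,       # placeholder threshold
    max_burn_rate: float = 2.0,         # placeholder threshold
    checkout_tolerance: float = 0.005,  # how much drop still counts as "unchanged"
) -> str:
    """Return "rollback", "advance", or "hold" for the current canary stage."""
    # The abort condition is checked first so it cannot be masked by other metrics.
    if slo_burn_rate > max_burn_rate:
        return "rollback"
    healthy = (
        canary.p95_latency_ms <= baseline.p95_latency_ms * 1.10
        and canary.error_rate < max_error_rate
        and canary.checkout_success >= baseline.checkout_success - checkout_tolerance
    )
    return "advance" if healthy else "hold"


# Usage with made-up numbers.
baseline = Snapshot(p95_latency_ms=200, error_rate=0.002, checkout_success=0.97)
canary = Snapshot(p95_latency_ms=210, error_rate=0.003, checkout_success=0.969)
decision = canary_decision(baseline, canary, slo_burn_rate=0.5)
```

Writing the gates as code forces each threshold to be a number someone chose, which is exactly what was missing in the scenario.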
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| manual canary | human judgment | slow/missed signals |
| automated gates | fast stop | metric quality required |
| blue-green | quick switch | duplicate capacity |
| rolling | efficient | harder instant rollback |
Required Artifact
Write canary stages, metrics, thresholds, duration, rollback command, and owner.
Case Study 4: CI Secrets Replaced With OIDC
Scenario: A GitHub Actions workflow stores a long-lived cloud access key as a secret. A workflow triggered from a fork or a leaked log could expose production credentials.
Source anchor: GitHub's OpenID Connect security hardening, which explains using OIDC tokens instead of long-lived secrets with cloud providers.
Module concepts: OIDC, short-lived credentials, CI identity, least privilege.
Wrong Approach
Put long-lived cloud keys in CI secrets.
Better Approach
Federate identity:
GitHub OIDC token -> cloud trust policy -> short-lived deploy role
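The trust-policy step can be sketched as a claim check on the cloud side. This sketch assumes the token's signature and expiry were already verified elsewhere and only shows the matching logic; the `claims_allowed` helper, the repository name, and the audience value are hypothetical.

```python
# Issuer URL that GitHub Actions uses for its OIDC tokens.
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def claims_allowed(claims: dict, *, repo: str, environment: str, audience: str) -> bool:
    """Return True only if already-verified token claims match the trust policy.

    Assumption: signature and expiry validation happened before this call.
    """
    # Subject format GitHub uses for jobs that reference a deployment environment.
    expected_sub = f"repo:{repo}:environment:{environment}"
    return (
        claims.get("iss") == GITHUB_ISSUER
        and claims.get("aud") == audience
        and claims.get("sub") == expected_sub
    )


# Usage: a production deploy job versus a job from an ordinary branch.
policy = {"repo": "example-org/shop", "environment": "production",
          "audience": "deploy.example.internal"}  # hypothetical values
prod_job = {"iss": GITHUB_ISSUER, "aud": "deploy.example.internal",
            "sub": "repo:example-org/shop:environment:production"}
branch_job = {"iss": GITHUB_ISSUER, "aud": "deploy.example.internal",
              "sub": "repo:example-org/shop:ref:refs/heads/feature-x"}
prod_ok = claims_allowed(prod_job, **policy)
branch_ok = claims_allowed(branch_job, **policy)
```

Matching the full subject string, rather than only the repository prefix, is what keeps a job on an arbitrary branch from assuming the deploy role.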
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| static key | simple | leak/rotation risk |
| OIDC role | short-lived and scoped | trust-policy setup |
| broad role | fewer failures | high blast radius |
| env-scoped roles | safer | more policies |
Required Artifact
Write an OIDC trust policy review: repo, branch/environment, role permissions, and audit evidence.
Case Study 5: Database Migration Breaks Rolling Deploy
Scenario: A deploy removes a column while old pods still run. Old pods crash during the rolling update.
Source anchor: GitLab's post-deployment migration guidance describes separating dangerous database changes from code rollout. See GitLab post-deployment migrations.
Module concepts: expand/contract, backward compatibility, rolling deploy, migration ordering.
Wrong Approach
Deploy incompatible schema and code together.
Better Approach
Use expand/contract:
Release A:
- add new column/table
- code writes both old and new if needed
Release B:
- read new shape
Post-deploy:
- remove old column after old code is gone
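One way to make the phases checkable is a compatibility matrix of code releases against schema phases. The column names, phase names, and release names below are hypothetical, standing in for a column being replaced; the point is that a schema phase is safe only if every release still running can work with it.

```python
# Columns present after each schema phase (hypothetical table, one column replaced).
SCHEMA_PHASES = {
    "before": {"id", "full_name"},
    "expanded": {"id", "full_name", "display_name"},
    "contracted": {"id", "display_name"},
}

# Columns each code release needs in order to run.
CODE_NEEDS = {
    "old": {"id", "full_name"},
    "release_a": {"id", "full_name", "display_name"},  # writes both columns
    "release_b": {"id", "display_name"},               # reads the new shape only
}


def compatibility_matrix() -> dict[tuple[str, str], bool]:
    """Map (code release, schema phase) to whether that code can run on that schema."""
    return {
        (code, phase): needs <= columns  # subset test: every needed column exists
        for code, needs in CODE_NEEDS.items()
        for phase, columns in SCHEMA_PHASES.items()
    }


def safe_to_apply(phase: str, running: list[str]) -> bool:
    """A schema phase is safe only if every release still running supports it."""
    matrix = compatibility_matrix()
    return all(matrix[(code, phase)] for code in running)


# Usage: expanding is safe mid-rollout; contracting is not until old pods are gone.
expand_ok = safe_to_apply("expanded", ["old", "release_a"])
contract_too_early = safe_to_apply("contracted", ["old", "release_b"])
contract_ok = safe_to_apply("contracted", ["release_b"])
```

In this sketch the dual-writing release also blocks the contract step, since it still references the old column.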
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| one-step migration | simple | rolling deploy breakage |
| expand/contract | safe compatibility | more phases |
| post-deploy migration | lower downtime risk | process overhead |
| feature flag | decouples release | cleanup discipline |
Required Artifact
Write a migration rollout plan with compatibility matrix and rollback point.
Source Map
| Source | Use it for |
|---|---|
| Google Cloud Four Keys | DORA delivery metrics |
| Twelve-Factor: Build, release, run | immutable artifact promotion |
| Kubernetes Deployments | rollout and rollback mechanics |
| Google SRE monitoring | symptom metrics for release gates |
| GitHub Actions OIDC | secure cloud auth from CI |
| GitLab post-deployment migrations | safe schema rollout |
Completion Standard
- At least three artifacts are completed.
- At least one artifact tracks all four DORA metrics.
- At least one artifact promotes an immutable artifact.
- At least one artifact includes rollback criteria.