Deployment Strategies: Rolling, Blue-Green, Canary
What This Concept Is
Three standard strategies for replacing a running version without downtime. Each makes different tradeoffs on resource cost, rollback speed, blast radius during the swap, and operational complexity.
- Rolling update. Replace instances N at a time. Old and new run side-by-side briefly. When the new pool is fully healthy, the old is gone. Default for most orchestrators (Kubernetes Deployment, ECS service update).
- Blue-green. Stand up the full new fleet ("green") alongside the current production fleet ("blue"). When green is validated, switch routing from blue to green atomically (load balancer, DNS, or service selector). Roll back by switching back.
- Canary. Route a small percentage of real traffic (1%, then 5%, then 25%, then 100%) to the new version while monitoring metrics. Abort and shift traffic back if thresholds are breached.
A fourth strategy, shadow / dark traffic, duplicates production traffic onto the new version without surfacing responses to users -- useful for catching performance regressions before any real user sees a response. It's discussed in concept 8 as a form of dark launch.
Why It Matters Here
Deploying without a strategy is what produced the deploy-on-Saturday-at-3am culture that was common across the industry before 2010. Picking the right strategy is the difference between:
- a 30-second rollback and a 30-minute one
- catching a bad release at 1% of users and catching it at 100%
- 1x resource cost during deploy and 2x resource cost during deploy
- a single request-boundary error and a cross-service correlated incident
Concrete Comparison
| Property | Rolling | Blue-green | Canary |
|---|---|---|---|
| Resource cost during deploy | 1x + a few extra | 2x (two full fleets) | 1x + a small slice |
| Rollback speed | minutes (reverse roll) | seconds (flip router) | seconds (shift traffic back) |
| Blast radius during deploy | gradual (%) | 0% then 100% (atomic) | controlled % |
| Mixed versions in prod | yes, briefly | no | yes, deliberately |
| Good for stateless services | yes | yes | yes |
| Good for stateful services | risky (see DB concept) | very risky | risky |
| Complexity to set up | low | medium | high (needs traffic split + metrics) |
| Requires automated analysis | no | no | effectively yes |
Concrete Example: Kubernetes Rolling Update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # up to 2 extra pods during the roll
      maxUnavailable: 0  # never drop below 10 healthy pods
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/api@sha256:abc...
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 3
maxUnavailable: 0 and a real readiness probe are the two settings that make this actually safe. Without them, rolling update is a silent outage generator. In production, also set a PodDisruptionBudget so that node drains during cluster maintenance cannot evict pods below your minimum-healthy count while a roll is in progress. A PDB limits voluntary evictions, not the Deployment's own roll, so Kubernetes deploy safety is the intersection of Deployment strategy, PDB, and HPA, not any one of them alone.
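A minimal sketch of a PodDisruptionBudget that could sit next to the Deployment above. The name api-pdb and the floor of 8 pods are assumptions chosen for illustration; pick the floor from your own capacity needs.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb              # hypothetical name
spec:
  minAvailable: 8            # assumed floor: evictions are refused if they would leave fewer than 8 healthy pods
  selector:
    matchLabels: { app: api }   # same label the Deployment's pods carry
```

With 10 replicas this allows a node drain to evict at most 2 api pods at a time. The budget applies to evictions requested through the eviction API (for example kubectl drain), so it complements maxUnavailable rather than replacing it.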
Concrete Example: Blue-Green by Service Selector
Two deployments (api-blue, api-green), one Service. The swap is a one-line selector change:
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    color: green   # was "blue"; change + kubectl apply switches all traffic
  ports:
    - port: 80
      targetPort: 8080
Rollback is editing color: green back to color: blue and re-applying, which takes seconds as long as the blue fleet is still running. The "green" fleet must be warmed up (connection pools, JITs) before the switch -- cold-start spikes at swap-time are the most common blue-green bug.
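The selector swap only works if each fleet's pods carry a label that tells them apart. A minimal sketch of what the api-green Deployment might look like, assuming the label key color used by the Service above and a fleet the same size as blue (both the replica count and the reuse of the earlier probe settings are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 10                                # assumption: same size as the blue fleet
  selector:
    matchLabels: { app: api, color: green }   # color is included so blue and green never claim each other's pods
  template:
    metadata:
      labels: { app: api, color: green }      # what the Service selector matches after the swap
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/api@sha256:abc...   # digest left elided, as in the rolling example
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 3
```

The api-blue Deployment would be identical apart from its name, the color: blue labels, and the previous image digest.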
Concrete Example: Canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  # selector and pod template omitted here; a real Rollout needs them (or a workloadRef)
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1   # steps are 0-indexed: analysis begins at the first pause
        args:
          - name: service-name
            value: api
The paired AnalysisTemplate defines the promotion criterion -- typically a success-rate or p99-latency query against Prometheus -- which you must state explicitly (see concept 9 for rollback criteria). Flagger is an alternative operator with a GitOps-friendly reconciler model; both end up at the same "step + analysis + abort" structure.
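A minimal sketch of what the success-rate template referenced by the Rollout might contain. The Prometheus address, the metric http_requests_total, its service and status labels, and the 99% threshold are all assumptions for illustration; substitute whatever your services actually export.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name        # filled in by the Rollout's analysis.args
  metrics:
    - name: success-rate
      interval: 1m              # take one measurement per minute
      successCondition: result[0] >= 0.99   # assumed threshold: at least 99% non-5xx responses
      failureLimit: 3           # tolerate up to 3 failed measurements before the analysis fails
      provider:
        prometheus:
          address: http://prometheus.example.com:9090   # hypothetical address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

When the analysis fails, Argo Rollouts aborts the rollout and shifts traffic back to the stable version, which is the "abort" leg of the step + analysis + abort structure.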
Common Confusion / Misconception
"Rolling update is the safest default." It is the cheapest default, not the safest. Rolling cannot abort cleanly once started; once new pods are serving traffic, a rollback is itself a rolling update in reverse.
"Blue-green means zero downtime." Only for stateless services where both fleets can run concurrently. For stateful systems (shared databases, long-lived connections, caches) blue-green requires careful coordination or it causes correlated errors at the swap. Long-lived websockets and gRPC streams are especially hostile to blue-green because they don't "drain" -- they hold on until the server forces a disconnect.
"Canary and blue-green are alternatives." They compose. A common pattern is canary into a green fleet: start by routing 5% of traffic to a green fleet, ramp up, then decommission blue. Google's SRE Workbook documents this as the standard large-scale rollout.
"All traffic-split deployments need a service mesh." Not required. Ingress controllers (NGINX, Traefik), cloud load balancers (ALB weighted target groups), and gateway APIs all support traffic splitting natively. A mesh (Istio, Linkerd) adds power but is not a prerequisite.
"Canary percentages are the control." They are one control. Per-user or per-region canaries (only internal users; only eu-west-1) are often better for risky changes because the blast radius is bounded even if percentages aren't respected evenly by hashing.
How To Use It
Pick by service shape:
- Stateless internal service, low traffic -> rolling, done.
- User-facing service where a bad deploy hurts real customers -> canary with automated rollback.
- Service that cannot tolerate mixed versions (protocol breaking change, large DB migration) -> blue-green.
- Regulated/batch systems with maintenance windows -> blue-green or in-place after window.
Whichever you choose, write down:
- the exact rollback command or action
- how long rollback takes
- who authorizes rollback at 3am (hint: it should be the on-call engineer, no approvals)
- the exact metric + threshold that would trigger an automatic rollback (see concept 9)
Check Yourself
- Give one case where blue-green is riskier than rolling, and one where it is safer.
- What two Kubernetes Deployment settings make a rolling update actually safe?
- Why is the rollback of a rolling update slower than the rollback of a blue-green?
- Which strategy lets you catch a bug at 1% of users?
- What does combining canary into a blue-green fleet give you that neither does alone?
Mini Drill or Application
Pick a service you know. Fill in this table:
| Question | Answer |
|---|---|
| Current deploy strategy | |
| Rollback command | |
| Rollback wall-clock time | |
| Who is authorized to roll back without approval | |
| What automatic signal would trigger rollback | |
If any row is blank or "we'd figure it out," that is the gap to close before the next deploy.
See also (external)
- Martin Fowler: BlueGreenDeployment -- canonical short definition
- Martin Fowler: CanaryRelease -- origin of the term
- Kubernetes -- Deployments (rolling update) -- maxSurge, maxUnavailable, readiness
- Kubernetes -- PodDisruptionBudget -- eviction safety across rolls and cluster maintenance
- Argo Rollouts -- Canary strategy -- production-grade canary controller
- Flagger -- progressive delivery -- alternative Kubernetes canary operator
- Google SRE Workbook -- Canarying Releases -- Google's operational take
- AWS -- ALB weighted target groups for blue/green -- cloud LB traffic split without a mesh
Source Backbone
CI/CD behavior must be checked against official tool docs, but these books provide the durable release-engineering backbone.
- Pro Git - branching, tags, signing, and release history.
- GitHub Actions in Action - workflow and automation support.
- Software Engineering at Google - engineering-process and reliability context.