Deployment Strategies: Rolling, Blue-Green, Canary

What This Concept Is

Three standard strategies for replacing a running version without downtime. Each makes different tradeoffs on resource cost, rollback speed, blast radius during the swap, and operational complexity.

  • Rolling update. Replace instances N at a time. Old and new run side-by-side briefly. When the new pool is fully healthy, the old is gone. Default for most orchestrators (Kubernetes Deployment, ECS service update).
  • Blue-green. Stand up the full new fleet ("green") alongside the current production fleet ("blue"). When green is validated, switch routing from blue to green atomically (load balancer, DNS, or service selector). Roll back by switching back.
  • Canary. Route a small percentage of real traffic (1%, then 5%, then 25%, then 100%) to the new version while monitoring metrics. Abort and shift traffic back if thresholds are breached.

A fourth strategy, shadow / dark traffic, duplicates production traffic onto the new version without surfacing responses to users -- useful for catching performance regressions before any real user sees a response. It's discussed in concept 8 as a form of dark launch.

Why It Matters Here

"Deploying" without a strategy is what caused the pre-2010 industry's deploy-on-Saturday-at-3am culture. Picking the right strategy is the difference between:

  • a 30-second rollback and a 30-minute one
  • catching a bad release at 1% of users and catching it at 100%
  • 1x resource cost during deploy and 2x resource cost during deploy
  • a single request-boundary error and a cross-service correlated incident

Concrete Comparison

| Property | Rolling | Blue-green | Canary |
|---|---|---|---|
| Resource cost during deploy | 1x + a few extra | 2x (two full fleets) | 1x + a small slice |
| Rollback speed | minutes (reverse roll) | seconds (flip router) | seconds (shift traffic back) |
| Blast radius during deploy | gradual (%) | 0% then 100% (atomic) | controlled % |
| Mixed versions in prod | yes, briefly | no | yes, deliberately |
| Good for stateless services | yes | yes | yes |
| Good for stateful services | risky (see DB concept) | very risky | risky |
| Complexity to set up | low | medium | high (needs traffic split + metrics) |
| Requires automated analysis | no | no | effectively yes |

Concrete Example: Kubernetes Rolling Update

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # up to 2 extra pods during the roll
      maxUnavailable: 0   # never drop below 10 healthy pods
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/api@sha256:abc...
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 3

maxUnavailable: 0 and a real readiness probe are the two settings that make this actually safe. Without them, rolling update is a silent outage generator. In production, also set a PodDisruptionBudget. The Deployment controller's own roll is not limited by a PDB, but node drains and other voluntary evictions are, so the PDB keeps cluster maintenance that overlaps a roll from pushing you below your minimum-healthy count -- Kubernetes deploy safety is the intersection of Deployment strategy, PDB, and HPA, not any one of them alone.
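
A PodDisruptionBudget to pair with the Deployment above might look like the following sketch. The name api-pdb and the minAvailable value of 8 are assumptions chosen for illustration; pick the floor from your own capacity needs.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb            # hypothetical name
spec:
  minAvailable: 8          # assumed floor; must be below replicas (10) or drains will block
  selector:
    matchLabels:
      app: api             # matches the pod labels of the Deployment above
```

Setting minAvailable equal to the replica count would forbid every voluntary eviction, which tends to stall node drains, so the budget is usually left a little below the replica count.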

Concrete Example: Blue-Green by Service Selector

Two deployments (api-blue, api-green), one Service. The swap is a one-line selector change:

apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    color: green   # was "blue"; change + kubectl apply switches all traffic
  ports:
    - port: 80
      targetPort: 8080

Rollback is editing color: green back to color: blue and re-applying, which takes seconds. The "green" fleet must be warmed up (connection pools, JIT compilation) before the switch -- cold-start spikes at swap time are the most common blue-green bug.
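
For the selector swap to work, each fleet's pods need the color label the Service selects on. A sketch of what the api-green Deployment could look like follows. The replica count and the image placeholder are assumptions, and api-blue would be identical apart from its name, color label, and image.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 10                      # assumed: sized to carry full production traffic
  selector:
    matchLabels: { app: api, color: green }
  template:
    metadata:
      labels: { app: api, color: green }   # the Service selects on both labels
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/api:NEW_VERSION   # placeholder, not a real tag
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 3
```

Because the Service selects on both app and color, green pods receive no production traffic until the selector changes, which leaves room to validate and warm them first.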

Concrete Example: Canary with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  # replicas, selector, and pod template omitted for brevity
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: api

The paired AnalysisTemplate defines the promotion criterion -- typically a success-rate or p99-latency query against Prometheus -- which you must state explicitly (see concept 9 for rollback criteria). Flagger is an alternative operator with a GitOps-friendly reconciler model; both end up at the same "step + analysis + abort" structure.
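
A sketch of the success-rate AnalysisTemplate referenced by the Rollout above is shown here. The Prometheus address, the http_requests_total metric with its service and code labels, the 95% threshold, and the failure limit are all assumptions for illustration; substitute the metrics your service actually exports.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name          # supplied by the Rollout's analysis.args
  metrics:
    - name: success-rate
      interval: 1m                # re-evaluate every minute
      successCondition: result[0] >= 0.95   # assumed threshold
      failureLimit: 3             # abort after three failed measurements
      provider:
        prometheus:
          address: http://prometheus.example.com:9090   # hypothetical address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

When the measurement fails more often than failureLimit allows, the analysis run fails and the Rollout aborts, shifting traffic back to the stable version.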

Common Confusion / Misconception

"Rolling update is the safest default." It is the cheapest default, not the safest. Rolling has no instant abort: once new pods are serving traffic, a rollback (for example kubectl rollout undo) is itself a rolling update in reverse, so recovery takes minutes rather than seconds.

"Blue-green means zero downtime." Only for stateless services where both fleets can run concurrently. For stateful systems (shared databases, long-lived connections, caches) blue-green requires careful coordination or it causes correlated errors at the swap. Long-lived websockets and gRPC streams are especially hostile to blue-green because they don't "drain" -- they hold on until the server forces a disconnect.

"Canary and blue-green are alternatives." They compose. A common pattern is canary into a green fleet: start by routing 5% of traffic to a green fleet, ramp up, then decommission blue. Google's SRE Workbook covers blue/green deployment as a concept related to canarying in its chapter on canarying releases.

"All traffic-split deployments need a service mesh." Not required. Ingress controllers (NGINX, Traefik), cloud load balancers (ALB weighted target groups), and gateway APIs all support traffic splitting natively. A mesh (Istio, Linkerd) adds power but is not a prerequisite.
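
As one mesh-free illustration, the NGINX ingress controller can split traffic by weight using canary annotations on a second Ingress. The sketch below assumes a primary Ingress already exists for the same host and path, and that a Service named api-canary fronts the new version; the host name and Service name are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # about 5% of requests
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com          # hypothetical host; must match the primary Ingress
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary     # hypothetical Service for the new version
                port:
                  number: 80
```

Ramping up means raising canary-weight and re-applying; deleting this Ingress sends all traffic back to the primary.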

"Canary percentages are the control." They are one control. Per-user or per-region canaries (only internal users; only eu-west-1) are often better for risky changes because the blast radius stays bounded to that cohort even when hash-based splitting does not distribute traffic evenly.
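
A cohort-based canary can be sketched with the NGINX ingress controller's header-matching annotations rather than a weight. The header name X-Canary, the value internal, the host, and the api-canary Service are assumptions; the sketch also assumes a trusted layer in front of the ingress sets the header, since clients could otherwise opt themselves in.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary-internal
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"           # assumed header name
    nginx.ingress.kubernetes.io/canary-by-header-value: "internal"     # assumed cohort marker
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com          # hypothetical host; must match the primary Ingress
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary     # hypothetical Service for the new version
                port:
                  number: 80
```

With no canary-weight set, only requests carrying the matching header value reach the new version, so exposure is limited to the chosen cohort regardless of overall traffic volume.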

How To Use It

Pick by service shape:

  • Stateless internal service, low traffic -> rolling, done.
  • User-facing service where a bad deploy hurts real customers -> canary with automated rollback.
  • Service that cannot tolerate mixed versions (protocol breaking change, large DB migration) -> blue-green.
  • Regulated/batch systems with maintenance windows -> blue-green, or in-place during the window.

Whichever you choose, write down:

  1. the exact rollback command or action
  2. how long rollback takes
  3. who authorizes rollback at 3am (hint: it should be the on-call engineer, with no further approvals required)
  4. the exact metric + threshold that would trigger an automatic rollback (see concept 9)

Check Yourself

  1. Give one case where blue-green is riskier than rolling, and one where it is safer.
  2. What two Kubernetes Deployment settings make a rolling update actually safe?
  3. Why is the rollback of a rolling update slower than the rollback of a blue-green?
  4. Which strategy lets you catch a bug at 1% of users?
  5. What does combining canary with a blue-green fleet give you that neither does alone?

Mini Drill or Application

Pick a service you know. Fill in this table:

| Question | Answer |
|---|---|
| Current deploy strategy | |
| Rollback command | |
| Rollback wall-clock time | |
| Who is authorized to roll back without approval | |
| What automatic signal would trigger rollback | |

If any row is blank or "we'd figure it out," that is the gap to close before the next deploy.

Source Backbone

CI/CD behavior must be checked against official tool docs, but these books provide the durable release-engineering backbone.