Skip to main content

Progressive Delivery and Rollback Discipline

What This Concept Is

Progressive delivery is the umbrella term for rollouts that expose a change to increasing portions of traffic while verifying it against real metrics at each step. Canary is one instance; traffic splits across regions or user segments are others. The defining property: every step has a measurable go/no-go criterion, and the no-go path is automated. The term was coined by James Governor (RedMonk, ~2018) to generalize canary + feature flag + shadow traffic under one name.

Rollback discipline is the other half. Every change has:

  1. a rollback trigger -- the specific metric and threshold that says "abort"
  2. a rollback action -- the exact mechanism to revert, tested and fast
  3. a rollback authority -- who can invoke it without asking permission
  4. a rollback deadline -- wall-clock time within which rollback must be possible

Any rollout without all four is a bet, not a deployment.

Why It Matters Here

Progressive delivery is the practical payoff of everything in this module:

  • small batches (concept 1) make each progressive step meaningful
  • build-once (concept 4) means the rollback artifact is already in production's artifact store
  • feature flags (concept 8) give you sub-second reversibility for behavior changes
  • canary strategy (concept 7) gives you traffic-level reversibility for infra changes
  • observability (concept 15) provides the signals that drive the automated go/no-go

Without automation, progressive delivery is a human staring at dashboards at 3am, which has roughly the same failure mode as any other manual process. Google's SRE Workbook is explicit: the single biggest lever on MTTR is automating the detect-and-revert loop, because humans detect slowly.

Concrete Example: Rollout with Explicit Rollback Criteria

Service: a payments API. Rollout config (Flagger-style):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: payments-api
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: payments-api
service:
port: 8080
analysis:
interval: 1m
threshold: 5 # fail after 5 failing intervals
maxWeight: 50
stepWeight: 5
metrics:
- name: request-success-rate
thresholdRange: { min: 99.5 } # % successful
interval: 1m
- name: request-duration-p95
thresholdRange: { max: 400 } # ms
interval: 1m
webhooks:
- name: smoke-test
url: http://smoke-runner/run
timeout: 30s

Rollback criteria, stated plainly:

  • success rate drops below 99.5% for 5 consecutive minutes, OR
  • p95 latency exceeds 400ms for 5 consecutive minutes, OR
  • the post-step smoke-test webhook fails

Rollback action: Flagger shifts traffic weight back to the stable deployment, scales down the canary, and reports failure. No human in the loop.

Rollback deadline: 5 minutes per metric; the full rollout pauses-or-aborts within ~5 minutes of a breach.

Error Budget Burn as a Rollback Trigger

A more sophisticated rollback signal, drawn from Google SRE practice, is error-budget burn rate. If the service has an SLO of 99.9% success and a 30-day budget, the budget is being consumed at some rate per minute. A canary that is burning budget 10x faster than baseline is violating the SLO regardless of absolute error rate. The SRE Workbook defines multi-window, multi-burn-rate alerts that are the canonical shape:

  • fast window (5 min) + high burn threshold (14.4x) -> page / abort fast for acute breakage
  • slow window (1 hour) + medium burn (6x) -> ticket / abort for slow leaks

Using burn rates instead of raw error rates avoids the two classic failure modes: (1) thresholds that fire on baseline flakiness, (2) thresholds that never fire because they're too loose for the traffic level.

Common Confusion / Misconception

"Our rollback plan is 'revert the PR and redeploy.'" That takes 10-30 minutes in a warm pipeline, longer when things are on fire. That is not a rollback plan for production; it is a hotfix plan. Rollback should be reverting to the previously running artifact that is still available in the registry and deployable in seconds.

"We watch the dashboards during deploys." Human observation is the slowest signal in the system. By the time a human notices, thousands of requests have failed. Automate the thresholds.

"Rollback triggers should be conservative to avoid false alarms." A false-positive rollback is cheap: you redeploy. A false-negative (not rolling back when you should have) is an incident. Tune toward sensitivity.

"We can always fix-forward." Sometimes, yes. But "fix-forward only" cultures ship the fix untested at 3am because rollback was treated as an escape hatch instead of a first-class path. Rollback should be the default; fix-forward is the option for cases where the previous state is genuinely broken too (e.g. a migration has mutated data that the old code cannot read).

"Progressive delivery is overkill for a small service." Maybe. But the rollback discipline is not. Even a tiny service deserves a documented rollback command and a named trigger.

"Automated rollback is scary because it might loop." Good rollout controllers back off after an abort: they don't immediately retry the canary with the same bits. The loop-prevention is "fail the rollout; wait for a new commit."

How To Use It

For any production rollout, write down on one page:

  1. Strategy. Rolling, blue-green, canary, or flag-first.
  2. Success metrics. Specific queries against observability: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) < 0.005.
  3. Progression. The exact weights and wait times: 5% -> wait 10m -> 25% -> wait 15m -> 50% -> wait 30m -> 100%.
  4. Rollback trigger. Mirrored against the success metric: breach of threshold for N minutes, or burn-rate above the multi-window alert.
  5. Rollback action. The exact command, tested in staging.
  6. Rollback owner. On-call engineer. No approval chain.
  7. Rollback deadline. How fast rollback must happen. Usually < 5 minutes end-to-end.

Check Yourself

  1. Name the four parts of a rollback plan.
  2. Why is "revert and redeploy" usually not a real rollback plan for production?
  3. What does it mean to tune rollback triggers toward sensitivity vs specificity, and which is usually correct?
  4. Give one metric and threshold you would set for a user-facing web API canary.
  5. Why is error-budget burn rate a better trigger than raw error rate for a high-SLO service?

Mini Drill or Application

For a real service, write a one-page rollout plan covering the seven items above. Include:

  • the exact PromQL / Datadog / CloudWatch query for the success metric
  • the exact kubectl / argo rollouts / platform command for rollback
  • a test you could run in staging to verify the rollback actually works
  • a specific burn-rate threshold if the service has an SLO

If you cannot write the PromQL query, your observability is the bottleneck, not your rollout tooling.

See also (external)


Source Backbone

CI/CD behavior must be checked against official tool docs, but these books provide the durable release-engineering backbone.