Rollback Rehearsal: First Prod Deploy Fails

What This Concept Is

A rollback rehearsal is a deliberate deploy of a known-bad build or known-bad migration to prod (or a prod-like env), followed by a timed return to the previous healthy state. You do it on purpose, with a witness or a recording, at least once before the real incident.

A rehearsal produces three artifacts:

  • a written trigger list: the conditions under which you will roll back without discussion
  • a timer: how many minutes from "trigger observed" to "prod back to last healthy version"
  • a receipt: the git commit SHA or pipeline run ID that represents the successful revert, stored in library/raw/rollback-rehearsals/NNN.md

Rehearsal is not a thought experiment. It is a real deploy, a real rollback, and a real recorded time. The discipline lives in the writing, not in the doing alone.

Why It Matters Here (In the Capstone)

Every capstone has a plan to roll back. Almost none have actually executed one. The first time you try is always under pressure, and the pipeline will have a subtle gap you did not know about -- an unversioned image tag, a manual console change in staging that never made it to prod, a smoke test that only runs on success. Rehearsing surfaces those gaps cheaply, when the blast radius is zero.

DORA's research uses "failed deployment recovery time" as one of its core measures of software delivery performance. Your capstone defense benefits from a demonstrated MTTR, not a theoretical one. "Our rollback takes 4 minutes, rehearsed 2026-05-03" is a strong answer; "we have a rollback plan" is a weak one.

Concrete Example(s)

A realistic rollback trigger list for a solo capstone:

| Trigger | How observed | Timer budget |
| --- | --- | --- |
| 5xx rate > 5% for 2 minutes post-deploy | dashboard or smoke test | 5 minutes |
| Smoke test fails in pipeline | pipeline red on deploy job | 3 minutes |
| Login broken (manual check) | you | 10 minutes |
| p99 latency > 2x baseline for 5 minutes | dashboard alert | 10 minutes |
| Data corruption reported | you | do not roll back code alone; see migration concept |
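
The trigger table can also be kept as data, so that "roll back without discussion" becomes a lookup rather than a debate. The following is a minimal sketch, not a prescribed implementation: the thresholds and budgets come from the table, while the metric keys (`error_rate_5xx`, `p99_over_baseline`, and so on) are hypothetical names you would map to whatever your dashboard or smoke test actually reports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Trigger:
    name: str
    budget_minutes: Optional[int]  # None: do not roll back code alone
    fired: Callable[[Metrics], bool]


# Thresholds and budgets come from the table; metric keys are hypothetical.
TRIGGERS: List[Trigger] = [
    Trigger(
        "5xx rate > 5% for 2 minutes post-deploy", 5,
        lambda m: m.get("error_rate_5xx", 0.0) > 0.05
        and m.get("error_rate_minutes", 0.0) >= 2,
    ),
    Trigger("Smoke test fails in pipeline", 3,
            lambda m: m.get("smoke_test_failed", 0.0) == 1.0),
    Trigger("Login broken (manual check)", 10,
            lambda m: m.get("login_broken", 0.0) == 1.0),
    Trigger(
        "p99 latency > 2x baseline for 5 minutes", 10,
        lambda m: m.get("p99_over_baseline", 0.0) > 2.0
        and m.get("p99_minutes", 0.0) >= 5,
    ),
    Trigger("Data corruption reported", None,
            lambda m: m.get("data_corruption", 0.0) == 1.0),
]


def fired_triggers(metrics: Metrics) -> List[Trigger]:
    """Return every trigger whose condition holds for the observed metrics."""
    return [t for t in TRIGGERS if t.fired(metrics)]


def tightest_budget(fired: List[Trigger]) -> Optional[int]:
    """Smallest timer budget among fired triggers, or None if none applies."""
    budgets = [t.budget_minutes for t in fired if t.budget_minutes is not None]
    return min(budgets) if budgets else None
```

When several triggers fire at once, the tightest budget is the one to hold yourself to. The data-corruption row deliberately has no budget: it routes to the migration concept instead of a code rollback.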

The mechanism, preferred to least-preferred:

  • Redeploy previous image tag via pipeline. Deterministic, auditable, same code path as a normal deploy -- just pointed at an older tag.
  • Click "Rollback" on the cloud service console (Cloud Run revisions, AWS App Runner, Fly releases, ECS service revisions). Fast, but not reflected in git until you commit the revert manually.
  • git revert + wait for pipeline. Slow, and you are typing and reviewing a PR under stress. Only when the two above are unavailable.
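
The preference order above can be written down so that the choice is not made fresh under stress. This is a small sketch under the assumption that you can tell, at incident time, which mechanisms are actually usable; the mechanism labels are hypothetical names for the three options in the list.

```python
from typing import Optional, Set

# Most preferred first, matching the list above. Labels are illustrative.
PREFERENCE = ("pipeline_redeploy", "console_rollback", "git_revert")


def choose_mechanism(available: Set[str]) -> Optional[str]:
    """Pick the most preferred rollback mechanism that is currently usable."""
    for mechanism in PREFERENCE:
        if mechanism in available:
            return mechanism
    return None  # No known path: this is itself a gap to record.
```

For example, if the pipeline is down but the console works, `choose_mechanism({"console_rollback", "git_revert"})` selects the console path, and the manual git revert that must follow it goes in the receipt.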

A rehearsal script:

# Deliberately break prod with a known-bad image tag.
gh workflow run deploy.yml -f image_tag=v0.0.0-bad
# Wait for smoke test to fail. Start the wall clock.

# Run the rollback.
gh workflow run deploy.yml -f image_tag=v1.2.3

# Stop the clock when smoke test passes again.

Or the Cloud Run console-equivalent, as a backup path:

# Route 100% of traffic to the previous revision.
gcloud run services update-traffic capstone-api \
  --region us-central1 \
  --to-revisions=capstone-api-00042-a1b=100

Record the total minutes, the path taken, the command(s) actually run, and any gap discovered in library/raw/rollback-rehearsals/001.md.
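
One way to make the receipt hard to skip is to generate it from the timestamps you noted during the rehearsal. The sketch below assumes ISO 8601 timestamps with an explicit UTC offset; the section layout and the field names are illustrative choices, not a required format.

```python
from datetime import datetime
from typing import List


def elapsed_minutes(trigger_observed: str, healthy_again: str) -> float:
    """Minutes from 'trigger observed' to 'prod back to last healthy version'."""
    start = datetime.fromisoformat(trigger_observed)
    end = datetime.fromisoformat(healthy_again)
    return round((end - start).total_seconds() / 60, 1)


def render_receipt(number: int, trigger_observed: str, healthy_again: str,
                   path_taken: str, commands: List[str], gaps: List[str],
                   receipt_id: str) -> str:
    """Build the text of a rehearsal receipt (layout is an assumption)."""
    minutes = elapsed_minutes(trigger_observed, healthy_again)
    lines = [
        f"# Rollback rehearsal {number:03d}",
        "",
        f"- Trigger observed: {trigger_observed}",
        f"- Healthy again: {healthy_again}",
        f"- Total minutes: {minutes}",
        f"- Path taken: {path_taken}",
        f"- Receipt (commit SHA or pipeline run ID): {receipt_id}",
        "",
        "## Commands run",
        *[f"    {command}" for command in commands],
        "",
        "## Gaps discovered",
        *([f"- {gap}" for gap in gaps] or ["- none"]),
    ]
    return "\n".join(lines) + "\n"
```

Write the returned text to library/raw/rollback-rehearsals/001.md by hand or with a one-line file write. The point is that the number comes from recorded timestamps, not from memory.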

Common Confusion / Misconceptions

  • "If tests pass, we don't need a rollback plan." Tests are a filter, not a certainty. A deploy can fail for reasons tests cannot see: a cloud quota, a DNS misconfiguration, a missing secret, a schema migration that succeeded in staging against a small dataset and times out in prod. Rollback is the contract you keep with users when the filter misses.
  • "Rolling forward is always better than rolling back." Rolling forward is correct when the fix is certain and small. When you are unsure, rolling back buys time to think without users continuing to hit the bad state. Default to rollback; earn the right to roll forward by being specific about the fix.
  • "The cloud's auto-rollback feature removes the need for a rehearsal." Platform features such as ECS blue/green deployments or Cloud Run traffic splitting handle the mechanics of the traffic swap. They do not handle the decision (which trigger, which budget), the compensation for side effects, or the release note that explains what happened.
  • "Rollback fixes migrations." It does not -- see concept 11. Many rollbacks require a corresponding forward migration to restore data shape; treat any rollback across a migration boundary as a separate, planned event.
  • "The timer is just a vanity number." The timer is a forcing function. If it is >15 minutes, your rollback path has a gap worth fixing. If it is <2 minutes, check that the speed does not come from automation that acts without a human decision, which can widen the blast radius instead of shrinking it.
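
The two timer thresholds in the last point can be applied mechanically after each rehearsal. This is a minimal sketch; the 15-minute and 2-minute limits come from the text above, while the wording of the returned messages is illustrative.

```python
def assess_timer(minutes: float) -> str:
    """Classify a rehearsal time against the >15 and <2 minute thresholds."""
    if minutes > 15:
        return "gap: rollback path is too slow; find and fix the dominant step"
    if minutes < 2:
        return "review: very fast; confirm a human still makes the rollback decision"
    return "ok"
```

A result of "gap" or "review" is not a failure of the rehearsal; it is the finding the rehearsal exists to produce, and it belongs in the receipt.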

How To Use It (In Your Capstone)

  1. Write the trigger list before the first real deploy. Commit it as part of the runbook.
  2. Commit library/raw/rollback-rehearsals/ and write rehearsal #001 within two days of the first prod deploy.
  3. Use your pipeline's rollback path, not the cloud console, as the primary mechanism; treat the console as a fallback.
  4. Rerun the rehearsal after any change that would affect the rollback path (new Terraform module, new env var, new migration type, new cloud service).
  5. Keep the timer honest. If the real number is 40 minutes, write 40. Shortening it on paper only hurts you.
  6. Enumerate compensating actions for side effects (emails sent, webhooks fired, payments taken) -- rollback does not unsend them.
  7. Add rollback to the runbook (concept 15) so a stranger can execute it.
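
Step 4 is easy to forget, so it can help to check it against dates you already have. The sketch below assumes you keep the date of the last rehearsal and the dates of changes that touch the rollback path (new Terraform module, new env var, new migration type, new cloud service); how you collect those dates is up to you.

```python
from datetime import date
from typing import Iterable, Optional


def needs_rerehearsal(last_rehearsal: Optional[date],
                      path_changes: Iterable[date]) -> bool:
    """True if the rollback path changed after the last rehearsal.

    A change on the same day as the rehearsal is treated as covered;
    if you are unsure of the order, rehearse again.
    """
    if last_rehearsal is None:
        return True  # Never rehearsed: always due.
    return any(change > last_rehearsal for change in path_changes)
```

For example, a rehearsal on 2026-05-03 followed by a new migration type on 2026-05-10 means the recorded timer no longer describes the current path.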

The Rehearsal-vs-Game-Day Spectrum

A rollback rehearsal sits at the low end of the chaos-engineering spectrum (see S8 M04 on game days). You are not injecting random failure; you are exercising a known path on a schedule. Start there. Once the rehearsal runs cleanly in under 5 minutes three times in a row, graduate to a mini game day: break something with more ambiguity (misconfigured secret, quota breach) and watch how your triggers and runbook behave.
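
The graduation rule can be checked against your rehearsal receipts. This is a sketch under the assumption that you record one duration per clean rehearsal, oldest first; a run that did not end in a healthy state should be recorded as a large number or excluded by hand.

```python
from typing import Sequence


def ready_for_game_day(rehearsal_minutes: Sequence[float],
                       limit: float = 5.0, streak: int = 3) -> bool:
    """True if the most recent `streak` rehearsals all finished under `limit`."""
    recent = rehearsal_minutes[-streak:]
    return len(recent) == streak and all(m < limit for m in recent)
```

Only the most recent runs count: an early 12-minute rehearsal does not block graduation, but a slow run in the last three does.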

The capstone does not require full chaos engineering. It does require that the rollback path is exercised at least once, in writing, with a timer, before a reviewer asks about MTTR.

The Two Things a Rollback Cannot Fix

  • Data changes that have already been persisted. Users who submitted forms during the bad window have their data committed. Rolling back code does not undo their writes, and may leave them unable to read what they wrote. See the expand-contract concept for how to avoid this.
  • External side effects. Emails sent, webhooks fired, payments taken, third-party APIs called. Rollback does not unsend a webhook. Plan a compensating action (void, refund, apology email) for any side-effecting endpoint that could have fired during the bad window.

If your trigger list only mentions "the app," it is missing the half of the system that users actually care about.

Check Yourself

  1. What is the command you run right now to roll back? If there are two, which is primary?
  2. How many minutes did your last rehearsal take, and what was the dominant step?
  3. Which trigger would you alone catch in under 5 minutes, without a paging tool?
  4. What compensating action would you run after rolling back a side-effecting endpoint?
  5. Does your rollback path work for a migration-gated release, or is that a separate path?
  6. When was the rollback path last changed, and has it been re-rehearsed since?

Mini Drill or Application (Capstone-scoped)

  1. 30-minute rehearsal. On staging or a preview env, intentionally deploy a bad build. Time the rollback. Write library/raw/rollback-rehearsals/001.md with timestamps to the minute. Do not skip writing the receipt -- the writing is the practice.
  2. Trigger list commit. Commit the rollback trigger table as a section in RUNBOOK.md. Read it out loud at a reasonable pace; anything that takes >10 seconds to parse gets rewritten.
  3. Side-effect census. List every endpoint in your capstone that has a user-visible side effect (email, webhook, payment). For each, write the one-line compensating action. Missing rows mean an incomplete rollback plan.
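
The side-effect census in drill 3 can be kept as a small table in code so that missing rows are found automatically. The sketch below uses hypothetical endpoints and actions purely for illustration; replace them with the endpoints in your own capstone.

```python
from typing import Dict, List, NamedTuple


class CensusRow(NamedTuple):
    side_effect: str          # e.g. email, webhook, payment
    compensating_action: str  # one line; empty means not yet planned


def missing_compensations(census: Dict[str, CensusRow]) -> List[str]:
    """Endpoints whose side effect has no compensating action written down."""
    return sorted(
        endpoint for endpoint, row in census.items()
        if not row.compensating_action.strip()
    )


# Hypothetical example census; endpoint names are not from any real service.
EXAMPLE_CENSUS = {
    "POST /signup": CensusRow("email", "send apology email"),
    "POST /checkout": CensusRow("payment", "void or refund the charge"),
    "POST /hooks/notify": CensusRow("webhook", ""),
}
```

An empty result means every side-effecting endpoint has a planned compensating action; anything returned is an incomplete part of the rollback plan.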

Source Backbone

Capstone deployment applies cloud, delivery, and operations material. These books are the source backbone for the delivery decisions.