Blast Radius and Safe-by-Default Patterns

What This Concept Is

Blast radius = the set of resources (and data) a single terraform apply can destroy if it goes wrong. Small blast radius = one bad apply damages one service. Large blast radius = one bad apply destroys the company.

Terraform does not enforce blast radius; you do, through a combination of:

  • Stack decomposition -- multiple smaller state files instead of one giant one
  • Environment isolation -- per-env credentials, per-env state, per-env pipelines
  • Lifecycle rules -- prevent_destroy, create_before_destroy, ignore_changes
  • Targeted applies -- -target=... as a scalpel, not a default
  • Guardrails at the provider layer -- IAM deny policies, resource deletion protection

"Safe by default" means: when the reviewer is tired and the clock says 5:30 p.m. on a Friday, the defaults must protect them.

Why It Matters Here

Large-scope state files are where careers end. A single root module with 600 resources, run by a bot on every merge, is one typo away from a company-wide outage. The instinct "let's put everything in one repo" is fine; the instinct "let's put everything in one state file" is not.

Safe-by-default patterns are cheap to add at the start and expensive to retrofit after an incident. The cost-effective answer is always to design them in.

Concrete Example: prevent_destroy

resource "aws_db_instance" "primary" {
  identifier                = "acme-prod-primary"
  engine                    = "postgres"
  engine_version            = "15.4"
  instance_class            = "db.t3.medium"
  allocated_storage         = 100
  final_snapshot_identifier = "acme-prod-primary-final"

  lifecycle {
    prevent_destroy = true
  }
}

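prevent_destroy = true makes any plan that would destroy this resource fail with an error, including a plan produced by terraform destroy. To actually delete it, an engineer has to edit the code (remove the lifecycle block), review it, and apply -- a deliberate, auditable act. One caveat: the guard lives inside the resource block, so deleting the whole block from the configuration removes the guard along with it. Code review has to catch that case.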

This is especially valuable for:

  • production databases
  • DNS zones (a recreated zone is typically assigned new name servers, so delegation has to be updated and every record rebuilt)
  • Route 53 records pointing to customer-facing endpoints
  • S3 buckets whose names are in third-party URLs
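
The same lifecycle block carries over to the other resource types in this list. A minimal sketch, assuming a hypothetical zone name and bucket name (neither comes from the original example):

```hcl
# Hypothetical names; only the lifecycle blocks are the point here.
resource "aws_route53_zone" "public" {
  name = "example.com" # assumed domain

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket" "public_assets" {
  bucket = "acme-public-assets" # assumed bucket name that appears in third-party URLs

  lifecycle {
    prevent_destroy = true
  }
}
```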

Concrete Example: Stack Decomposition

One big stack (large blast radius):

infra/
  main.tf    # VPC + RDS + ECS + IAM + S3 + ALB + Route53 + monitoring (one state)

One bad plan ≈ every production resource at risk.

Decomposed stacks (small blast radius):

infra/
  stacks/
    network/       # VPC + subnets + NAT (own state)
    data/          # RDS + S3 + ElastiCache (own state, prevent_destroy on DB)
    compute/       # ECS + ALB (own state, reads from network and data via remote_state)
    monitoring/    # CloudWatch dashboards, alarms (own state)

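Bad plans are contained. An apply in compute/ cannot accidentally delete the RDS instance in data/, because that resource is not in its state. The cost is coordination: if compute/ needs a subnet ID, it reads data.terraform_remote_state.network.outputs.subnet_ids, which the network stack has to publish as an output.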
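
The wiring between two stacks might look like the sketch below. The bucket name, state key, region, and the aws_subnet.private resource are assumptions for illustration; only the subnet_ids output name comes from the text above.

```hcl
# stacks/network/outputs.tf
# Assumes the network stack defines aws_subnet.private with count.
output "subnet_ids" {
  value = aws_subnet.private[*].id
}

# stacks/compute/data.tf
# Read-only view of the network stack's outputs; compute/ cannot
# modify or destroy anything in that state through this data source.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "acme-terraform-state"           # hypothetical bucket
    key    = "prod/network/terraform.tfstate" # hypothetical key
    region = "us-east-1"                      # assumed region
  }
}

locals {
  subnet_ids = data.terraform_remote_state.network.outputs.subnet_ids
}
```

Only values the network stack explicitly declares as outputs are visible here, which keeps the contract between stacks small and reviewable.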

Concrete Example: -target as a Scalpel

terraform plan -target=aws_security_group.api_public -out=tfplan
terraform apply tfplan

-target restricts the plan to a single resource (and its dependencies). Use it to:

  • recover from a partially-failed apply
  • roll out a narrow fix during an incident
  • test a single resource's creation in a new environment

Never use -target as a default workflow. It skips resources; it can leave state inconsistent with config; subsequent non-targeted plans will show unexpected diffs. -target is an emergency tool.

Concrete Example: create_before_destroy

By default, Terraform replaces a resource by destroying the old one and then creating the new one. For stateful replacements that order is correct; for resources that live traffic depends on (ALBs, launch templates) it opens an outage window between the destroy and the create. Flip the order:

resource "aws_launch_template" "web" {
  # ...

  lifecycle {
    create_before_destroy = true
  }
}

Now a replacement creates the new launch template first, then destroys the old one. Use this pattern for:

  • launch templates
  • ALB target groups rotated into an ASG
  • certificates rotated into a listener
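
For the target group case, the old and new copies have to exist at the same time, so a fixed name would collide. One common way around that is name_prefix, sketched here with a hypothetical variable, port, and prefix:

```hcl
variable "vpc_id" {
  type = string # hypothetical input; supply the VPC the ALB lives in
}

resource "aws_lb_target_group" "web" {
  # name_prefix lets the provider generate a unique name per replacement,
  # so the new target group can be created while the old one still exists.
  name_prefix = "web-" # the AWS provider limits this prefix to 6 characters
  port        = 8080   # assumed application port
  protocol    = "HTTP"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```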

The Safe-by-Default Checklist

For any new root module:

  1. State is remote with locking (Cluster 3 Concept 09).
  2. Environments are isolated at the state level.
  3. Stateful resources (databases, DNS zones, KMS keys) have prevent_destroy.
  4. CI requires PR plans, human review for prod, and does not auto-apply to prod without explicit approval.
  5. No resource depends on -target to be applied safely; if a resource can only be applied with -target, it probably belongs in its own stack.
  6. Provider credentials are per-environment, narrowly scoped. The CI runner for staging cannot authenticate against prod.
  7. terraform destroy on prod requires multi-person approval or is impossible-by-design.
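
Items 1, 2, and 6 can be sketched in one root module. Everything named here (bucket, lock table, account ID, role ARN, region) is a placeholder, and the DynamoDB lock table is only one locking option; newer Terraform releases also offer S3-native locking, so check what your version supports.

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state-prod" # hypothetical per-env bucket
    key            = "data/terraform.tfstate"    # one key per stack
    region         = "us-east-1"                 # assumed region
    encrypt        = true
    dynamodb_table = "terraform-locks"           # hypothetical lock table
  }
}

provider "aws" {
  region = "us-east-1" # assumed region

  # The provider refuses to run if the credentials resolve to any other
  # account, so staging credentials cannot touch prod by mistake.
  allowed_account_ids = ["111111111111"] # placeholder prod account ID

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-prod-apply" # placeholder role
  }
}
```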

Common Confusion / Misconception

"prevent_destroy locks the resource forever." It prevents destruction via Terraform. An engineer can still remove the lifecycle block and reapply. It is a speed bump, not a seal.

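Because prevent_destroy only guards deletes that go through Terraform, it can be paired with the provider-level deletion protection mentioned earlier. A sketch for the database example, with the other arguments elided as in the original:

```hcl
resource "aws_db_instance" "primary" {
  # ... same arguments as the prevent_destroy example ...

  # Enforced by the RDS API itself: console, CLI, and Terraform deletes
  # all fail until this flag is turned off in a separate change.
  deletion_protection = true

  lifecycle {
    prevent_destroy = true # enforced by Terraform at plan time
  }
}
```
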
"Stack decomposition means microservices for infra." Not quite. The right granularity is usually "one stack per operational boundary" -- network, data, per-service compute -- not "one stack per resource."

"create_before_destroy is always better." It only applies to resources that can coexist with a copy of themselves. Databases, most S3 buckets (name conflict), and fixed-name IAM roles cannot. For those, the correct answer is a moved block, an import, or a planned migration.

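When the goal is to rename or adopt a resource rather than replace it, Terraform can be told so declaratively. A sketch with hypothetical addresses and bucket name; moved blocks need Terraform 1.1 or later and import blocks need 1.5 or later:

```hcl
# Rename in config without destroy/create: state follows the new address.
# Assumes a resource "aws_db_instance" "main" block now exists in the config.
moved {
  from = aws_db_instance.primary
  to   = aws_db_instance.main
}

# Adopt an existing bucket into state instead of creating a second one.
import {
  to = aws_s3_bucket.assets
  id = "acme-public-assets" # hypothetical existing bucket name
}

resource "aws_s3_bucket" "assets" {
  bucket = "acme-public-assets"
}
```
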
"If plan approval is required, -target is fine as a regular workflow." No. -target hides resources from the plan, so the plan a reviewer approves does not match the total state mutation over time. Save it for incidents.

How To Use It

  1. Default on every stateful resource: lifecycle { prevent_destroy = true }. Remove only with a PR.
  2. Decompose from day one -- even if the first decomposition is just network vs everything else.
  3. Write the "what would need to happen to destroy prod?" answer down. If it is "one terraform destroy," you have work to do.
  4. Restrict who (and what IAM role) can run apply against prod. Most engineers should only apply to dev.
  5. Treat every new -target usage as a small incident: document why it was necessary and what would make it unnecessary next time.
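
For step 4, one option is an explicit deny policy attached to every role that should never delete stateful resources. This is a sketch, not a complete permission model: the policy name and the choice of actions are assumptions, and the attachment to specific roles is left out.

```hcl
data "aws_iam_policy_document" "deny_stateful_deletes" {
  statement {
    sid    = "DenyStatefulDeletes"
    effect = "Deny"

    # An explicit Deny overrides any Allow the role has elsewhere.
    actions = [
      "rds:DeleteDBInstance",
      "route53:DeleteHostedZone",
      "s3:DeleteBucket",
      "kms:ScheduleKeyDeletion",
    ]

    resources = ["*"]
  }
}

resource "aws_iam_policy" "deny_stateful_deletes" {
  name   = "deny-stateful-deletes" # hypothetical policy name
  policy = data.aws_iam_policy_document.deny_stateful_deletes.json
}
```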

Check Yourself

  1. Why does splitting one giant state into four smaller states usually reduce blast radius without meaningfully hurting cohesion?
  2. A plan shows -/+ on an aws_launch_template. Does create_before_destroy help, hurt, or have no effect? Why?
  3. Why is -target acceptable during an incident but not as a normal workflow?

Mini Drill or Application

Take a small existing Terraform repo (your own or an open-source example). In 20 minutes:

  • List every stateful resource (DB, bucket, DNS zone, KMS key) that is not protected by prevent_destroy.
  • Write the one-line PR description for a change that adds prevent_destroy to each, grouped by stack.
  • Identify one resource that would benefit from create_before_destroy and explain why.

This is a real review skill; senior engineers do it reflexively when they open a repo for the first time.

Source Backbone

Infrastructure-as-code details are tool-specific, but these local books provide the operational backbone for shell, Git, and change discipline.