Blast Radius and Safe-by-Default Patterns
What This Concept Is
Blast radius = the set of resources (and data) a single terraform apply can destroy if it goes wrong. Small blast radius = one bad apply damages one service. Large blast radius = one bad apply destroys the company.
Terraform does not enforce blast radius; you do, through a combination of:
- Stack decomposition -- multiple smaller state files instead of one giant one
- Environment isolation -- per-env credentials, per-env state, per-env pipelines
- Lifecycle rules -- prevent_destroy, create_before_destroy, ignore_changes
- Targeted applies -- -target=... as a scalpel, not a default
- Guardrails at the provider layer -- IAM deny policies, resource deletion protection
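The environment-isolation item above can be sketched as a prod-only backend and provider configuration. This is a minimal sketch, not a recommended layout: the bucket, lock table, account ID, and role name are hypothetical placeholders. The point is only that prod state and prod credentials are named explicitly and shared with no other environment.

```hcl
# Hypothetical prod stack: state and credentials are prod-only.
terraform {
  backend "s3" {
    bucket         = "acme-tfstate-prod"         # assumed name; one bucket per environment
    key            = "network/terraform.tfstate" # one key per stack
    region         = "us-east-1"
    dynamodb_table = "acme-tf-locks-prod"        # assumed lock table for state locking
    encrypt        = true
  }
}

provider "aws" {
  region = "us-east-1"

  # Refuse to run against any account other than prod (placeholder ID).
  allowed_account_ids = ["111111111111"]

  assume_role {
    # Assumed role name; only the prod pipeline should be allowed to assume it.
    role_arn = "arn:aws:iam::111111111111:role/terraform-prod-apply"
  }
}
```

With allowed_account_ids set, the provider fails early if the credentials in use resolve to a different account, which catches a staging runner pointed at the prod stack by mistake.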
"Safe by default" means: when the reviewer is tired and the clock says 5:30 p.m. on a Friday, the defaults must protect them.
Why It Matters Here
Large-scope state files are where careers end. A single root module with 600 resources, run by a bot on every merge, is one typo away from a company-wide outage. The instinct "let's put everything in one repo" is fine; the instinct "let's put everything in one state file" is not.
Safe-by-default patterns are cheap to add at the start and expensive to retrofit after an incident. The cost-effective answer is always to design them in.
Concrete Example: prevent_destroy
resource "aws_db_instance" "primary" {
  identifier                = "acme-prod-primary"
  engine                    = "postgres"
  engine_version            = "15.4"
  instance_class            = "db.t3.medium"
  allocated_storage         = 100
  final_snapshot_identifier = "acme-prod-primary-final"

  lifecycle {
    prevent_destroy = true
  }
}
prevent_destroy = true makes any plan that would destroy this resource fail, even if terraform destroy was invoked. To actually delete it, an engineer has to edit the code (remove the lifecycle block), review it, and apply -- a deliberate, auditable act.
This is especially valuable for:
- production databases
- DNS zones (deleting a zone deletes its records, and a recreated zone usually gets different name servers)
- Route 53 records pointing to customer-facing endpoints
- S3 buckets whose names are in third-party URLs
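The same lifecycle rule carries over to the other resource types in this list. A minimal sketch, assuming a hypothetical zone name and bucket name:

```hcl
# Hypothetical public zone; any plan that would destroy it fails.
resource "aws_route53_zone" "public" {
  name = "example.com"

  lifecycle {
    prevent_destroy = true
  }
}

# Hypothetical bucket whose name is embedded in third-party URLs.
resource "aws_s3_bucket" "assets" {
  bucket = "acme-public-assets"

  lifecycle {
    prevent_destroy = true
  }
}
```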
Concrete Example: Stack Decomposition
One big stack (large blast radius):
infra/
  main.tf   # VPC + RDS + ECS + IAM + S3 + ALB + Route53 + monitoring (one state)
One bad plan ≈ every production resource at risk.
Decomposed stacks (small blast radius):
infra/
  stacks/
    network/     # VPC + subnets + NAT (own state)
    data/        # RDS + S3 + ElastiCache (own state, prevent_destroy on DB)
    compute/     # ECS + ALB (own state, reads from network and data via remote_state)
    monitoring/  # CloudWatch dashboards, alarms (own state)
Bad plans are contained. Applies to compute/ cannot accidentally delete the RDS in data/. The cost is coordination: if compute/ needs a subnet ID, it reads data.terraform_remote_state.network.outputs.subnet_ids.
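The cross-stack read described above might look like the following. This is a sketch under assumptions: the state bucket and key are hypothetical, the load balancer name is invented for illustration, and the network stack is assumed to declare an output named subnet_ids.

```hcl
# stacks/compute/main.tf (illustrative)

# Read-only view of the network stack's outputs. Bucket and key are assumed.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "acme-tfstate-prod"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Hypothetical load balancer placed in subnets owned by the network stack.
resource "aws_lb" "api" {
  name               = "acme-api"
  load_balancer_type = "application"
  subnets            = data.terraform_remote_state.network.outputs.subnet_ids
}
```

The compute stack can read the subnet IDs but holds none of the network resources in its own state, so a plan in compute/ has no way to propose destroying them.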
Concrete Example: -target as a Scalpel
terraform plan -target=aws_security_group.api_public -out=tfplan
terraform apply tfplan
-target restricts the plan to a single resource (and its dependencies). Use it to:
- recover from a partially-failed apply
- roll out a narrow fix during an incident
- test a single resource's creation in a new environment
Never use -target as a default workflow. It skips resources; it can leave state inconsistent with config; subsequent non-targeted plans will show unexpected diffs. -target is an emergency tool.
Concrete Example: create_before_destroy
By default, when a change forces replacement, Terraform destroys the old resource, then creates the new one. For stateful replacements that is correct; for resources that must stay available (ALBs, launch templates) it causes an outage window. Flip the order:
resource "aws_launch_template" "web" {
  # ...

  lifecycle {
    create_before_destroy = true
  }
}
Now a replacement creates the new launch template first, then destroys the old one. Use this pattern for:
- launch templates
- ALB target groups rotated into an ASG
- certificates rotated into a listener
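For the target group case, the old and new copies have to exist at the same time, so a fixed name would collide. One common way around that is name_prefix, sketched below; the prefix, port, and vpc_id variable are assumptions for illustration.

```hcl
variable "vpc_id" {
  type        = string
  description = "VPC for the target group (assumed to be supplied by the caller)."
}

resource "aws_lb_target_group" "web" {
  # A prefix instead of a fixed name lets the replacement coexist with the original.
  name_prefix = "web-"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```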
The Safe-by-Default Checklist
For any new root module:
- State is remote with locking (Cluster 3 Concept 09).
- Environments are isolated at the state level.
- Stateful resources (databases, DNS zones, KMS keys) have prevent_destroy.
- CI requires PR plans, human review for prod, and does not auto-apply to prod without explicit approval.
- No resource is managed only through -target; if a resource can only be applied with -target, it probably belongs in its own stack.
- Provider credentials are per-environment, narrowly scoped. The CI runner for staging cannot authenticate against prod.
- terraform destroy on prod requires multi-person approval or is impossible-by-design.
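One way to approach "impossible-by-design" for the last item is an explicit IAM deny attached to the role the prod pipeline uses. The sketch below only defines the policy; the policy name and the choice of actions are assumptions, and attaching it to a specific role is left out.

```hcl
# Deny the API calls that delete stateful resources, regardless of other grants.
data "aws_iam_policy_document" "deny_stateful_deletes" {
  statement {
    sid    = "DenyStatefulDeletes"
    effect = "Deny"

    actions = [
      "rds:DeleteDBInstance",
      "route53:DeleteHostedZone",
      "kms:ScheduleKeyDeletion",
      "s3:DeleteBucket",
    ]

    resources = ["*"]
  }
}

# Hypothetical policy name; attach to the CI apply role for prod.
resource "aws_iam_policy" "deny_stateful_deletes" {
  name   = "terraform-ci-deny-stateful-deletes"
  policy = data.aws_iam_policy_document.deny_stateful_deletes.json
}
```

Because an explicit Deny overrides any Allow in IAM policy evaluation, a terraform destroy run under that role fails at the provider layer even if the lifecycle rules have been removed from the code.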
Common Confusion / Misconception
"prevent_destroy locks the resource forever." It only blocks destruction planned through Terraform while the lifecycle block is present. An engineer can still remove the lifecycle block and reapply, and deleting the entire resource block from the configuration removes the protection along with it. It is a speed bump, not a seal.
"Stack decomposition means microservices for infra." Not quite. The right granularity is usually "one stack per operational boundary" -- network, data, per-service compute -- not "one stack per resource."
"create_before_destroy is always better." It only applies to resources that can coexist with a copy of themselves. Databases, most S3 buckets (name conflict), and fixed-name IAM roles cannot. For those, the correct answer is a moved block, an import, or a planned migration.
"If plan approval is required, -target is fine as a regular workflow." No. -target hides resources from the plan, so the plan a reviewer approves does not match the total state mutation over time. Save it for incidents.
How To Use It
- Default on every stateful resource: lifecycle { prevent_destroy = true }. Remove only with a PR.
- Decompose from day one -- even if the first decomposition is just network vs everything else.
- Write the "what would need to happen to destroy prod?" answer down. If it is "one terraform destroy," you have work to do.
- Restrict who (and what IAM role) can run apply against prod. Most engineers should only apply to dev.
- Treat every new -target usage as a small incident: document why it was necessary and what would make it unnecessary next time.
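The first item can be paired with the provider-level deletion protection mentioned at the start, so that the resource stays guarded even if the lifecycle block is edited out. A minimal sketch for an RDS instance; the identifier, username, and sizing are hypothetical.

```hcl
# Hypothetical instance with two independent layers of protection.
resource "aws_db_instance" "reporting" {
  identifier                  = "acme-prod-reporting"
  engine                      = "postgres"
  instance_class              = "db.t3.medium"
  allocated_storage           = 100
  username                    = "app_admin"
  manage_master_user_password = true # let RDS manage the password in Secrets Manager
  final_snapshot_identifier   = "acme-prod-reporting-final"

  # Provider layer: the RDS API itself rejects deletion while this is true.
  deletion_protection = true

  # Terraform layer: any plan that would destroy the instance fails.
  lifecycle {
    prevent_destroy = true
  }
}
```

Deleting this instance then takes two separate reviewed changes: one to turn off deletion_protection and one to remove the lifecycle rule.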
Check Yourself
- Why does splitting one giant state into four smaller states usually reduce blast radius without meaningfully hurting cohesion?
- A plan shows -/+ on an aws_launch_template. Does create_before_destroy help, hurt, or have no effect? Why?
- Why is -target acceptable during an incident but not as a normal workflow?
Mini Drill or Application
Take a small existing Terraform repo (your own or an open-source example). In 20 minutes:
- List every stateful resource (DB, bucket, DNS zone, KMS key) that is not protected by prevent_destroy.
- Write the one-line PR description for a change that adds prevent_destroy to each, grouped by stack.
- Identify one resource that would benefit from create_before_destroy and explain why.
This is a real review skill; senior engineers do it reflexively when they open a repo for the first time.
See also (external)
- Terraform Language: resource block reference (lifecycle) -- prevent_destroy, create_before_destroy, ignore_changes, and replace_triggered_by.
- Terraform Best Practices: Stack decomposition -- per-environment and per-stack layouts with data-sharing via remote state.
Source Backbone
Infrastructure-as-code details are tool-specific, but these local books provide the operational backbone for shell, Git, and change discipline.
- Pro Git - versioned infrastructure changes, branching, review, and rollback habits.
- Git from the Bottom Up - mental model for stateful change history.
- The Linux Command Line - shell and automation grounding for infrastructure work.