Remote State, Locking, and Team Safety
What This Concept Is
These are the two operational guarantees Terraform relies on when more than one person (or CI runner) touches the same stack.
Remote state -- the terraform.tfstate file lives on a shared backend (S3, GCS, Azure Blob, HCP Terraform, Postgres, Consul), not on a laptop. Everyone runs against the same state, so state evolves once per apply regardless of who triggered it.
State locking -- before any operation that could write state (plan, apply, destroy, and state-modifying commands such as state mv), Terraform acquires an exclusive lock on the state. If another operation already holds the lock, the new one fails fast with a clear message. When the first operation finishes, the lock is released.
Remote state without locking is half the picture. Two concurrent applies against the same remote state still race.
Why It Matters Here
The horror story in Cluster 1 Concept 02 was preventable with a single remote backend + lock. Almost every "Terraform ate production" story has the same root cause: unlocked concurrent access to state.
Locking also matters for:
- CI safety -- two PRs merge in the same minute; without locks, both apply simultaneously
- Long-running applies -- a teammate starts a 10-minute apply; you need a clear error, not a silent race, if you try to plan
- Crashed runs -- a killed apply leaves a stale lock; force-unlock exists but should be rare and audited
Concrete Example
S3 + DynamoDB backend (the classic AWS setup):
terraform {
  backend "s3" {
    bucket         = "acme-tfstate-prod"
    key            = "envs/prod/infra.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
    dynamodb_table = "acme-tf-locks"
  }
}
What each line does:
- bucket -- the S3 bucket storing state. Enable versioning on this bucket so you can recover from corruption.
- key -- the path inside the bucket. Each environment has its own key; do not reuse.
- encrypt + kms_key_id -- server-side encryption at rest. Terraform state can contain secrets; encryption is not optional.
- dynamodb_table -- a table with a primary key named LockID. Terraform uses it as a lock registry.
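The bucket and lock table named in the backend block must already exist before terraform init can use them. A minimal sketch of how they might be declared in a separate bootstrap stack, assuming AWS provider v4 or later (where bucket versioning is its own resource). The resource labels tfstate and tf_locks are hypothetical; the bucket and table names are copied from the backend block.

```hcl
# Bootstrap stack (kept in its own state) that creates the backend's storage.
# Resource labels are illustrative; names match the backend block above.

resource "aws_s3_bucket" "tfstate" {
  bucket = "acme-tfstate-prod"
}

# Versioning lets you roll back to an earlier state object after corruption.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock registry: the S3 backend expects a string partition key named LockID.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "acme-tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping these resources out of the stack they serve avoids the chicken-and-egg problem of a stack whose state lives in a bucket it has not created yet.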
HCP Terraform / Terraform Cloud backend:
terraform {
  cloud {
    organization = "acme"
    workspaces {
      name = "infra-prod"
    }
  }
}
HCP Terraform includes locking and state storage as managed services; you trade explicit config for implicit behavior.
What Happens During a Locked Apply
Ordered steps when you run terraform apply:
1. Terraform calls the backend to acquire a lock on the state key.
2. If the lock is held, it fails immediately by default. With -lock-timeout=<duration> it keeps retrying until the timeout expires, then errors: Error acquiring the state lock.
3. On success, it reads the current state from the remote backend.
4. It refreshes state from the cloud provider and computes the plan.
5. It executes provider API calls for the plan.
6. It writes the new state back to the backend.
7. It releases the lock.
If step 5 fails mid-way, the partial state is still written (with whatever succeeded) and the lock is released. Never kill a terraform apply with SIGKILL; you risk state+lock corruption. Use Ctrl+C once and let the graceful shutdown finish.
What a Corrupted State File Costs
Typical recovery sequence after state corruption or loss:
- Panic: several minutes of "wait, is production safe right now?"
- Decide whether to restore from S3 versioning, a backup, or an HCP Terraform restore point.
- Compare restored state against reality with terraform plan. Expect to find differences.
- Possibly run terraform import on resources that ended up in the cloud but not in state.
- Document the incident, including whether locking would have prevented it.
Teams that have lived through this once almost always turn on S3 versioning, and then either move to HCP Terraform or add lifecycle rules that preserve old state versions for 90+ days.
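A lifecycle rule of the kind mentioned above could look roughly like this. It is a sketch, assuming AWS provider v4 or later and a bucket that already has versioning enabled; the resource label and rule id are hypothetical, and 90 days is simply the figure from the text.

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "tfstate" {
  bucket = "acme-tfstate-prod"

  rule {
    id     = "retain-old-state-versions"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket.
    filter {}

    # Keep superseded state versions for 90 days, then delete them.
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
```

The rule only ever touches noncurrent versions, so the live state object is never expired by it.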
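Common Confusion / Misconception
"Locking is automatic so I don't need to configure it." Locking is automatic only if the backend supports it and you have configured its lock mechanism (e.g., the DynamoDB table). The local backend locks only on the local filesystem, which does nothing for teammates on other machines; an S3 backend with neither a DynamoDB table nor the newer use_lockfile option does not lock at all.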
"I can force-unlock anytime." terraform force-unlock <LOCK_ID> is meant for a lock held by a dead process. Using it to clear a lock held by an active teammate's apply is exactly the race you were trying to avoid. Reach out before forcing.
"Remote state is just a backup." Remote state is the live state. Terraform reads from it and writes to it on every operation. Local copies are derivatives.
"Encryption at rest is paranoid." No -- state contains provider outputs including database passwords, TLS private keys, API tokens, and OIDC assertions. Encrypt at rest and restrict read access with IAM.
How To Use It
- Day one of any new repo, configure a remote backend with locking. Add local state files (*.tfstate and the .terraform/ directory) to .gitignore so state never lands in git.
- Use one state key per environment. Never share state keys across dev/staging/prod.
dev/staging/prod. - Turn on object-store versioning (S3 versioning, GCS object versioning) for the state bucket.
- Give each CI runner its own narrowly scoped IAM role that can read and write only the state key (and lock table) for its environment.
- When a teammate shouts "I can't acquire the lock," do not jump to force-unlock. Find out who owns it first.
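As an illustration of the narrowly scoped CI role, the policy for a prod runner might be sketched like this. The bucket, key, table, account ID, and KMS key are taken from the example backend block earlier; the data source label ci_prod_state is hypothetical, and the exact action list is an assumption that you should check against the S3 backend documentation for your Terraform version.

```hcl
data "aws_iam_policy_document" "ci_prod_state" {
  # Let the backend list the state bucket.
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::acme-tfstate-prod"]
  }

  # Read and write only the prod state object, nothing else in the bucket.
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::acme-tfstate-prod/envs/prod/infra.tfstate"]
  }

  # Acquire, inspect, and release locks in the lock table.
  statement {
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem",
    ]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/acme-tf-locks"]
  }

  # Use the KMS key that encrypts the state object.
  statement {
    actions   = ["kms:Decrypt", "kms:GenerateDataKey"]
    resources = ["arn:aws:kms:us-east-1:123456789012:key/abcd-1234"]
  }
}
```

A staging runner would get its own copy of this document pointing at its own key, so a compromised or misconfigured runner cannot read or overwrite another environment's state.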
Check Yourself
- Why does the S3 backend use DynamoDB alongside S3?
- What is the blast radius of two unlocked concurrent applies against the same state?
- When is terraform force-unlock the right call, and when is it a bug masquerading as a fix?
Mini Drill or Application
Configure an S3 backend with DynamoDB locking for any sandbox stack. Then:
- run terraform apply and, in another terminal, immediately try terraform plan
- observe the lock-contention error
- wait for the first apply to finish; retry the second command; it should succeed
If you are on a cost-sensitive account, do this with LocalStack or a comparable local AWS mock. The point is to see the lock error, not to spend AWS dollars.
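A small sandbox apply often finishes before you can switch terminals. One way to hold the lock long enough to observe contention is a resource that does nothing but wait. This is a sketch, assuming the hashicorp/time provider is acceptable in your sandbox; the resource label hold_lock and the 60-second duration are arbitrary choices.

```hcl
terraform {
  required_providers {
    time = {
      source = "hashicorp/time"
    }
  }
}

# Creating this resource takes 60 seconds, so the apply (and its state lock)
# stays open long enough to run terraform plan from a second terminal.
resource "time_sleep" "hold_lock" {
  create_duration = "60s"
}
```

Place it in the same stack as the S3 backend block from the drill; destroying the stack afterwards removes it.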
See also (external)
- Terraform Language: Backend block -- backend types, configuration, partial configuration, and how terraform init handles backend changes.
- Terraform Language: Remote state -- sharing data across stacks via remote state and the terraform_remote_state data source.
Source Backbone
Infrastructure-as-code details are tool-specific, but these local books provide the operational backbone for shell, Git, and change discipline.
- Pro Git - versioned infrastructure changes, branching, review, and rollback habits.
- Git from the Bottom Up - mental model for stateful change history.
- The Linux Command Line - shell and automation grounding for infrastructure work.