Remote State, Locking, and Team Safety

What This Concept Is

Two operational guarantees Terraform relies on when more than one person (or one CI runner) touches the same stack.

Remote state -- the terraform.tfstate file lives on a shared backend (S3, GCS, Azure Blob, HCP Terraform, Postgres, Consul), not on a laptop. Everyone runs against the same state, so state evolves once per apply regardless of who triggered it.

State locking -- before any operation that could write state (plan, apply, destroy, import, and most state subcommands), Terraform acquires an exclusive lock on the state. If another operation already holds the lock, the new one fails by default with a clear message instead of waiting. When the first operation finishes, the lock is released.

Remote state without locking is half the picture. Two concurrent applies against the same remote state still race.

Why It Matters Here

The horror story in Cluster 1 Concept 02 was preventable with a single remote backend plus a lock. Many "Terraform ate production" stories share that root cause: unlocked concurrent access to state.

Locking also matters for:

CI safety -- two PRs merge in the same minute; without locks, both apply simultaneously
Long-running applies -- a teammate starts a 10-minute apply; you need a clear error, not a silent race, if you try to plan
Crashed runs -- a killed apply leaves a stale lock; force-unlock exists but should be rare and audited

Concrete Example

S3 + DynamoDB backend (the classic AWS setup):

terraform {
  backend "s3" {
    bucket         = "acme-tfstate-prod"
    key            = "envs/prod/infra.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
    dynamodb_table = "acme-tf-locks"
  }
}

What each line does:

bucket -- the S3 bucket storing state. Enable versioning on this bucket so you can recover from corruption.
key -- the path inside the bucket. Each environment has its own key; do not reuse.
encrypt + kms_key_id -- server-side encryption at rest. Terraform state can contain secrets; encryption is not optional.
dynamodb_table -- a table whose partition (hash) key is a string attribute named LockID. Terraform uses it as a lock registry.
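The backend block assumes the bucket and the lock table already exist. A minimal sketch of how they might be created, assuming a separate bootstrap stack and AWS provider v4 or later; the resource labels (tfstate, tf_locks) and the billing mode are illustrative choices, not part of the original setup:

```hcl
# Hypothetical bootstrap stack: apply once before any other stack
# points its backend at these resources.
resource "aws_s3_bucket" "tfstate" {
  bucket = "acme-tfstate-prod" # must match "bucket" in the backend block
}

# Versioning keeps prior copies of the state object for recovery.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock registry: the partition key must be a string attribute named LockID.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "acme-tf-locks" # must match "dynamodb_table" in the backend block
  billing_mode = "PAY_PER_REQUEST" # assumption: on-demand billing suits low lock traffic
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping these resources in their own small stack avoids the chicken-and-egg problem of a stack that stores its state in a bucket it also manages.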

HCP Terraform / Terraform Cloud backend:

terraform {
  cloud {
    organization = "acme"
    workspaces {
      name = "infra-prod"
    }
  }
}

HCP Terraform includes locking and state storage as managed services; you trade explicit config for implicit behavior.
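A third variant, sketched under the assumption that you run Terraform 1.10 or later: the S3 backend can hold its lock as an object stored beside the state, so no DynamoDB table is required. Check the S3 backend documentation for your exact version before relying on it.

```hcl
terraform {
  backend "s3" {
    bucket       = "acme-tfstate-prod"
    key          = "envs/prod/infra.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # assumption: Terraform 1.10+; replaces dynamodb_table
  }
}
```

With this variant, the role running Terraform also needs permission to create and delete the lock object in the bucket, not only to read and write the state object.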

What Happens During a Locked Apply

Ordered steps when you run terraform apply:

1. Terraform calls the backend to acquire a lock on the state key.
2. If the lock is held, it fails immediately by default; with -lock-timeout=<duration> it keeps retrying until the timeout expires. Either way the failure reads: Error acquiring the state lock.
3. On success, it reads the current state from the remote backend.
4. It refreshes state from the cloud provider and computes the plan.
5. It executes provider API calls for the plan.
6. It writes the new state back to the backend.
7. It releases the lock.

If step 5 (the provider API calls) fails midway, Terraform still writes the state for whatever succeeded and then releases the lock. Never kill a terraform apply with SIGKILL: the process gets no chance to write state or release the lock, so you risk resources missing from state and a stale lock. Press Ctrl+C once and let the graceful shutdown finish.

What a Corrupted State File Costs

Typical recovery sequence after state corruption or loss:

Panic: several minutes of "wait, is production safe right now?"
Decide whether to restore from S3 versioning, a backup, or an HCP Terraform restore point.
Compare restored state against reality with terraform plan. Expect to find differences.
Possibly run terraform import on resources that ended up in the cloud but not in state.
Document the incident, including whether locking would have prevented it.

Teams that have lived through this once tend to turn on S3 versioning, then either move to HCP Terraform or add lifecycle rules that keep older state versions for 90+ days.
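A minimal sketch of such a lifecycle rule, assuming the state bucket from the Concrete Example, versioning already enabled on it, and AWS provider v4 or later. The rule id is a hypothetical name, and 90 days is the floor mentioned above rather than a recommendation:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "tfstate" {
  bucket = "acme-tfstate-prod"

  rule {
    id     = "retain-noncurrent-state" # hypothetical rule name
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object in the bucket

    # Superseded state versions are kept for 90 days, then deleted.
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
```

The current state object is never expired by this rule; only versions that have been superseded age out.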

Common Confusion / Misconception

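"Locking is automatic so I don't need to configure it." Locking is automatic only if the backend supports it and its lock store is configured. The local backend locks only the file on one machine, which does nothing for a team; the S3 backend locks only when a lock mechanism is configured (the DynamoDB table, or use_lockfile in Terraform 1.10 and later).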

"I can force-unlock anytime." terraform force-unlock <LOCK_ID> is meant for a lock held by a dead process. Using it to clear a lock held by an active teammate's apply is exactly the race you were trying to avoid. Reach out before forcing.

"Remote state is just a backup." Remote state is the live state. Terraform reads from it and writes to it on every operation. Local copies are derivatives.

"Encryption at rest is paranoid." No -- state stores resource attributes and outputs in plaintext, which can include database passwords, TLS private keys, API tokens, and OIDC assertions. Encrypt at rest and restrict read access with IAM.

How To Use It

On day one of any new repo, configure a remote backend with locking. Add local state files (*.tfstate, *.tfstate.backup, and the .terraform/ directory) to .gitignore so they never reach Git.
Use one state key per environment. Never share state keys across dev/staging/prod.
Turn on object-store versioning (S3 versioning, GCS object versioning) for the state bucket.
Give each CI runner its own narrowly scoped IAM role that can read and write only the state keys for its environment.
When a teammate shouts "I can't acquire the lock," do not jump to force-unlock. Find out who owns it first.
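The narrowly scoped CI role from the list above might carry a policy like the following. This is a sketch that assumes the S3 + DynamoDB backend from the Concrete Example and reuses its bucket, key, table, and KMS key; the policy name and statement ids are hypothetical, and the account ID in the table ARN is borrowed from the example's KMS ARN.

```hcl
data "aws_iam_policy_document" "ci_prod_state" {
  # The backend lists the bucket to locate the state object.
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::acme-tfstate-prod"]
  }

  # Read and write only the prod state key, not the whole bucket.
  statement {
    sid       = "ReadWriteProdState"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::acme-tfstate-prod/envs/prod/infra.tfstate"]
  }

  # Acquire, inspect, and release locks in the lock table.
  statement {
    sid       = "UseLockTable"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/acme-tf-locks"]
  }

  # The state is encrypted with a KMS key, so the role must be able to use it.
  statement {
    sid       = "UseStateKmsKey"
    actions   = ["kms:Decrypt", "kms:GenerateDataKey"]
    resources = ["arn:aws:kms:us-east-1:123456789012:key/abcd-1234"]
  }
}

resource "aws_iam_policy" "ci_prod_state" {
  name   = "ci-terraform-state-prod" # hypothetical policy name
  policy = data.aws_iam_policy_document.ci_prod_state.json
}
```

This covers state access only. The permissions the runner needs to manage the actual infrastructure belong in a separate policy, so the state grant stays easy to audit.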

Check Yourself

Why does the S3 backend use DynamoDB alongside S3?
What is the blast radius of two unlocked concurrent applies against the same state?
When is terraform force-unlock the right call, and when is it a bug masquerading as a fix?

Mini Drill or Application

Configure an S3 backend with DynamoDB locking for any sandbox stack. Then:

run terraform apply and, in another terminal, immediately try terraform plan
observe the lock-contention error
wait for the first apply to finish; retry the second command; it should succeed

If you are on a cost-sensitive account, do this with LocalStack or a comparable local AWS mock. The point is to see the lock error, not spend AWS dollars.

Source Backbone

Infrastructure-as-code details are tool-specific, but these local books provide the operational backbone for shell, Git, and change discipline.

Pro Git - versioned infrastructure changes, branching, review, and rollback habits.
Git from the Bottom Up - mental model for stateful change history.
The Linux Command Line - shell and automation grounding for infrastructure work.
