
State: The Ground Truth and Its Hazards

What This Concept Is

Terraform does not infer the world from your cloud APIs on every run. It maintains a state file (terraform.tfstate) that maps each resource address in your config (e.g., aws_instance.web) to the real-world object it manages (e.g., instance i-0abc...), along with attributes and metadata.

State exists because:

  • cloud resources have provider-assigned IDs you did not write in code
  • a full API scan of a large account is too slow to do on every plan
  • some attributes (generated passwords, sensitive outputs) are only returned at create time and cannot be read back later
  • state caches the last known attributes, so plan can diff against a known set of objects instead of scanning the whole account

State is therefore the source of truth Terraform operates on. If you lose it, Terraform no longer knows what it manages, even if all the resources still exist in the cloud.

Why It Matters Here

Almost every "scary Terraform story" is a state story:

  • two engineers run apply at the same time; both write state, and one write clobbers or corrupts the other
  • someone edits state by hand and breaks the schema
  • a laptop dies with the only copy of terraform.tfstate on it
  • state is committed to git by accident, leaking secrets
  • a stale state thinks resources exist that were deleted out-of-band

Understanding state is the difference between "I use Terraform" and "I operate Terraform in a team." Cluster 3 covers remote backends and locking, which exist specifically to tame these failure modes.

Concrete Example

A tiny state file excerpt (JSON, truncated):

{
  "version": 4,
  "terraform_version": "1.7.4",
  "serial": 42,
  "lineage": "4b2a...",
  "resources": [
    {
      "mode": "managed",
      "type": "aws_s3_bucket",
      "name": "artifacts",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "attributes": {
            "id": "acme-artifacts-prod",
            "arn": "arn:aws:s3:::acme-artifacts-prod",
            "tags": { "env": "prod" }
          }
        }
      ]
    }
  ]
}

Three things worth noticing:

  • the id field is how Terraform rediscovers the real bucket on the next refresh
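  • serial increments on every state write, and lineage is a unique ID assigned when the state is first created; together they let Terraform and its backends reject a stale or unrelated state file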
  • attributes can include secrets (database passwords, TLS keys) -- this is why state must be treated as sensitive
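
For comparison, a configuration that could produce a state entry like the excerpt above might look roughly like this. It is a sketch, not the config the excerpt actually came from: the bucket name and tag are copied from the excerpt, and the output name is hypothetical. The point is that arn appears in state but is never written in config; the provider computes it and Terraform reads it back.

```hcl
# Hypothetical config matching the state excerpt above.
# The resource address aws_s3_bucket.artifacts corresponds to
# "type" + "name" in the state file.
resource "aws_s3_bucket" "artifacts" {
  bucket = "acme-artifacts-prod" # for S3 buckets this also becomes attributes.id

  tags = {
    env = "prod"
  }
}

# The ARN is provider-assigned. It lives only in state, and other
# parts of the config read it from there.
output "artifacts_arn" {
  value = aws_s3_bucket.artifacts.arn
}
```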

Backend options at a glance. The backend block controls where state is stored.

  • local (default): terraform.tfstate in the working directory. Fine for solo sandboxes; unsafe for teams.
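  • S3 + DynamoDB / Google Cloud Storage / Azure Blob: cloud-hosted object stores for state. All three support locking, but by different mechanisms: the S3 backend has traditionally used a DynamoDB lock table, while the GCS and Azure Blob backends lock natively.
  • HCP Terraform (formerly Terraform Cloud): managed backend with built-in locking, remote runs, RBAC, and Sentinel policy.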

A Horror Story

An early-stage team ran Terraform from engineer laptops with state committed to git. One Friday afternoon, two pull requests merged within the same hour. Both had been rebased against an older main. Each merge auto-ran terraform apply from a CI worker that pulled the now-ambiguous state. The later apply saw "unknown" resources and decided to delete an RDS instance, a VPC, and three Lambda functions to converge to its view of state. Recovery took 14 hours and required a point-in-time RDS restore. The root cause was not carelessness; it was a local backend with no locking, used by a team.

The fix was one backend "s3" { ... } block with a DynamoDB lock table and a one-line CI change to serialize applies. Locking would have blocked the second apply with an error instead of destroying production.
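
A minimal sketch of what such a backend block can look like. The bucket, key, region, and table names are hypothetical placeholders, not the values the team in the story used, and the bucket and lock table are assumed to exist already.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical bucket name
    key            = "prod/terraform.tfstate"  # path of the state object inside the bucket
    region         = "us-east-1"               # assumed region
    dynamodb_table = "example-terraform-locks" # hypothetical lock table; enables state locking
    encrypt        = true                      # server-side encryption for the state object
  }
}
```

After adding or changing a backend block, terraform init has to be re-run; with existing local state, terraform init -migrate-state offers to copy it to the new backend. Newer Terraform releases also offer an S3-native lock file option, so it is worth checking the S3 backend documentation for the version in use.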

Common Confusion / Misconception

"Terraform re-reads the cloud every run, so state is just a cache." Partly true -- refresh updates attribute values -- but state is also the authoritative map between config addresses and real-world IDs. Without state there is no aws_instance.web -> i-0abc mapping.

"I can edit the state file by hand." Almost never. To rename, forget, or adopt a resource, use terraform state mv, terraform state rm, or terraform import instead. Hand-edited JSON can appear to work for a small fix and then break a later plan or apply, with no obvious link back to the edit.

"Committing state to git is fine if the repo is private." Private repos still have many collaborators, forks, clones, and backups, and anything committed stays in history. Secrets in state therefore reach every person and system with read access to the repo. Use a remote backend.

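On the hand-editing point above: recent Terraform versions can also express the same state operations declaratively in config, which keeps them reviewable in a pull request. The fragments below are a sketch under assumptions: moved blocks need Terraform 1.1 or later, import blocks 1.5 or later, and removed blocks 1.7 or later. Every resource name other than aws_s3_bucket.artifacts is hypothetical, and the fragments are independent illustrations rather than one config to apply as-is.

```hcl
# Rename in state without destroy/recreate (declarative "terraform state mv").
# Assumes the resource block is now named build_artifacts in config.
moved {
  from = aws_s3_bucket.artifacts
  to   = aws_s3_bucket.build_artifacts
}

# Adopt an existing object into state (declarative "terraform import").
# Assumes a matching resource "aws_s3_bucket" "logs" block exists;
# the bucket name is a made-up example.
import {
  to = aws_s3_bucket.logs
  id = "example-logs-bucket"
}

# Forget a resource without destroying it (declarative "terraform state rm").
# Assumes the resource "aws_instance" "legacy" block was deleted from config.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```
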
How To Use It

  • Choose a remote backend on day one, not after the first incident.
  • Turn on state locking. Most remote backends support it, but not every backend does, so confirm before relying on it (see Concept 09).
  • Never commit terraform.tfstate to git. Add it to .gitignore in every repo template.
  • Treat state as sensitive: encrypt at rest, restrict read access.
  • Back up state. Object-store backends (S3, GCS) can keep every prior version of the state file once versioning is enabled on the bucket; turn it on.
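
The checklist above can be sketched for the S3 case roughly as follows. This assumes AWS provider v4 or later (where versioning and encryption are separate resources) and uses hypothetical names throughout. Storage like this is usually created from a small separate bootstrap config, since a backend cannot store its own state before it exists.

```hcl
# Bucket that holds state files. The name is a hypothetical placeholder.
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-terraform-state"
}

# Back up state: keep every prior version of each state object.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Treat state as sensitive: encrypt at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Treat state as sensitive: block all public access.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table for the S3 backend, which expects a string hash key named LockID.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "example-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Read access still has to be restricted separately through IAM policies on the bucket and table; that part is omitted here.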

Check Yourself

  1. Why can you not "just recreate state" by scanning your cloud account?
  2. What part of a terraform plan run reads the state file versus the real cloud?
  3. You join a team whose state file lives on one engineer's laptop. Name three things you change in week one.

Mini Drill or Application

Set up a tiny Terraform config locally (one null_resource or one tagged bucket). Open terraform.tfstate after apply. Identify:

  • the serial and lineage fields
  • the resource id your provider assigned
  • one attribute you did not write in the config

Then change one argument (a tag on the bucket, or a trigger on the null_resource), run terraform plan, and identify which parts of the plan come from state and which come from refreshing against the cloud API.
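
One possible starting point for the drill, using null_resource so that no cloud credentials are needed. The resource name and trigger key are arbitrary choices for illustration.

```hcl
terraform {
  required_providers {
    null = {
      source = "hashicorp/null"
    }
  }
}

# No backend block, so state goes to ./terraform.tfstate (the local default).
resource "null_resource" "drill" {
  # Changing any value in triggers forces a replacement, so editing
  # "v1" to "v2" stands in for the argument change in the drill.
  triggers = {
    note = "v1"
  }
}
```

After terraform init and terraform apply, the id attribute of null_resource.drill in terraform.tfstate is a value that appears nowhere in this config, which covers the "attribute you did not write" item in the list above.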

Source Backbone

Infrastructure-as-code details are tool-specific, but these local books provide the operational backbone for shell, Git, and change discipline.