Schema Migrations: Adding/Removing Columns, Online Migrations

What This Concept Is

A schema migration is a change to the database's structure: a new column, a new constraint, a renamed table, a split table, a merged table. Production migrations are run against live data with live traffic and therefore have two axes you cannot ignore:

Locking behavior: which statements block reads or writes on which objects, for how long?
Compatibility window: is the application able to run both before and after the migration? Is there a version of the application that runs correctly against both the old and the new schema?

The online migration pattern is:

Expand: add the new shape without removing the old. Both shapes coexist.
Backfill: populate the new shape from the old, in small batches, without locking the table.
Cut over: flip the application to read and write the new shape (often via a feature flag or deployment).
Contract: drop the old shape.

Each step is independently reversible. Each step runs while the system serves traffic.

Why It Matters Here

the moment your database has non-trivial data, you cannot ALTER TABLE ... ADD NOT NULL DEFAULT X without thinking
schema changes that look identical differ sharply in cost: PostgreSQL adding a nullable column with no default is metadata-only and fast; adding NOT NULL DEFAULT expensive_function() used to rewrite the entire table (improved in PG 11+ but still a footgun)
every irreversible migration is a potential outage; expand/backfill/contract makes everything reversible

Concrete Example

Goal: add customer.country (non-null, derived from billing_address.country), to a 200 M-row table without downtime.

Expand:

ALTER TABLE customer ADD COLUMN country CHAR(2) NULL;
-- Metadata-only in modern PostgreSQL. Immediate.

Deploy application version N+1 that writes both old and new columns (dual-write).

Backfill in chunks:

-- Run in a loop until 0 rows affected, outside the main transaction.
UPDATE customer
SET country = ba.country_code
FROM billing_address ba
WHERE ba.customer_id = customer.id
  AND customer.country IS NULL
  AND customer.id IN (
      SELECT id FROM customer WHERE country IS NULL LIMIT 10000
  );

Each batch is small enough to commit without holding long locks. When SELECT COUNT(*) FROM customer WHERE country IS NULL hits 0, move on.

Cut over:

-- Add a NOT VALID constraint (cheap; validates without a full scan).
ALTER TABLE customer
    ADD CONSTRAINT customer_country_not_null CHECK (country IS NOT NULL) NOT VALID;

-- Then validate it (scans the table but does not lock writes).
ALTER TABLE customer VALIDATE CONSTRAINT customer_country_not_null;

Deploy application version N+2 that reads and writes only the new column.

Contract: in a later release, drop the dual-write, and if desired drop the CHECK and replace it with a proper ALTER COLUMN ... SET NOT NULL (now cheap because the constraint already proved it).

Common Confusion / Misconception

"ALTER TABLE is always fast." Some variants take an ACCESS EXCLUSIVE lock and rewrite the entire table. Renaming, changing types, and adding columns with non-trivial defaults are the usual culprits. Always check your engine's docs for lock level.

"I can add a NOT NULL column with a default in one statement." You can, but historically on some engines this rewrites the table. PostgreSQL 11+ stores the default in the catalog and avoids the rewrite if the default is a constant; non-constant defaults still rewrite.

"Rolling back a migration means running the down script." Rolling back a committed migration may mean losing data, especially for drops and renames. Expand-contract is the reason that rollback, in practice, is "don't do step N+2 until you are confident in step N+1."

How To Use It

For every schema change:

Classify the change as additive (safe), backfilling (requires expand), or destructive (requires contract).
Check lock behavior in your engine's docs. If it takes ACCESS EXCLUSIVE on a big table, rewrite as expand/backfill/contract.
Make every intermediate state valid for the current application version.
Write the rollback step for each step.
Test on a realistic-size dataset before production.
Do not combine two destructive steps in one release.

Check Yourself

Why do we add a NOT VALID constraint first and VALIDATE it separately?
What is the risk of a single-statement ALTER TABLE ADD COLUMN c INT NOT NULL DEFAULT 0 on a 1 TB table?
When is a "rename column" actually two migrations in production?

Mini Drill or Application

Plan an online migration for each of the following, with expand/backfill/cut-over/contract phases and the DDL for each:

Split users.name into users.first_name and users.last_name.
Add a UNIQUE constraint on users.email where existing data may contain duplicates.
Change price NUMERIC(10,2) to price_cents BIGINT across a schema that uses it in joins.

For each, name the application-side changes required between phases.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​