Schema Migrations: Adding/Removing Columns, Online Migrations
What This Concept Is
A schema migration is a change to the database's structure: a new column, a new constraint, a renamed table, a split table, a merged table. Production migrations are run against live data with live traffic and therefore have two axes you cannot ignore:
- Locking behavior: which statements block reads or writes on which objects, for how long?
- Compatibility window: is the application able to run both before and after the migration? Is there a version of the application that runs correctly against both the old and the new schema?
The online migration pattern is:
- Expand: add the new shape without removing the old. Both shapes coexist.
- Backfill: populate the new shape from the old, in small batches, without locking the table.
- Cut over: flip the application to read and write the new shape (often via a feature flag or deployment).
- Contract: drop the old shape.
Each step is independently reversible. Each step runs while the system serves traffic.
Why It Matters Here
- the moment your database has non-trivial data, you cannot
ALTER TABLE ... ADD NOT NULL DEFAULT Xwithout thinking - schema changes that look identical differ sharply in cost: PostgreSQL adding a nullable column with no default is metadata-only and fast; adding
NOT NULL DEFAULT expensive_function()used to rewrite the entire table (improved in PG 11+ but still a footgun) - every irreversible migration is a potential outage; expand/backfill/contract makes everything reversible
Concrete Example
Goal: add customer.country (non-null, derived from billing_address.country), to a 200 M-row table without downtime.
Expand:
ALTER TABLE customer ADD COLUMN country CHAR(2) NULL;
-- Metadata-only in modern PostgreSQL. Immediate.
Deploy application version N+1 that writes both old and new columns (dual-write).
Backfill in chunks:
-- Run in a loop until 0 rows affected, outside the main transaction.
UPDATE customer
SET country = ba.country_code
FROM billing_address ba
WHERE ba.customer_id = customer.id
AND customer.country IS NULL
AND customer.id IN (
SELECT id FROM customer WHERE country IS NULL LIMIT 10000
);
Each batch is small enough to commit without holding long locks. When SELECT COUNT(*) FROM customer WHERE country IS NULL hits 0, move on.
Cut over:
-- Add a NOT VALID constraint (cheap; validates without a full scan).
ALTER TABLE customer
ADD CONSTRAINT customer_country_not_null CHECK (country IS NOT NULL) NOT VALID;
-- Then validate it (scans the table but does not lock writes).
ALTER TABLE customer VALIDATE CONSTRAINT customer_country_not_null;
Deploy application version N+2 that reads and writes only the new column.
Contract: in a later release, drop the dual-write, and if desired drop the CHECK and replace it with a proper ALTER COLUMN ... SET NOT NULL (now cheap because the constraint already proved it).
Common Confusion / Misconception
"ALTER TABLE is always fast." Some variants take an ACCESS EXCLUSIVE lock and rewrite the entire table. Renaming, changing types, and adding columns with non-trivial defaults are the usual culprits. Always check your engine's docs for lock level.
"I can add a NOT NULL column with a default in one statement." You can, but historically on some engines this rewrites the table. PostgreSQL 11+ stores the default in the catalog and avoids the rewrite if the default is a constant; non-constant defaults still rewrite.
"Rolling back a migration means running the down script." Rolling back a committed migration may mean losing data, especially for drops and renames. Expand-contract is the reason that rollback, in practice, is "don't do step N+2 until you are confident in step N+1."
How To Use It
For every schema change:
- Classify the change as additive (safe), backfilling (requires expand), or destructive (requires contract).
- Check lock behavior in your engine's docs. If it takes
ACCESS EXCLUSIVEon a big table, rewrite as expand/backfill/contract. - Make every intermediate state valid for the current application version.
- Write the rollback step for each step.
- Test on a realistic-size dataset before production.
- Do not combine two destructive steps in one release.
Check Yourself
- Why do we add a
NOT VALIDconstraint first andVALIDATEit separately? - What is the risk of a single-statement
ALTER TABLE ADD COLUMN c INT NOT NULL DEFAULT 0on a 1 TB table? - When is a "rename column" actually two migrations in production?
Mini Drill or Application
Plan an online migration for each of the following, with expand/backfill/cut-over/contract phases and the DDL for each:
- Split
users.nameintousers.first_nameandusers.last_name. - Add a
UNIQUEconstraint onusers.emailwhere existing data may contain duplicates. - Change
price NUMERIC(10,2)toprice_cents BIGINTacross a schema that uses it in joins.
For each, name the application-side changes required between phases.