Replication Log Formats: Statement, Logical, Physical
What This Concept Is
Every replicated database is really a distributed log with storage attached. The leader writes "what happened" into a replication log; followers read that log and apply it. The interesting engineering question is what granularity to log at. Three formats dominate:
- Statement-based: log the SQL statement itself (
UPDATE users SET last_seen = NOW() WHERE id = 42). Simple and compact, but non-deterministic statements (NOW(),RAND(), auto-increment) can diverge on replicas. - Logical (row-based): log the actual row changes produced by the statement (
row 42: last_seen: 2026-04-22T10:00:00 -> 2026-04-22T10:00:01). Deterministic, verbose, tool-friendly (CDC, stream processors). - Physical (WAL / byte-level): log the bytes-on-disk changes (page X, offset Y, new bytes Z). The fastest to apply, tied to a specific storage-engine version; followers must be binary-compatible with the leader.
The format is not cosmetic. It decides what can be replicated, to what kinds of followers, and what cross-version upgrades are possible.
Why It Matters Here
Log format determines:
- Cross-version replication: physical logs break across major versions; logical ones usually survive.
- CDC / downstream consumers: analytics and stream pipelines need logical events; they cannot consume raw WAL bytes.
- Storage engines: a logical log can be replayed by a different engine (InnoDB -> analytics engine); a physical log cannot.
- Non-determinism: statement-based replication silently corrupts replicas unless every function in every statement is deterministic.
Pick the wrong format and your "we replicate" claim quietly becomes "we usually replicate."
Concrete Example
Consider UPDATE users SET rank = FLOOR(RANDOM() * 10) WHERE tier = 'gold';.
- Statement-based (MySQL
binlog_format=STATEMENT): replica runs the same SQL, gets different random values, ranks diverge. Silently wrong. - Logical (row-based) (MySQL
binlog_format=ROW, Postgreswal_level=logical): leader logs each row's old and newrank. Replica applies exact values. Correct even withRANDOM(). - Physical (Postgres streaming replication,
wal_level=replica): leader logs the page bytes modified. Replica overwrites pages byte-identically. Works regardless of the statement, but requires the replica's storage format to match the leader's.
Postgres supports both physical (primary streaming replication) and logical (for CDC via pgoutput/Debezium). MySQL offers statement, row, and mixed.
Common Confusion / Misconception
"Logical replication is strictly better." It has strictly better tool compatibility but physical replication is usually faster for pure replica-to-replica traffic because it skips the SQL parser/planner. Postgres streaming physical replication is the gold standard for HA.
"Statement-based is dead." MySQL still supports it and large installations use mixed mode. Knowing its failure modes is part of the job.
"WAL is just for durability." WAL is the logical input to both crash recovery and physical replication. On disk, WAL is why durability and replication are often the same code path.
How To Use It
When designing or debugging a replication topology, identify:
- What format does the replication log use?
- Are there any non-deterministic functions used in writes? (If statement-based, this is a correctness bug.)
- Do downstream consumers (CDC, analytics, search) need the log? If yes, logical.
- Will the cluster span a major-version upgrade? If yes, physical is going to hurt.
- How big is the log, how fast does it grow, and how long must it be retained for a slow replica to catch up?
Writer -> SQL Parser -> Planner -> Executor
|
v
Row changes -----+--- Logical log (CDC, Debezium)
| |
v +--- Physical WAL (primary streaming)
Page changes
|
v
Storage
Check Yourself
- Which replication log format can safely carry
INSERT ... VALUES (NOW())to a replica? - Why does physical replication usually break across a major Postgres upgrade?
- Why is logical replication the natural fit for change-data-capture pipelines?
- What is the risk of MySQL
STATEMENTreplication against a function likeUUID()?
Mini Drill or Application
For each scenario, pick a log format and justify:
- Single-leader Postgres with five read replicas, all on the same version.
- Postgres 14 -> Postgres 16 zero-downtime upgrade via replication.
- A Debezium pipeline that must emit every row change to Kafka.
- A MySQL leader with followers running a slightly patched custom storage engine.
- A globally distributed MySQL cluster where you want other teams to subscribe to a subset of tables.