Skip to main content

Replication Log Formats: Statement, Logical, Physical

What This Concept Is

Every replicated database is really a distributed log with storage attached. The leader writes "what happened" into a replication log; followers read that log and apply it. The interesting engineering question is what granularity to log at. Three formats dominate:

  • Statement-based: log the SQL statement itself (UPDATE users SET last_seen = NOW() WHERE id = 42). Simple and compact, but non-deterministic statements (NOW(), RAND(), auto-increment) can diverge on replicas.
  • Logical (row-based): log the actual row changes produced by the statement (row 42: last_seen: 2026-04-22T10:00:00 -> 2026-04-22T10:00:01). Deterministic, verbose, tool-friendly (CDC, stream processors).
  • Physical (WAL / byte-level): log the bytes-on-disk changes (page X, offset Y, new bytes Z). The fastest to apply, tied to a specific storage-engine version; followers must be binary-compatible with the leader.

The format is not cosmetic. It decides what can be replicated, to what kinds of followers, and what cross-version upgrades are possible.

Why It Matters Here

Log format determines:

  • Cross-version replication: physical logs break across major versions; logical ones usually survive.
  • CDC / downstream consumers: analytics and stream pipelines need logical events; they cannot consume raw WAL bytes.
  • Storage engines: a logical log can be replayed by a different engine (InnoDB -> analytics engine); a physical log cannot.
  • Non-determinism: statement-based replication silently corrupts replicas unless every function in every statement is deterministic.

Pick the wrong format and your "we replicate" claim quietly becomes "we usually replicate."

Concrete Example

Consider UPDATE users SET rank = FLOOR(RANDOM() * 10) WHERE tier = 'gold';.

  • Statement-based (MySQL binlog_format=STATEMENT): replica runs the same SQL, gets different random values, ranks diverge. Silently wrong.
  • Logical (row-based) (MySQL binlog_format=ROW, Postgres wal_level=logical): leader logs each row's old and new rank. Replica applies exact values. Correct even with RANDOM().
  • Physical (Postgres streaming replication, wal_level=replica): leader logs the page bytes modified. Replica overwrites pages byte-identically. Works regardless of the statement, but requires the replica's storage format to match the leader's.

Postgres supports both physical (primary streaming replication) and logical (for CDC via pgoutput/Debezium). MySQL offers statement, row, and mixed.

Common Confusion / Misconception

"Logical replication is strictly better." It has strictly better tool compatibility but physical replication is usually faster for pure replica-to-replica traffic because it skips the SQL parser/planner. Postgres streaming physical replication is the gold standard for HA.

"Statement-based is dead." MySQL still supports it and large installations use mixed mode. Knowing its failure modes is part of the job.

"WAL is just for durability." WAL is the logical input to both crash recovery and physical replication. On disk, WAL is why durability and replication are often the same code path.

How To Use It

When designing or debugging a replication topology, identify:

  1. What format does the replication log use?
  2. Are there any non-deterministic functions used in writes? (If statement-based, this is a correctness bug.)
  3. Do downstream consumers (CDC, analytics, search) need the log? If yes, logical.
  4. Will the cluster span a major-version upgrade? If yes, physical is going to hurt.
  5. How big is the log, how fast does it grow, and how long must it be retained for a slow replica to catch up?
  Writer -> SQL Parser -> Planner -> Executor
|
v
Row changes -----+--- Logical log (CDC, Debezium)
| |
v +--- Physical WAL (primary streaming)
Page changes
|
v
Storage

Check Yourself

  1. Which replication log format can safely carry INSERT ... VALUES (NOW()) to a replica?
  2. Why does physical replication usually break across a major Postgres upgrade?
  3. Why is logical replication the natural fit for change-data-capture pipelines?
  4. What is the risk of MySQL STATEMENT replication against a function like UUID()?

Mini Drill or Application

For each scenario, pick a log format and justify:

  1. Single-leader Postgres with five read replicas, all on the same version.
  2. Postgres 14 -> Postgres 16 zero-downtime upgrade via replication.
  3. A Debezium pipeline that must emit every row change to Kafka.
  4. A MySQL leader with followers running a slightly patched custom storage engine.
  5. A globally distributed MySQL cluster where you want other teams to subscribe to a subset of tables.

Read This Only If Stuck