
Synchronous vs Asynchronous Replication

What This Concept Is

Given a leader and N followers, the leader must decide: when a write is committed locally, does it wait for replicas before returning "success" to the client?

  • Asynchronous: leader returns success as soon as its own WAL is durable. Replicas catch up when they can. Fast, but committed writes can be lost if the leader fails before replicas receive them.
  • Fully synchronous: leader waits for all followers to ack. Strongest durability, but one slow or dead follower blocks every write. Rarely used in practice.
  • Semi-synchronous (the practical middle): leader waits for at least one follower (or a quorum) to ack before commit. Tolerates single-node failure without data loss. When too few replicas are available, it either degrades to async or stalls writes, depending on the system and its configuration.
  async:                            semi-sync (1 of 3):               full-sync:
  client  L    F1   F2              client  L    F1   F2              client  L    F1   F2
    |---->|                           |---->|                           |---->|
    |<----|  (ack immediately)        |     |--->|                      |     |--->|--->|
    |     |--->|                      |     |<---|                      |     |<---|<---|
    |     |-------->|                 |<----|  (one ack sufficient)     |<----|  (all acks)
    |     |<---|                      |     |-------->|
    |     |<--------|                 |     |<--------|
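
The three modes differ only in how many follower acks the leader waits for before reporting success. A minimal sketch of that rule follows; the names `Mode`, `required_acks`, and `can_report_success` are hypothetical and not taken from any particular database.

```python
from enum import Enum


class Mode(Enum):
    ASYNC = "async"
    SEMI_SYNC = "semi-sync"
    FULL_SYNC = "full-sync"


def required_acks(mode, followers, semi_sync_acks=1):
    """Number of follower acks the leader waits for before returning success."""
    if mode is Mode.ASYNC:
        return 0                       # the leader's own durable WAL is enough
    if mode is Mode.FULL_SYNC:
        return followers               # every follower must ack
    # Semi-sync: a configured subset (one follower here; could be a quorum).
    return min(semi_sync_acks, followers)


def can_report_success(mode, followers, acks_received):
    """True once enough followers have acked for this mode."""
    return acks_received >= required_acks(mode, followers)
```

With two followers and one ack received, `can_report_success` is true for async and semi-sync but false for full-sync, which matches the diagram above.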

Why It Matters Here

This one knob decides what "committed" really means for your system. It is the most direct expression of the durability-vs-availability tradeoff at the storage layer.

  • Async replication is the default for good reasons (latency, availability, simplicity) but silently limits durability guarantees.
  • Full-sync is safe until the first time a follower's kernel hangs, after which no writes complete.
  • Semi-sync is the mode most real-world durability claims actually refer to.

Understanding this avoids the embarrassing conversation of "we replicated, so no data loss" after an incident that lost exactly the writes that were async-in-flight.

Concrete Example

A banking ledger with a primary and two followers (replica-sync in the same rack, replica-async in another region).

  • Configuration: Postgres synchronous_commit = on, synchronous_standby_names = 'replica-sync'.
  • Every commit waits for replica-sync to flush the WAL record. Typical added latency: 1-3 ms (same rack).
  • replica-async receives WAL as fast as the network allows; it is 50-200 ms behind.

Failure scenarios:

  • Primary crashes, replica-sync has the write. Failover to replica-sync: zero data loss.
  • Primary crashes, replica-sync also crashed simultaneously (rack loss). Failover to replica-async: last ~100 ms of commits are lost.
  • replica-sync is slow (disk pressure). Every commit pays the added latency. If it stops acknowledging altogether and no other sync replica exists, writes stall entirely until the configuration is changed.
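
The first two failure scenarios reduce to one rule: after the primary dies, the loss window is the lag of the freshest surviving replica. A rough sketch of that rule, assuming the illustrative lag figures from this example (the function name and the lag table are hypothetical, not measurements):

```python
# Illustrative lag per replica, taken from the example above (not measured).
# replica-sync is 0 because every acknowledged commit was flushed there first.
REPLICA_LAG_MS = {"replica-sync": 0, "replica-async": 100}


def failover_loss_ms(failed_replicas):
    """Milliseconds of acknowledged commits lost if the primary dies and
    the freshest replica that is still alive gets promoted."""
    survivors = {
        name: lag
        for name, lag in REPLICA_LAG_MS.items()
        if name not in failed_replicas
    }
    if not survivors:
        raise ValueError("no replica left to promote")
    return min(survivors.values())
```

Calling `failover_loss_ms(set())` models the primary-only crash, and `failover_loss_ms({"replica-sync"})` models the rack loss that forces failover to replica-async.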

Common Confusion / Misconception

"Synchronous replication eliminates data loss." Only to the replicas that synchronously confirmed. A triple-fault (primary + every sync replica + network) can still lose data. Durability is always assumption-relative.

"Async is good enough because lag is small." Lag is small in steady state. The moment the leader crashes, "small lag" becomes "exactly how much data is gone." You cannot reason about this by looking at average lag; you need the worst-case lag under stress.

"Semi-sync degrades to async safely." Only if you configured it to. Many systems (MySQL semi-sync, Postgres synchronous_commit=on with a single replica) degrade to async without alerting by default, so the degraded mode silently becomes your durability regime.

How To Use It

Three questions to pick a mode:

  1. Can the business tolerate losing the last N seconds of committed writes on leader failure?
  2. Can the business tolerate all writes stalling when a replica is slow?
  3. Is there at least one replica close enough that sync latency is acceptable?

If loss-tolerance is zero: use semi-sync with at least one close replica, monitor lag on all replicas, alert when the sync quorum cannot be met.

If loss-tolerance is "a few seconds is fine": async is correct and cheaper.

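If loss-tolerance is zero and writes must also never stall behind a single slow or dead replica: you need multi-region consensus (Raft/Paxos, not classical replication), which commits once a majority acks so no single replica can block writes. That is Module 5.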
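
The decision rules can be condensed into a small helper. This is a sketch of this section's reasoning only, with hypothetical names. It uses questions 1 and 2 from the list above and leaves out question 3 (replica proximity), which decides whether the latency cost of semi-sync is acceptable.

```python
def pick_mode(max_loss_seconds, stalls_acceptable):
    """Map the loss-tolerance and stall-tolerance answers to a mode.

    max_loss_seconds: committed writes the business can lose on failover.
    stalls_acceptable: whether writes may stall while a replica is slow.
    """
    if max_loss_seconds > 0:
        return "async"        # some loss is tolerable: cheapest option
    if stalls_acceptable:
        return "semi-sync"    # zero loss, accept stalls; alert on lost quorum
    return "consensus"        # zero loss and no stalls: Raft/Paxos territory
```

For example, `pick_mode(5, True)` returns `"async"`, while `pick_mode(0, True)` returns `"semi-sync"`.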

Check Yourself

  1. What exactly does "committed" mean in async, semi-sync, and full-sync replication?
  2. Why does full-sync replication rarely appear in production?
  3. Under semi-sync with one required follower, what happens when that follower dies?
  4. A write commits on the primary and the primary immediately crashes. Under async, what is the data-loss window? Under semi-sync (with a close replica)?

Mini Drill or Application

For each workload, pick a replication mode and name the monitored alert that would catch a silent failure:

  1. Payments processor that cannot lose committed transactions.
  2. Analytics data warehouse loading nightly.
  3. A mobile app user profile store where losing the last 5 seconds of writes on failover is acceptable.
  4. Health records system in a regulated industry.
  5. Cross-region deployment where primary is in us-east and closest follower is in us-west (~70 ms RTT).

Read This Only If Stuck