Replication Lag: Read-Your-Writes and Monotonic Reads
What This Concept Is
Asynchronous replication means followers lag the leader. Lag is usually milliseconds; under load, it can balloon to seconds or minutes. When the application reads from followers (for scale), that lag becomes client-visible as read anomalies:
- Stale read: a follower returns a value older than the most recent committed write on the leader.
- Read-your-writes violation: a user writes something, refreshes the page, and does not see their own change because the refresh hit a lagging follower.
- Monotonic-reads violation: a user reads a value (from a fresh follower), then reads again and sees an older value (from a laggier follower). Data appears to travel backwards in time.
- Consistent-prefix violation: an observer sees a causally later event before the event that caused it.
The three client-visible guarantees the application can request from the system:
- Read-your-writes (RYW): after you successfully write, subsequent reads by you always reflect your write.
- Monotonic reads: if you read a value once, subsequent reads never go backwards in time.
- Consistent-prefix: if writes happen in order W1, W2, no observer sees W2 without also seeing W1.
Why It Matters Here
These anomalies are the most common way "we replicated our database" turns into "our user is confused." They do not show up in unit tests; they appear in production, under lag, for specific users. Recognizing the anomaly and prescribing the guarantee that rules it out is the core operational skill of this cluster.
Concrete Example
The users table is single-leader-replicated with two async followers that typically lag roughly 200 ms behind.
- User Alice updates her profile bio on the leader. The update commits; updated_at = T.
- Alice's browser refreshes the profile page 50 ms later. The load balancer routes the read to follower-2, which has lag = 250 ms.
- The page shows Alice's old bio. Alice reloads. This time it goes to follower-1, which is caught up. Fresh bio appears.
- Alice reloads again, lands on follower-2 (still lagging). Old bio again.
That is a read-your-writes violation (step 2) and a monotonic-reads violation (step 4 after step 3).
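The classification above can be sketched as a check over one client's read history. This is a minimal illustration, not a production detector: the `Read` record, the integer row versions (1 = old bio, 2 = new bio), and the `find_anomalies` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Read:
    replica: str   # which replica served the read
    version: int   # version of the row that replica returned (hypothetical counter)


def find_anomalies(own_write_version: int, reads: list[Read]) -> list[str]:
    """Label each read in one client's session that breaks RYW or monotonic reads."""
    anomalies = []
    highest_seen = 0  # newest version this client has observed so far
    for step, read in enumerate(reads, start=1):
        # RYW: the client must never see anything older than its own write.
        if read.version < own_write_version:
            anomalies.append(f"read {step} ({read.replica}): read-your-writes violation")
        # Monotonic reads: the client must never see anything older than it already saw.
        if read.version < highest_seen:
            anomalies.append(f"read {step} ({read.replica}): monotonic-reads violation")
        highest_seen = max(highest_seen, read.version)
    return anomalies


# Alice wrote version 2 of her bio, then read from follower-2, follower-1, follower-2.
alice_reads = [Read("follower-2", 1), Read("follower-1", 2), Read("follower-2", 1)]
for line in find_anomalies(own_write_version=2, reads=alice_reads):
    print(line)
```

Note that the last read is flagged twice: it is older than Alice's own write and older than what she has already seen, which is why the two guarantees are tracked separately.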
Three fixes, each buying a specific guarantee:
- Read-your-writes via leader reads for your own data: route reads for the current user's own profile to the leader (or to a follower whose lag is known to be smaller than the time elapsed since that user's last write). Low-tech, works.
- Read-your-writes via version tokens: the write returns a "write version" (for example, a WAL LSN); subsequent reads send the token, and the follower waits until it has caught up to at least that LSN.
- Monotonic reads via sticky sessions: route all reads from one session to the same follower. Staleness stays the same size, but does not flicker.
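The version-token fix can be sketched with a toy in-memory follower. Everything here is an assumption made for illustration: the `Follower` class, the `read_with_token` helper, the polling loop, and the timeout values do not correspond to any particular database's API.

```python
import time


class StaleReplicaError(Exception):
    """Raised when a follower does not reach the requested LSN in time."""


class Follower:
    """Toy follower: applied_lsn is the last log position it has replayed."""

    def __init__(self, name, applied_lsn=0):
        self.name = name
        self.applied_lsn = applied_lsn
        self.data = {}

    def apply(self, lsn, key, value):
        # Replay one replicated write and advance the applied position.
        self.data[key] = value
        self.applied_lsn = lsn


def read_with_token(follower, key, min_lsn, timeout_s=0.5, poll_s=0.01):
    """Serve the read only once the follower has applied at least min_lsn."""
    deadline = time.monotonic() + timeout_s
    while follower.applied_lsn < min_lsn:
        if time.monotonic() >= deadline:
            raise StaleReplicaError(
                f"{follower.name} at LSN {follower.applied_lsn}, need {min_lsn}"
            )
        time.sleep(poll_s)
    return follower.data.get(key)


follower_2 = Follower("follower-2", applied_lsn=1200)
follower_2.data["bio"] = "old bio"
token = 1234  # hypothetical LSN the leader returned for Alice's write

try:
    read_with_token(follower_2, "bio", min_lsn=token, timeout_s=0.05)
except StaleReplicaError as err:
    print("fall back to leader:", err)

follower_2.apply(1234, "bio", "new bio")  # replication catches up
print(read_with_token(follower_2, "bio", min_lsn=token))
```

On timeout the caller has to choose: retry another follower, fall back to the leader, or surface an error. If the client also raises its token to the LSN of every read it completes, the same mechanism can give monotonic reads without pinning the session to one follower.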
Common Confusion / Misconception
"Increase replication speed to fix this." You can reduce lag but never eliminate it. Any solution that depends on "lag is small" breaks when it isn't. The honest solutions are the three client-visible guarantees above, which tolerate lag rather than deny its existence.
"All read anomalies are the same bug." They are not. Read-your-writes is fixed by routing or versioning; monotonic reads is fixed by sticky sessions; consistent-prefix is fixed by causal ordering or by reading from one replica. Prescribing the wrong fix leaves the anomaly and adds complexity.
"Just read from the leader for everything." Kills read-scaling and defeats the purpose of the followers. And if the leader is remote, cross-region reads now cost 100 ms each.
How To Use It
When a stale-read bug lands in your ticket queue:
- Classify: RYW, monotonic, or consistent-prefix?
- Propose the guarantee that rules it out.
- Implement the minimum mechanism (leader-read for own-data, version tokens, sticky sessions).
- Instrument replication lag on every follower. Alert when lag exceeds the application's SLA.
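The instrumentation step can be sketched as a comparison of each follower's applied position against the leader's. This is a minimal illustration: `lagging_followers`, the LSN values, and the threshold are hypothetical, and lag is measured here in log-position units, whereas a real alert is often expressed in seconds.

```python
def lagging_followers(leader_lsn, follower_lsns, max_lag):
    """Return {follower_name: lag} for every follower whose lag exceeds max_lag."""
    lagging = {}
    for name, applied_lsn in follower_lsns.items():
        lag = leader_lsn - applied_lsn
        if lag > max_lag:
            lagging[name] = lag
    return lagging


# Hypothetical snapshot of replication positions.
positions = {"follower-1": 1234, "follower-2": 1200}
for name, lag in lagging_followers(leader_lsn=1234, follower_lsns=positions, max_lag=10).items():
    print(f"ALERT: {name} is {lag} log positions behind the leader")
```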
Wrote X=5 at LSN 1234 on leader
|
+--> follower-1 at LSN 1234 -> read returns 5 (OK)
|
+--> follower-2 at LSN 1200 -> read returns old -> anomaly
|
Fix A: send LSN 1234 with next read, follower waits until LSN >= 1234
Fix B: session pinned to follower-1
Fix C: own-data reads forced to leader
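Fix B and Fix C are routing decisions, so they can be sketched as one small function. This is an illustration under stated assumptions: `pick_replica`, the replica names, and the choice of a SHA-256 hash for session pinning are hypothetical, not a description of any particular load balancer.

```python
import hashlib

FOLLOWERS = ["follower-1", "follower-2"]  # hypothetical replica names


def pick_replica(session_id, reader_id, owner_id, followers=FOLLOWERS):
    """Choose where one read should go."""
    # Fix C: a user reading their own data goes to the leader (read-your-writes).
    if reader_id == owner_id:
        return "leader"
    # Fix B: a stable hash of the session id pins the session to one follower,
    # so successive reads never move to a laggier replica (monotonic reads).
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(followers)
    return followers[index]


print(pick_replica("sess-42", reader_id="alice", owner_id="alice"))
print(pick_replica("sess-42", reader_id="bob", owner_id="alice"))
print(pick_replica("sess-42", reader_id="bob", owner_id="alice"))
```

The last two calls return the same follower because the hash depends only on the session id. If the pinned follower fails or the follower list changes, the session is rerouted and the monotonic guarantee can break for that one transition unless a version token backs it up.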
Check Yourself
- Describe the read-your-writes anomaly with an example.
- Describe the monotonic-reads anomaly with an example.
- Why does "read from the leader" eliminate both anomalies, and why is it nevertheless rarely acceptable at scale?
- What does a write-version token (LSN, timestamp) buy you that sticky sessions do not?
Mini Drill or Application
For each bug report, name the anomaly and the guarantee that eliminates it:
- A user posts a comment. Refreshing the page shows the comment has disappeared. Refreshing again brings it back.
- A user sends a chat message, then looks at their conversation list. The list shows the old "last message" for a few seconds.
- A read-only dashboard alternates between "revenue = $100k" and "revenue = $80k" as the page refreshes.
- User Alice and user Bob both comment on Alice's post. Bob sees his reply appear before Alice's original post on his feed.
- A CI pipeline writes a status row and then queries for it. Sometimes the query returns empty.