Replication Anomalies Clinic
A clinic for diagnosing read anomalies caused by replication lag. Each section is a bug report in production. Your job is to name the anomaly, name the client-visible guarantee that rules it out, and prescribe the minimum mechanism that provides that guarantee.
Retrieval Prompts
- State the definitions of read-your-writes, monotonic reads, and consistent-prefix guarantees.
- State the difference between "stale read" and "read anomaly."
- State one mechanism that provides read-your-writes without routing every read to the leader.
- State why sticky sessions can provide monotonic reads but not read-your-writes.
- State the difference between "this read is stale" and "this write was lost."
Anomaly Classification Cheat Sheet
Before starting, memorize:
| Anomaly | What the user sees | Guarantee that rules it out |
|---|---|---|
| Stale read | Any older value from a lagging replica | (depends on application) |
| Read-your-writes violation | I wrote, then read, and my write is missing | Read-your-writes / pipelined-consistency |
| Monotonic reads violation | I read X=5, then read X=3 (went backwards) | Monotonic reads |
| Consistent-prefix violation | I saw effect before cause | Consistent-prefix / causal consistency |
| Write conflict | Two leaders accepted incompatible writes | (convergence mechanism) |
Case 1: The Vanishing Comment
Bug: user posts a comment, refreshes the page, and the comment is missing about 20% of the time. If they reload again, usually it's there.
- What anomaly is this?
- What sequence of events on the client and the database produces the bug?
- Three candidate fixes -- which is minimal?
- a. Always read from the leader.
- b. After every write, send the LSN with the next read and require the follower to be caught up.
- c. Sticky-session the user to one follower.
- Which fix also accidentally fixes "my comment sometimes appears, then disappears on the next refresh"? Why?
Case 2: The Flickering Dashboard
Bug: a read-only revenue dashboard shows revenue = $120k, then on refresh $100k, then $120k again. No writes are happening fast enough to explain legitimate changes.
- What anomaly is this?
- Why does single-leader replication with read-scaling across two followers produce this?
- What is the minimum fix?
- Would sending an LSN token help? Why or why not?
Case 3: The Reversed Conversation
Bug: on a group chat, user Alice reads a question ("what's the plan for Friday?"), writes a reply ("let's meet at 3pm"), then user Bob -- refreshing his chat -- sees Alice's reply before the question.
- What anomaly is this?
- Why is it not fixed by read-your-writes for Alice?
- What is the name of the guarantee that rules this out?
- Two implementation strategies for providing it; name one and describe it in two sentences.
Case 4: The Deceptive Checkout
Bug: a shopping-cart system on a multi-leader deployment allows two users in different regions to each buy the "last available" item. Both orders confirm. Inventory goes to -1.
- Is this a replication-lag anomaly or a write-conflict problem?
- Why does read-your-writes not fix it?
- Name two architectural fixes and explain the tradeoffs.
- Why does this scenario argue for a single-leader design over multi-leader for inventory?
Case 5: The Lagging Auditor
Bug: a compliance audit job is supposed to dump every
ordersrow created in the last 24 hours. The application writes to the leader; the audit reads from an async follower lagging 2 minutes behind. Some orders are missed entirely in a given day's dump.
- What anomaly is this?
- Why does the standard consistency vocabulary not quite describe it? (hint: it's really a scheduling bug)
- What mechanism would guarantee the audit sees all orders up to a specific timestamp?
Common Mistake Check
For each statement, identify the mistake:
- "Our replicas are close, so lag is negligible."
- "Reading from the leader always fixes stale reads."
- "If the user only writes sometimes, anomalies are rare and we can ignore them."
- "Monotonic reads and read-your-writes are the same thing."
- "A conflict is just a stale read that never catches up."
Prescriptive Drill
For each guarantee, write a one-paragraph implementation note: at which layer it lives (DB, application, client), what the runtime cost is, what it does not protect against.
- Read-your-writes
- Monotonic reads
- Consistent-prefix (causal)
- Linearizable reads
- Eventual consistency (as a non-guarantee)
Evidence Check
This clinic is complete only if:
- You have diagnosed all five cases by name.
- You have prescribed a guarantee and a mechanism for each.
- You can explain why "just use the leader" is a real but expensive fix, and you have written at least one follower-based alternative.
- You have mapped each guarantee to a real system (Postgres + application token, Cassandra local_quorum, MongoDB
readConcern: majority, etc.).