Replication Anomalies Clinic

A clinic for diagnosing read anomalies caused by replication lag. Each case is a bug report from production. Your job is to name the anomaly, name the client-visible guarantee that rules it out, and prescribe the minimum mechanism that provides that guarantee.

Retrieval Prompts

  1. State the definitions of read-your-writes, monotonic reads, and consistent-prefix guarantees.
  2. State the difference between "stale read" and "read anomaly."
  3. State one mechanism that provides read-your-writes without routing every read to the leader.
  4. State why sticky sessions can provide monotonic reads but not read-your-writes.
  5. State the difference between "this read is stale" and "this write was lost."

Anomaly Classification Cheat Sheet

Before starting, memorize:

| Anomaly | What the user sees | Guarantee that rules it out |
| --- | --- | --- |
| Stale read | Any older value from a lagging replica | (depends on application) |
| Read-your-writes violation | I wrote, then read, and my write is missing | Read-your-writes / pipelined-consistency |
| Monotonic reads violation | I read X=5, then read X=3 (went backwards) | Monotonic reads |
| Consistent-prefix violation | I saw effect before cause | Consistent-prefix / causal consistency |
| Write conflict | Two leaders accepted incompatible writes | (convergence mechanism) |

Case 1: The Vanishing Comment

Bug: user posts a comment, refreshes the page, and the comment is missing about 20% of the time. If they reload again, usually it's there.

  1. What anomaly is this?
  2. What sequence of events on the client and the database produces the bug?
  3. Of these three candidate fixes, which is minimal?
    • a. Always read from the leader.
    • b. After every write, return the write's LSN to the client; the client sends that LSN with its next read, and the read is served only by a follower that has replayed at least that far.
    • c. Sticky-session the user to one follower.
  4. Which fix also accidentally fixes "my comment sometimes appears, then disappears on the next refresh"? Why?
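
Option (b) can be sketched in a few lines. This is a minimal in-memory simulation, not a real database client: the `Leader`, `Follower`, and `read_comments` names are hypothetical, and the LSN is modeled as a plain counter. It assumes the leader hands the client the LSN of its write and that each follower can report how far it has replayed.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Hypothetical single leader: every write is assigned the next LSN."""
    lsn: int = 0
    log: list = field(default_factory=list)

    def write(self, comment: str) -> int:
        self.lsn += 1
        self.log.append((self.lsn, comment))
        return self.lsn  # the client keeps this as its session token


@dataclass
class Follower:
    """Hypothetical async follower that has replayed the log up to applied_lsn."""
    applied_lsn: int = 0
    comments: list = field(default_factory=list)

    def replay(self, leader: Leader, up_to: int) -> None:
        for lsn, comment in leader.log:
            if self.applied_lsn < lsn <= up_to:
                self.comments.append(comment)
                self.applied_lsn = lsn


def read_comments(leader: Leader, followers: list, min_lsn: int) -> list:
    """Serve from the first follower caught up to min_lsn, else fall back to the leader."""
    for follower in followers:
        if follower.applied_lsn >= min_lsn:
            return list(follower.comments)
    return [comment for _, comment in leader.log]


leader = Leader()
lagging, fresh = Follower(), Follower()

leader.write("first")
lagging.replay(leader, 1)
fresh.replay(leader, 1)

token = leader.write("my new comment")  # LSN 2, returned to the client
fresh.replay(leader, 2)                 # only one follower has caught up

naive_read = list(lagging.comments)     # a read routed to the lagging follower
token_read = read_comments(leader, [lagging, fresh], token)
```

In this run `naive_read` is missing the new comment while `token_read` contains it, and the leader is only touched when no follower has caught up. A real system would also need to decide whether to wait, retry, or fall back when every follower is behind; the fallback here is just one choice.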

Case 2: The Flickering Dashboard

Bug: a read-only revenue dashboard shows revenue = $120k, then on refresh $100k, then $120k again. No writes are happening fast enough to explain legitimate changes.

  1. What anomaly is this?
  2. Why does single-leader replication with read-scaling across two followers produce this?
  3. What is the minimum fix?
  4. Would sending an LSN token help? Why or why not?
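
The flicker can be reproduced with a small simulation. The sketch below assumes two followers frozen at different points in the replication stream and compares round-robin routing with a stable per-user routing choice. The follower names, the `crc32`-based hash, and the helper functions are illustrative assumptions, not part of any particular load balancer.

```python
import itertools
import zlib

# Hypothetical snapshot: follower "a" has replayed the latest update, "b" has not.
followers = {"a": 120_000, "b": 100_000}
names = sorted(followers)


def round_robin_reads(n: int) -> list:
    """Each refresh goes to the next follower in turn."""
    rotation = itertools.cycle(names)
    return [followers[next(rotation)] for _ in range(n)]


def sticky_reads(user_id: str, n: int) -> list:
    """Every refresh from the same user goes to the same follower."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    index = zlib.crc32(user_id.encode()) % len(names)
    return [followers[names[index]] for _ in range(n)]


def went_backwards(values: list) -> bool:
    """True if any read returned an older (smaller) value than the read before it."""
    return any(later < earlier for earlier, later in zip(values, values[1:]))


flicker = round_robin_reads(3)
steady = sticky_reads("dashboard-user-7", 3)
```

Here `flicker` is `[120000, 100000, 120000]`, the sequence from the bug report, while `steady` never changes. The sticky choice may still be stale, and it stops holding if the chosen follower fails and the user is rerouted to one that is further behind.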

Case 3: The Reversed Conversation

Bug: on a group chat, user Alice reads a question ("what's the plan for Friday?"), writes a reply ("let's meet at 3pm"), then user Bob -- refreshing his chat -- sees Alice's reply before the question.

  1. What anomaly is this?
  2. Why is it not fixed by read-your-writes for Alice?
  3. What is the name of the guarantee that rules this out?
  4. There are two common implementation strategies for providing it. Name one and describe it in two sentences.
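
One way to make the bug concrete is to model messages that arrive at Bob's replica out of order. The sketch below illustrates explicit dependency tracking, one possible strategy among several. The `Message` type, its `depends_on` field, and the `visible` function are hypothetical names introduced for this example, and it assumes each message records at most one message it was written in response to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    msg_id: str
    text: str
    depends_on: Optional[str] = None  # id of the message the author had read first


def visible(arrived: list) -> list:
    """Return the messages that are safe to show, each placed after its dependency."""
    shown, shown_ids = [], set()
    pending = list(arrived)
    progress = True
    while progress:
        progress = False
        for message in list(pending):
            if message.depends_on is None or message.depends_on in shown_ids:
                shown.append(message)
                shown_ids.add(message.msg_id)
                pending.remove(message)
                progress = True
    return shown  # anything left in pending is held back until its cause arrives


question = Message("q1", "what's the plan for Friday?")
reply = Message("r1", "let's meet at 3pm", depends_on="q1")

only_reply_arrived = visible([reply])
both_arrived = visible([reply, question])
```

With only the reply replicated, nothing is shown. Once the question arrives, both appear in causal order regardless of arrival order. The cost is that every write must carry its dependencies and every reader must check them.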

Case 4: The Deceptive Checkout

Bug: a shopping-cart system on a multi-leader deployment allows two users in different regions to each buy the "last available" item. Both orders confirm. Inventory goes to -1.

  1. Is this a replication-lag anomaly or a write-conflict problem?
  2. Why does read-your-writes not fix it?
  3. Name two architectural fixes and explain the tradeoffs.
  4. Why does this scenario argue for a single-leader design over multi-leader for inventory?
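
The double sale is easy to reproduce in a toy model. The sketch below assumes each regional leader validates stock against its local copy and ships its decrements to the other leader afterwards. `RegionLeader` and `SingleLeader` are hypothetical classes for illustration, not a model of any specific database.

```python
class RegionLeader:
    """Hypothetical regional leader that checks stock against its local copy only."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.pending = []  # decrements not yet shipped to the other leader

    def buy(self) -> bool:
        if self.stock <= 0:
            return False
        self.stock -= 1
        self.pending.append(-1)
        return True

    def apply_remote(self, deltas: list) -> None:
        # Remote decrements are applied blindly: the sale was already confirmed.
        self.stock += sum(deltas)


class SingleLeader:
    """Hypothetical single leader: the check and the decrement happen in one place."""

    def __init__(self, stock: int) -> None:
        self.stock = stock

    def buy(self) -> bool:
        if self.stock <= 0:
            return False
        self.stock -= 1
        return True


# Multi-leader: both regions see stock == 1 and both confirm.
us, eu = RegionLeader(1), RegionLeader(1)
us_ok, eu_ok = us.buy(), eu.buy()
us.apply_remote(eu.pending)
eu.apply_remote(us.pending)

# Single leader: the second buyer is rejected.
central = SingleLeader(1)
first_ok, second_ok = central.buy(), central.buy()
```

After the exchange both regional copies converge, but they converge on -1: replication did its job and the invariant was still broken. The single-leader run keeps stock at 0 at the price of a cross-region round trip for one of the buyers.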

Case 5: The Lagging Auditor

Bug: a compliance audit job is supposed to dump every orders row created in the last 24 hours. The application writes to the leader; the audit reads from an async follower lagging 2 minutes behind. Some orders are missed entirely in a given day's dump.

  1. What anomaly is this?
  2. Why does the standard consistency vocabulary not quite describe it? (hint: it's really a scheduling bug)
  3. What mechanism would guarantee the audit sees all orders up to a specific timestamp?
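
One possible shape for such a mechanism is a replay watermark: the job refuses to dump a window until the follower reports that it has replayed past the end of that window. The sketch below assumes the follower can report such a timestamp, that `created_at` matches the leader's commit order, and that replication applies commits in order. The function and field names are hypothetical.

```python
from datetime import datetime, timedelta


def follower_view(leader_orders: list, replayed_to: datetime) -> list:
    """Rows the follower can see: everything committed up to its replay position."""
    return [o for o in leader_orders if o["created_at"] <= replayed_to]


def audit_window(rows: list, start: datetime, end: datetime, replayed_to: datetime):
    """Return rows in [start, end), or None if the follower has not replayed up to end."""
    if replayed_to < end:
        return None  # the caller should retry later instead of dumping a partial window
    return [o for o in rows if start <= o["created_at"] < end]


end = datetime(2024, 1, 2)
start = end - timedelta(days=1)
leader_orders = [
    {"id": 1, "created_at": datetime(2024, 1, 1, 12, 0)},
    {"id": 2, "created_at": datetime(2024, 1, 1, 23, 59)},  # one minute before the cutoff
]

# At midnight the follower is 2 minutes behind, so order 2 is not visible yet.
behind = end - timedelta(minutes=2)
too_early = audit_window(follower_view(leader_orders, behind), start, end, behind)

# A few minutes later the follower has replayed past the cutoff.
caught_up = end + timedelta(minutes=1)
complete = audit_window(follower_view(leader_orders, caught_up), start, end, caught_up)
complete_ids = [o["id"] for o in complete]
```

The first attempt returns `None` rather than a dump that silently omits order 2. The second returns both orders. If `created_at` is assigned before commit rather than at commit, the watermark would need to be based on commit position instead, which this sketch does not model.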

Common Mistake Check

For each statement, identify the mistake:

  1. "Our replicas are close, so lag is negligible."
  2. "Reading from the leader always fixes stale reads."
  3. "If the user only writes sometimes, anomalies are rare and we can ignore them."
  4. "Monotonic reads and read-your-writes are the same thing."
  5. "A conflict is just a stale read that never catches up."

Prescriptive Drill

For each guarantee, write a one-paragraph implementation note: at which layer it lives (DB, application, client), what the runtime cost is, and what it does not protect against.

  • Read-your-writes
  • Monotonic reads
  • Consistent-prefix (causal)
  • Linearizable reads
  • Eventual consistency (as a non-guarantee)

Evidence Check

This clinic is complete only if:

  • You have diagnosed all five cases by name.
  • You have prescribed a guarantee and a mechanism for each.
  • You can explain why "just use the leader" is a real but expensive fix, and you have written at least one follower-based alternative.
  • You have mapped each guarantee to a real system (Postgres + application token, Cassandra LOCAL_QUORUM, MongoDB readConcern: "majority", etc.).