Module 3: Replication & Partitioning
Primary text: Designing Data-Intensive Applications (Kleppmann), Chapters 5 and 6
Selective support: Database Internals (Petrov) Part II, Distributed Systems: Concepts and Design (Coulouris) for theoretical framing, Database System Concepts (Silberschatz) for the relational-replication view
This guide is the primary teacher. You do not need to read DDIA Chapters 5 and 6 end-to-end to complete this module. You do need to be operationally strong at reasoning about replication topologies, partitioning strategies, and failure modes before entering Module 4 (Transactions and Consistency).
Scope of This Module
This module is where storage starts to live on more than one machine. It is not about any specific database; it is about the patterns that every distributed database instantiates.
What it covers in depth:
- why systems replicate (availability, throughput, geolocation) and why systems partition (scale beyond one node)
- the CAP intuition: network partitions are a kind of fault you cannot opt out of, so you choose between consistency and availability when one happens
- single-leader, multi-leader, and leaderless replication as three distinct algorithms with three distinct failure modes
- statement-based, logical (row-based), and physical replication log formats and when each is used
- synchronous vs asynchronous replication and what each promises about durability
- replication lag and the application-visible anomalies it creates (stale reads, non-monotonic reads, causality violations)
- range vs hash partitioning as two opposing answers to "how do we split the keyspace"
- local vs global secondary indexes in partitioned data
- rebalancing, shard splitting, and how to avoid hotspots
- request routing: how a client finds the node that owns a key
- failover and split-brain: what happens when the leader dies or is thought to have died
What it deliberately does not try to finish here:
- formal consistency models (linearizability, causal consistency, serializability): Module 4
- distributed transactions and 2PC: Module 4
- consensus algorithms (Raft, Paxos) in depth: Module 5
- time, clocks, and ordering in distributed systems: Module 5
- stream processing and derived data pipelines: later semesters
This module is the operational substrate that Module 4 (consistency) and Module 5 (distributed-systems fundamentals) formalize.
Before You Start
Answer these closed-book before starting the main path:
- Why do we replicate data? Give three different reasons, each with a distinct goal.
- Why do we partition (shard) data? What breaks if a single node has to hold the entire dataset?
- If a network partition cuts two datacenters apart, and the application requires strong consistency, what must happen to requests on the minority side?
- Given a leader-based database that uses asynchronous replication, what does it mean for a client to "read its own writes"? Why might that fail?
- You have 1 billion users keyed by
user_id. Why is range partitioning byuser_idlikely a bad idea?
Diagnostic Interpretation
4-5 solid answers
- You are ready for the full path.
2-3 solid answers
- Continue, but expect extra time in Clusters 2 (topologies) and 4 (partitioning strategies).
0-1 solid answers
- Revisit Module 1 (relational databases) and Module 2 (storage engines). This module assumes you know what a row, an index, and a log look like on a single node.
What This Module Is For
Replication and partitioning is the language every later distributed-systems claim is made in. Throughout the program you will repeatedly be asked:
- will this design survive one node failing? two? the whole rack?
- which consistency model does this replication topology give me?
- if I add a shard, how much data moves, and does the system stay available while it moves?
- given a workload, which key do I partition on, and why?
- can I reason about this failure before it happens, or only after the postmortem?
This module builds the reasoning needed for:
- transactions and consistency models (Module 4)
- distributed-systems fundamentals, time, and consensus (Module 5)
- stream processing, derived data, and event sourcing (later semesters)
- real-world incident response on replicated databases
You are learning to read a replication and partitioning design and know in advance what it can and cannot survive.
Concept Map
How To Use This Module
Work in order. Clusters 1-3 are about replication; Cluster 4 is about partitioning; Cluster 5 is where both come together in real systems.
Cluster 1: Why Replicate and Partition
| Order | Concept | Type | Focus |
|---|---|---|---|
| 1 | Replication Goals: Availability, Throughput, Geolocation | PRIMARY | Three distinct reasons to replicate and how they shape topology |
| 2 | Partitioning Goals: Scaling Beyond One Node | PRIMARY | Why data outgrows a single machine and what partitioning buys |
| 3 | The CAP Intuition: Partition-Tolerance Is Mandatory | PRIMARY | CAP as a runtime choice under a partition, not a menu of three |
Cluster mastery check: Can you explain to a new teammate why a system replicates, why it partitions, and what the CAP theorem actually forces you to choose?
Cluster 2: Replication Topologies
| Order | Concept | Type | Focus |
|---|---|---|---|
| 4 | Single-Leader Replication | PRIMARY | One leader, many followers; failover; read scaling |
| 5 | Multi-Leader Replication | PRIMARY | Conflicts, topologies (all-to-all, ring, star), convergence rules |
| 6 | Leaderless Replication (Dynamo-Style) | PRIMARY | Quorums W + R > N, read repair, hinted handoff |
Cluster mastery check: Given an application (geo-distributed writes vs single-region reads vs mostly-offline mobile) can you pick one topology and defend the choice with a failure-mode argument?
Cluster 3: Replication Mechanics
| Order | Concept | Type | Focus |
|---|---|---|---|
| 7 | Replication Log Formats: Statement, Logical, Physical | PRIMARY | Three log types and the non-determinism trap |
| 8 | Synchronous vs Asynchronous Replication | PRIMARY | Durability vs availability tradeoff, semi-sync compromise |
| 9 | Replication Lag: Read-Your-Writes and Monotonic Reads | PRIMARY | Stale reads, anomalies, client-visible guarantees |
Cluster mastery check: Given a replication setup, can you predict which reads will be stale and what guarantees the application must add on top?
Cluster 4: Partitioning Strategies
| Order | Concept | Type | Focus |
|---|---|---|---|
| 10 | Range vs Hash Partitioning | PRIMARY | Ordered scans vs uniform spread; picking a partition key |
| 11 | Secondary Indexes: Local vs Global | PRIMARY | Document-partitioned vs term-partitioned tradeoffs |
| 12 | Rebalancing, Shard Splitting, and Hotspots | PRIMARY | Adding nodes without moving everything; hotspot diagnosis |
Cluster mastery check: Given a workload, can you pick a partition key, justify range or hash, and plan how rebalancing will work as the cluster grows?
Cluster 5: Practical Systems
| Order | Concept | Type | Focus |
|---|---|---|---|
| 13 | Request Routing: Gossip, Service Discovery, Coordinators | SUPPORTING | Three routing approaches; ZooKeeper/etcd coordination |
| 14 | Case Studies: PostgreSQL, MongoDB, Cassandra | PRIMARY | Mapping three real systems to this module's concepts |
| 15 | Failover and Split-Brain | PRIMARY | Detecting dead leaders, quorum fencing, STONITH |
Cluster mastery check: Can you walk through a failover scenario in PostgreSQL, MongoDB, and Cassandra and point at what would cause split-brain in each?
Then work these practice pages:
| Order | Practice path | Focus |
|---|---|---|
| 1 | Replication Topologies Lab | Draw and walk through single-leader, multi-leader, and leaderless topologies under failure |
| 2 | Partitioning and Rebalancing Workshop | Pick partition keys, schemes, and rebalancing plans for contrasting workloads |
| 3 | Replication Anomalies Clinic | Diagnose stale-read bugs and prescribe the right session guarantee |
| 4 | Code Katas | Failover simulation, lag diagnosis, sharding design, and Jepsen report analysis |
Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.
Learning Objectives
By the end of this module you should be able to:
- Name three distinct reasons to replicate and three distinct reasons to partition, and tell a designer which of the two a given problem calls for.
- Explain the CAP theorem precisely: what partition-tolerance means, what the C/A tradeoff becomes during a partition, and why "pick 2 of 3" is misleading.
- Draw the write and read paths for single-leader, multi-leader, and leaderless replication, and name their failure modes.
- Read a replication log (statement, logical, physical) and explain which non-determinism each format protects against.
- Predict whether a given replication setup is synchronous, semi-synchronous, or asynchronous, and what durability guarantee it gives on leader failure.
- Given replication lag, name the anomalies a client can see and prescribe a guarantee (read-your-writes, monotonic reads, consistent prefix) that rules each out.
- Choose range vs hash partitioning for a workload and justify the pick with a scan-access vs load-balance argument.
- Explain when a local (document-partitioned) secondary index is enough and when you need a global (term-partitioned) one.
- Describe at least two rebalancing strategies (fixed partitions, dynamic partitioning, proportional to nodes) and what each one avoids.
- Walk through a leader failover step by step, name the ways split-brain can happen, and describe how fencing, quorum-based election, or STONITH prevents each.
Outputs
- a replication-topology notebook: for at least 6 real or described systems, name the topology, the log format, the sync mode, and the expected failure modes
- a partition-design log: for at least 6 workloads, record partition key, strategy (range/hash), rebalancing plan, and hotspot risks
- at least 3 worked failover scenarios with a timeline showing writes before, during, and after the failover
- at least 3 read-anomaly diagnoses naming the client-visible guarantee needed and the server-side mechanism that provides it
- a written summary of at least one real Jepsen report, identifying the invariant tested, the violation found, and the root cause
- a case-study comparison of PostgreSQL streaming replication, MongoDB replica sets, and Cassandra rings, aligned to this module's concept map
- a mistake log naming at least 8 recurring replication/partitioning design mistakes (
async replication to the leader's own datacenter only,partition key is timestamp,sync replication with no follower,no fencing on failover, etc.) - a short memo on at least two incidents (real or hypothetical) where the topology or partition strategy caused the outage
Completion Standard
You have completed Module 3 when all of these are true:
- you can classify a replication setup by topology, log format, and sync mode without looking it up
- you can state what partition-tolerance forces a system to choose during a network fault
- you can pick a partition key for a new workload and defend the choice in one paragraph
- you can walk through a leader failover and name every place split-brain or data loss can enter
- you can diagnose a stale-read bug to a specific missing guarantee and prescribe the fix
- you have read at least one real Jepsen report end-to-end and can restate the violation in this module's vocabulary
- you can map PostgreSQL, MongoDB, and Cassandra to the same concept map and call out the places they differ
If the replication diagram "looks right" but you cannot say what it survives and what it does not, the module is not complete.
Reading Policy
- Concept pages are the main path.
- Local book chunks (mostly DDIA Chapters 5 and 6) are selective reinforcement, not a second syllabus.
Read This Only If Stuckmeans try the concept page, self-check, and drill first.- External URLs (Jepsen, PostgreSQL docs, Cassandra docs) are validated and safe to open.
- Because this module is the operational base for Modules 4 and 5, at least one failover walkthrough and one sharding-design writeup are required, not optional enrichment.
Suggested Weekly Flow
| Day | Work |
|---|---|
| 1 | Concepts 1-3, write a one-page "why replicate / why partition" memo |
| 2 | Concepts 4-5, draw single-leader and multi-leader diagrams from memory |
| 3 | Concept 6, simulate a quorum W + R > N by hand with a node failure |
| 4 | Concepts 7-8, classify three real databases by log format and sync mode |
| 5 | Concept 9 and Practice 2 (replication-lag diagnosis) |
| 6 | Concepts 10-11, design partitioning for two contrasting workloads |
| 7 | Concept 12 and Practice 3 (sharding scheme design) |
| 8 | Concepts 13-14, map PostgreSQL, MongoDB, Cassandra to the concept map |
| 9 | Concept 15 and Practice 1 (failover simulation) |
| 10 | Practice 4 (Jepsen analysis), quiz, mistake-log cleanup |
Reference
If you need exact links into the local chunked books, use Reference and Selective Reading.
The Kafka-like Distributed Log tutorial is the canonical project for this module: partitioning, replication, consumer offsets, leader election. The Blockchain tutorial is a contrasting replication model — every node holds the full log; consensus via proof-of-work rather than leader election. See Build Your Own X overview.
Rich Learning Pages
Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread