Module 3: Replication & Partitioning

Primary text: Designing Data-Intensive Applications (Kleppmann), Chapters 5 and 6
Selective support: Database Internals (Petrov) Part II, Distributed Systems: Concepts and Design (Coulouris) for theoretical framing, Database System Concepts (Silberschatz) for the relational-replication view

This guide is the primary teacher. You do not need to read DDIA Chapters 5 and 6 end-to-end to complete this module. You do need to be operationally strong at reasoning about replication topologies, partitioning strategies, and failure modes before entering Module 4 (Transactions and Consistency).

Scope of This Module

This module is where storage starts to live on more than one machine. It is not about any specific database; it is about the patterns that every distributed database instantiates.

What it covers in depth:

why systems replicate (availability, throughput, geolocation) and why systems partition (scale beyond one node)
the CAP intuition: network partitions are a kind of fault you cannot opt out of, so you choose between consistency and availability when one happens
single-leader, multi-leader, and leaderless replication as three distinct algorithms with three distinct failure modes
statement-based, logical (row-based), and physical replication log formats and when each is used
synchronous vs asynchronous replication and what each promises about durability
replication lag and the application-visible anomalies it creates (stale reads, non-monotonic reads, causality violations)
range vs hash partitioning as two opposing answers to "how do we split the keyspace"
local vs global secondary indexes in partitioned data
rebalancing, shard splitting, and how to avoid hotspots
request routing: how a client finds the node that owns a key
failover and split-brain: what happens when the leader dies or is thought to have died

What it deliberately does not try to finish here:

formal consistency models (linearizability, causal consistency, serializability): Module 4
distributed transactions and 2PC: Module 4
consensus algorithms (Raft, Paxos) in depth: Module 5
time, clocks, and ordering in distributed systems: Module 5
stream processing and derived data pipelines: later semesters

This module is the operational substrate that Module 4 (consistency) and Module 5 (distributed-systems fundamentals) formalize.

Before You Start

Answer these closed-book before starting the main path:

Why do we replicate data? Give three different reasons, each with a distinct goal.
Why do we partition (shard) data? What breaks if a single node has to hold the entire dataset?
If a network partition cuts two datacenters apart, and the application requires strong consistency, what must happen to requests on the minority side?
Given a leader-based database that uses asynchronous replication, what does it mean for a client to "read its own writes"? Why might that fail?
You have 1 billion users keyed by user_id. Why is range partitioning by user_id likely a bad idea?

Diagnostic Interpretation

4-5 solid answers

You are ready for the full path.

2-3 solid answers

Continue, but expect extra time in Clusters 2 (topologies) and 4 (partitioning strategies).

0-1 solid answers

Revisit Module 1 (relational databases) and Module 2 (storage engines). This module assumes you know what a row, an index, and a log look like on a single node.

What This Module Is For

Replication and partitioning is the language every later distributed-systems claim is made in. Throughout the program you will repeatedly be asked:

will this design survive one node failing? two? the whole rack?
which consistency model does this replication topology give me?
if I add a shard, how much data moves, and does the system stay available while it moves?
given a workload, which key do I partition on, and why?
can I reason about this failure before it happens, or only after the postmortem?

This module builds the reasoning needed for:

transactions and consistency models (Module 4)
distributed-systems fundamentals, time, and consensus (Module 5)
stream processing, derived data, and event sourcing (later semesters)
real-world incident response on replicated databases

You are learning to read a replication and partitioning design and know in advance what it can and cannot survive.

Concept Map

How To Use This Module

Work in order. Clusters 1-3 are about replication; Cluster 4 is about partitioning; Cluster 5 is where both come together in real systems.

Cluster 1: Why Replicate and Partition

Order	Concept	Type	Focus
1	Replication Goals: Availability, Throughput, Geolocation	PRIMARY	Three distinct reasons to replicate and how they shape topology
2	Partitioning Goals: Scaling Beyond One Node	PRIMARY	Why data outgrows a single machine and what partitioning buys
3	The CAP Intuition: Partition-Tolerance Is Mandatory	PRIMARY	CAP as a runtime choice under a partition, not a menu of three

Cluster mastery check: Can you explain to a new teammate why a system replicates, why it partitions, and what the CAP theorem actually forces you to choose?

Cluster 2: Replication Topologies

Order	Concept	Type	Focus
4	Single-Leader Replication	PRIMARY	One leader, many followers; failover; read scaling
5	Multi-Leader Replication	PRIMARY	Conflicts, topologies (all-to-all, ring, star), convergence rules
6	Leaderless Replication (Dynamo-Style)	PRIMARY	Quorums `W + R > N`, read repair, hinted handoff

Cluster mastery check: Given an application (geo-distributed writes vs single-region reads vs mostly-offline mobile) can you pick one topology and defend the choice with a failure-mode argument?

Cluster 3: Replication Mechanics

Order	Concept	Type	Focus
7	Replication Log Formats: Statement, Logical, Physical	PRIMARY	Three log types and the non-determinism trap
8	Synchronous vs Asynchronous Replication	PRIMARY	Durability vs availability tradeoff, semi-sync compromise
9	Replication Lag: Read-Your-Writes and Monotonic Reads	PRIMARY	Stale reads, anomalies, client-visible guarantees

Cluster mastery check: Given a replication setup, can you predict which reads will be stale and what guarantees the application must add on top?

Cluster 4: Partitioning Strategies

Order	Concept	Type	Focus
10	Range vs Hash Partitioning	PRIMARY	Ordered scans vs uniform spread; picking a partition key
11	Secondary Indexes: Local vs Global	PRIMARY	Document-partitioned vs term-partitioned tradeoffs
12	Rebalancing, Shard Splitting, and Hotspots	PRIMARY	Adding nodes without moving everything; hotspot diagnosis

Cluster mastery check: Given a workload, can you pick a partition key, justify range or hash, and plan how rebalancing will work as the cluster grows?

Cluster 5: Practical Systems

Order	Concept	Type	Focus
13	Request Routing: Gossip, Service Discovery, Coordinators	SUPPORTING	Three routing approaches; ZooKeeper/etcd coordination
14	Case Studies: PostgreSQL, MongoDB, Cassandra	PRIMARY	Mapping three real systems to this module's concepts
15	Failover and Split-Brain	PRIMARY	Detecting dead leaders, quorum fencing, STONITH

Cluster mastery check: Can you walk through a failover scenario in PostgreSQL, MongoDB, and Cassandra and point at what would cause split-brain in each?

Then work these practice pages:

Order	Practice path	Focus
1	Replication Topologies Lab	Draw and walk through single-leader, multi-leader, and leaderless topologies under failure
2	Partitioning and Rebalancing Workshop	Pick partition keys, schemes, and rebalancing plans for contrasting workloads
3	Replication Anomalies Clinic	Diagnose stale-read bugs and prescribe the right session guarantee
4	Code Katas	Failover simulation, lag diagnosis, sharding design, and Jepsen report analysis

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.

Learning Objectives

By the end of this module you should be able to:

Name three distinct reasons to replicate and three distinct reasons to partition, and tell a designer which of the two a given problem calls for.
Explain the CAP theorem precisely: what partition-tolerance means, what the C/A tradeoff becomes during a partition, and why "pick 2 of 3" is misleading.
Draw the write and read paths for single-leader, multi-leader, and leaderless replication, and name their failure modes.
Read a replication log (statement, logical, physical) and explain which non-determinism each format protects against.
Predict whether a given replication setup is synchronous, semi-synchronous, or asynchronous, and what durability guarantee it gives on leader failure.
Given replication lag, name the anomalies a client can see and prescribe a guarantee (read-your-writes, monotonic reads, consistent prefix) that rules each out.
Choose range vs hash partitioning for a workload and justify the pick with a scan-access vs load-balance argument.
Explain when a local (document-partitioned) secondary index is enough and when you need a global (term-partitioned) one.
Describe at least two rebalancing strategies (fixed partitions, dynamic partitioning, proportional to nodes) and what each one avoids.
Walk through a leader failover step by step, name the ways split-brain can happen, and describe how fencing, quorum-based election, or STONITH prevents each.

Outputs

a replication-topology notebook: for at least 6 real or described systems, name the topology, the log format, the sync mode, and the expected failure modes
a partition-design log: for at least 6 workloads, record partition key, strategy (range/hash), rebalancing plan, and hotspot risks
at least 3 worked failover scenarios with a timeline showing writes before, during, and after the failover
at least 3 read-anomaly diagnoses naming the client-visible guarantee needed and the server-side mechanism that provides it
a written summary of at least one real Jepsen report, identifying the invariant tested, the violation found, and the root cause
a case-study comparison of PostgreSQL streaming replication, MongoDB replica sets, and Cassandra rings, aligned to this module's concept map
a mistake log naming at least 8 recurring replication/partitioning design mistakes (async replication to the leader's own datacenter only, partition key is timestamp, sync replication with no follower, no fencing on failover, etc.)
a short memo on at least two incidents (real or hypothetical) where the topology or partition strategy caused the outage

Completion Standard

You have completed Module 3 when all of these are true:

you can classify a replication setup by topology, log format, and sync mode without looking it up
you can state what partition-tolerance forces a system to choose during a network fault
you can pick a partition key for a new workload and defend the choice in one paragraph
you can walk through a leader failover and name every place split-brain or data loss can enter
you can diagnose a stale-read bug to a specific missing guarantee and prescribe the fix
you have read at least one real Jepsen report end-to-end and can restate the violation in this module's vocabulary
you can map PostgreSQL, MongoDB, and Cassandra to the same concept map and call out the places they differ

If the replication diagram "looks right" but you cannot say what it survives and what it does not, the module is not complete.

Reading Policy

Concept pages are the main path.
Local book chunks (mostly DDIA Chapters 5 and 6) are selective reinforcement, not a second syllabus.
Read This Only If Stuck means try the concept page, self-check, and drill first.
External URLs (Jepsen, PostgreSQL docs, Cassandra docs) are validated and safe to open.
Because this module is the operational base for Modules 4 and 5, at least one failover walkthrough and one sharding-design writeup are required, not optional enrichment.

Suggested Weekly Flow

Day	Work
1	Concepts 1-3, write a one-page "why replicate / why partition" memo
2	Concepts 4-5, draw single-leader and multi-leader diagrams from memory
3	Concept 6, simulate a quorum `W + R > N` by hand with a node failure
4	Concepts 7-8, classify three real databases by log format and sync mode
5	Concept 9 and Practice 2 (replication-lag diagnosis)
6	Concepts 10-11, design partitioning for two contrasting workloads
7	Concept 12 and Practice 3 (sharding scheme design)
8	Concepts 13-14, map PostgreSQL, MongoDB, Cassandra to the concept map
9	Concept 15 and Practice 1 (failover simulation)
10	Practice 4 (Jepsen analysis), quiz, mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.

Build Your Own X — elective

The Kafka-like Distributed Log tutorial is the canonical project for this module: partitioning, replication, consumer offsets, leader election. The Blockchain tutorial is a contrasting replication model — every node holds the full log; consensus via proof-of-work rather than leader election. See Build Your Own X overview.

Rich Learning Pages

Scope of This Module​

Before You Start​

Diagnostic Interpretation​

What This Module Is For​

Concept Map​

How To Use This Module​

Cluster 1: Why Replicate and Partition​

Cluster 2: Replication Topologies​

Cluster 3: Replication Mechanics​

Cluster 4: Partitioning Strategies​

Cluster 5: Practical Systems​

Learning Objectives​

Outputs​

Completion Standard​

Reading Policy​

Suggested Weekly Flow​

Reference​

Rich Learning Pages​