Module 3: Replication & Partitioning

Primary text: Designing Data-Intensive Applications (Kleppmann), Chapters 5 and 6
Selective support: Database Internals (Petrov) Part II, Distributed Systems: Concepts and Design (Coulouris) for theoretical framing, Database System Concepts (Silberschatz) for the relational-replication view

This guide is the primary teacher. You do not need to read DDIA Chapters 5 and 6 end-to-end to complete this module. You do need to be operationally strong at reasoning about replication topologies, partitioning strategies, and failure modes before entering Module 4 (Transactions and Consistency).


Scope of This Module

This module is where storage starts to live on more than one machine. It is not about any specific database; it is about the patterns that every distributed database instantiates.

What it covers in depth:

  • why systems replicate (availability, throughput, geolocation) and why systems partition (scale beyond one node)
  • the CAP intuition: network partitions are a kind of fault you cannot opt out of, so you choose between consistency and availability when one happens
  • single-leader, multi-leader, and leaderless replication as three distinct algorithms with three distinct failure modes
  • statement-based, logical (row-based), and physical replication log formats and when each is used
  • synchronous vs asynchronous replication and what each promises about durability
  • replication lag and the application-visible anomalies it creates (stale reads, non-monotonic reads, causality violations)
  • range vs hash partitioning as two opposing answers to "how do we split the keyspace"
  • local vs global secondary indexes in partitioned data
  • rebalancing, shard splitting, and how to avoid hotspots
  • request routing: how a client finds the node that owns a key
  • failover and split-brain: what happens when the leader dies or is thought to have died

What it deliberately does not try to finish here:

  • formal consistency models (linearizability, causal consistency, serializability): Module 4
  • distributed transactions and 2PC: Module 4
  • consensus algorithms (Raft, Paxos) in depth: Module 5
  • time, clocks, and ordering in distributed systems: Module 5
  • stream processing and derived data pipelines: later semesters

This module is the operational substrate that Module 4 (consistency) and Module 5 (distributed-systems fundamentals) formalize.


Before You Start

Answer these closed-book before starting the main path:

  1. Why do we replicate data? Give three different reasons, each with a distinct goal.
  2. Why do we partition (shard) data? What breaks if a single node has to hold the entire dataset?
  3. If a network partition cuts two datacenters apart, and the application requires strong consistency, what must happen to requests on the minority side?
  4. Given a leader-based database that uses asynchronous replication, what does it mean for a client to "read its own writes"? Why might that fail?
  5. You have 1 billion users keyed by user_id. Why is range partitioning by user_id likely a bad idea?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path.

2-3 solid answers

  • Continue, but expect extra time in Clusters 2 (topologies) and 4 (partitioning strategies).

0-1 solid answers

  • Revisit Module 1 (relational databases) and Module 2 (storage engines). This module assumes you know what a row, an index, and a log look like on a single node.

What This Module Is For

Replication and partitioning are the language in which every later distributed-systems claim is made. Throughout the program you will repeatedly be asked:

  • will this design survive one node failing? two? the whole rack?
  • which consistency model does this replication topology give me?
  • if I add a shard, how much data moves, and does the system stay available while it moves?
  • given a workload, which key do I partition on, and why?
  • can I reason about this failure before it happens, or only after the postmortem?

This module builds the reasoning needed for:

  • transactions and consistency models (Module 4)
  • distributed-systems fundamentals, time, and consensus (Module 5)
  • stream processing, derived data, and event sourcing (later semesters)
  • real-world incident response on replicated databases

You are learning to read a replication and partitioning design and know in advance what it can and cannot survive.


Concept Map


How To Use This Module

Work in order. Clusters 1-3 are about replication; Cluster 4 is about partitioning; Cluster 5 is where both come together in real systems.

Cluster 1: Why Replicate and Partition

Order | Concept | Type | Focus
1 | Replication Goals: Availability, Throughput, Geolocation | PRIMARY | Three distinct reasons to replicate and how they shape topology
2 | Partitioning Goals: Scaling Beyond One Node | PRIMARY | Why data outgrows a single machine and what partitioning buys
3 | The CAP Intuition: Partition-Tolerance Is Mandatory | PRIMARY | CAP as a runtime choice under a partition, not a menu of three

Cluster mastery check: Can you explain to a new teammate why a system replicates, why it partitions, and what the CAP theorem actually forces you to choose?
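The runtime choice that CAP forces can be sketched in a few lines. This is a minimal illustration, not any particular database's behavior: it assumes a hypothetical cluster in which a node counts how many members it can still reach (itself included) and treats a strict majority as the condition for answering consistently. The function names and return strings are invented for this example.

```python
def has_majority(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a partition can reach a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def handle_request(reachable_nodes: int, cluster_size: int, prefer_consistency: bool) -> str:
    """Decide what a node does with a request while a partition is in progress.

    prefer_consistency=True  -> the minority side refuses (gives up availability)
    prefer_consistency=False -> the minority side answers anyway (gives up consistency)
    """
    if has_majority(reachable_nodes, cluster_size):
        return "serve"
    return "reject" if prefer_consistency else "serve-possibly-stale"


# A 5-node cluster split 3 / 2 by a network fault:
print(handle_request(3, 5, prefer_consistency=True))   # serve
print(handle_request(2, 5, prefer_consistency=True))   # reject
print(handle_request(2, 5, prefer_consistency=False))  # serve-possibly-stale
```

Note that the choice only matters on the minority side and only while the partition lasts, which is why "pick 2 of 3" misdescribes it.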

Cluster 2: Replication Topologies

Order | Concept | Type | Focus
4 | Single-Leader Replication | PRIMARY | One leader, many followers; failover; read scaling
5 | Multi-Leader Replication | PRIMARY | Conflicts, topologies (all-to-all, ring, star), convergence rules
6 | Leaderless Replication (Dynamo-Style) | PRIMARY | Quorums W + R > N, read repair, hinted handoff

Cluster mastery check: Given an application (geo-distributed writes, single-region reads, or mostly-offline mobile), can you pick one topology and defend the choice with a failure-mode argument?
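The quorum condition W + R > N from Concept 6 can be checked by brute force. The sketch below is an illustration under simplifying assumptions (strict quorums, no concurrent writes, no sloppy quorums or hinted handoff); the function names are made up for this example. It enumerates every possible set of W replicas that acknowledged a write and every set of R replicas a read might contact, and confirms they always share at least one replica exactly when W + R > N.

```python
from itertools import combinations


def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """The arithmetic condition: read and write quorums must overlap."""
    return w + r > n


def every_read_meets_every_write(n: int, w: int, r: int) -> bool:
    """Brute force: does every R-sized read set share a replica with every W-sized write set?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that can be down while both reads and writes can still reach their quorum."""
    return n - max(w, r)


print(quorum_overlaps(3, 2, 2), every_read_meets_every_write(3, 2, 2))  # True True
print(quorum_overlaps(3, 1, 1), every_read_meets_every_write(3, 1, 1))  # False False
print(tolerated_failures(5, 3, 3))  # 2
```

Overlap only guarantees that a read contacts at least one replica holding the latest acknowledged write. It does not by itself settle concurrent writes or what happens during failures, which is where read repair and hinted handoff come in.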

Cluster 3: Replication Mechanics

Order | Concept | Type | Focus
7 | Replication Log Formats: Statement, Logical, Physical | PRIMARY | Three log types and the non-determinism trap
8 | Synchronous vs Asynchronous Replication | PRIMARY | Durability vs availability tradeoff, semi-sync compromise
9 | Replication Lag: Read-Your-Writes and Monotonic Reads | PRIMARY | Stale reads, anomalies, client-visible guarantees

Cluster mastery check: Given a replication setup, can you predict which reads will be stale and what guarantees the application must add on top?
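One way an application can add read-your-writes and monotonic reads on top of asynchronous replication is to remember a log position per session and only read from replicas that have caught up to it. The sketch below assumes a hypothetical setup where every replica exposes an integer applied_position (how far into the leader's log it has applied); Replica, Session, and choose_replica are illustrative names, not a real driver API.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_position: int  # how far into the leader's log this node has applied


@dataclass
class Session:
    last_write_position: int = 0  # log position of this client's most recent write
    last_read_position: int = 0   # newest log position this client has already observed

    def required_position(self) -> int:
        # Read-your-writes needs >= last write; monotonic reads needs >= last read.
        return max(self.last_write_position, self.last_read_position)


def choose_replica(session: Session, followers: list, leader: Replica) -> Replica:
    """Pick the first follower that is fresh enough for this session, else the leader."""
    needed = session.required_position()
    chosen = next((f for f in followers if f.applied_position >= needed), leader)
    session.last_read_position = max(session.last_read_position, chosen.applied_position)
    return chosen


leader = Replica("leader", applied_position=10)
followers = [Replica("f1", applied_position=7), Replica("f2", applied_position=10)]

session = Session(last_write_position=9)  # the client just wrote at position 9
print(choose_replica(session, followers, leader).name)  # f2 (f1 has not applied the write yet)
```

Falling back to the leader keeps the guarantee but moves read load back onto it, so the lag distribution decides how much read scaling survives.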

Cluster 4: Partitioning Strategies

Order | Concept | Type | Focus
10 | Range vs Hash Partitioning | PRIMARY | Ordered scans vs uniform spread; picking a partition key
11 | Secondary Indexes: Local vs Global | PRIMARY | Document-partitioned vs term-partitioned tradeoffs
12 | Rebalancing, Shard Splitting, and Hotspots | PRIMARY | Adding nodes without moving everything; hotspot diagnosis

Cluster mastery check: Given a workload, can you pick a partition key, justify range or hash, and plan how rebalancing will work as the cluster grows?
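The two partitioning schemes, and the reason naive "hash mod N" rebalances badly, can be compared in a short sketch. Everything here is illustrative: the split points, the choice of SHA-256 as a stable hash, the 64-partition count, and the function names are assumptions for the example, not a specific system's scheme.

```python
import bisect
import hashlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() of a str is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def range_partition(key: str, split_points: list) -> int:
    """Sorted split points; partition i holds keys in [split_points[i-1], split_points[i])."""
    return bisect.bisect_right(split_points, key)


def hash_partition(key: str, num_partitions: int) -> int:
    """Uniform spread, but adjacent keys land on unrelated partitions."""
    return stable_hash(key) % num_partitions


def moved_fraction_mod_n(keys: list, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys that change owner if the node is chosen as hash mod node count."""
    moved = sum(1 for k in keys if hash_partition(k, old_nodes) != hash_partition(k, new_nodes))
    return moved / len(keys)


def add_node_fixed_partitions(assignment: list, new_node: int) -> list:
    """assignment[p] is the node owning partition p; the new node takes a fair share."""
    target = len(assignment) // (len(set(assignment)) + 1)
    counts = Counter(assignment)
    result = list(assignment)
    taken = 0
    for partition, owner in enumerate(assignment):
        if taken == target:
            break
        if counts[owner] > target:  # only take from nodes holding more than the fair share
            result[partition] = new_node
            counts[owner] -= 1
            taken += 1
    return result


print(range_partition("house", ["h", "p"]))  # 1: "h" <= "house" < "p"

keys = [f"user-{i}" for i in range(10_000)]
print(moved_fraction_mod_n(keys, 4, 5))  # most keys move when going from 4 to 5 nodes

before = [p % 4 for p in range(64)]  # 64 fixed partitions spread over 4 nodes
after = add_node_fixed_partitions(before, new_node=4)
print(sum(b != a for b, a in zip(before, after)))  # 12 of 64 partitions move
```

With a uniform hash, a key keeps its owner under mod-N only when hash mod 4 equals hash mod 5, so roughly 80% of keys move when a fifth node is added. With fixed partitions, a key's partition never changes and only whole partitions are reassigned, which is what "adding nodes without moving everything" refers to.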

Cluster 5: Practical Systems

Order | Concept | Type | Focus
13 | Request Routing: Gossip, Service Discovery, Coordinators | SUPPORTING | Three routing approaches; ZooKeeper/etcd coordination
14 | Case Studies: PostgreSQL, MongoDB, Cassandra | PRIMARY | Mapping three real systems to this module's concepts
15 | Failover and Split-Brain | PRIMARY | Detecting dead leaders, quorum fencing, STONITH

Cluster mastery check: Can you walk through a failover scenario in PostgreSQL, MongoDB, and Cassandra and point at what would cause split-brain in each?
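One form of fencing can be sketched with a monotonically increasing token: each newly elected leader receives a higher number, and the storage layer refuses writes carrying a token lower than the highest it has seen. This is a simplified illustration under the assumption that some election mechanism hands out the tokens; FencedStore and its methods are hypothetical names, not an API from PostgreSQL, MongoDB, or Cassandra.

```python
class FencedStore:
    """Storage that rejects writes from leaders holding an outdated fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # a newer leader exists; the sender is a stale leader
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()

# Leader A is elected with token 1 and writes normally.
print(store.write(1, "balance", "100"))  # True

# A is wrongly declared dead (for example, after a long pause); B is elected with token 2.
print(store.write(2, "balance", "150"))  # True

# A resumes, still believing it is the leader. Its write is fenced off.
print(store.write(1, "balance", "90"))   # False
print(store.data["balance"])             # 150
```

The check lives in the storage layer rather than in the leader because a stale leader cannot be trusted to know it is stale. STONITH attacks the same problem from the other side by forcibly powering off the old leader.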

Then work these practice pages:

Order | Practice path | Focus
1 | Replication Topologies Lab | Draw and walk through single-leader, multi-leader, and leaderless topologies under failure
2 | Partitioning and Rebalancing Workshop | Pick partition keys, schemes, and rebalancing plans for contrasting workloads
3 | Replication Anomalies Clinic | Diagnose stale-read bugs and prescribe the right session guarantee
4 | Code Katas | Failover simulation, lag diagnosis, sharding design, and Jepsen report analysis

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Name three distinct reasons to replicate and three distinct reasons to partition, and tell a designer which of the two a given problem calls for.
  2. Explain the CAP theorem precisely: what partition-tolerance means, what the C/A tradeoff becomes during a partition, and why "pick 2 of 3" is misleading.
  3. Draw the write and read paths for single-leader, multi-leader, and leaderless replication, and name their failure modes.
  4. Read a replication log (statement, logical, physical) and explain which formats are exposed to non-determinism and which avoid it by shipping computed values.
  5. Predict whether a given replication setup is synchronous, semi-synchronous, or asynchronous, and what durability guarantee it gives on leader failure.
  6. Given replication lag, name the anomalies a client can see and prescribe a guarantee (read-your-writes, monotonic reads, consistent prefix) that rules each out.
  7. Choose range vs hash partitioning for a workload and justify the pick with a scan-access vs load-balance argument.
  8. Explain when a local (document-partitioned) secondary index is enough and when you need a global (term-partitioned) one.
  9. Describe at least two rebalancing strategies (fixed number of partitions, dynamic partitioning, partitions proportional to node count) and what each one avoids.
  10. Walk through a leader failover step by step, name the ways split-brain can happen, and describe how fencing, quorum-based election, or STONITH prevents each.
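For objective 4, the non-determinism trap in statement-based logs can be reproduced in a toy model. The sketch below is an assumption-laden illustration: a "statement" is modeled as a Python callable that may call a random-number generator, and the two replay functions are invented for this example. It shows a follower diverging when it re-executes a statement and converging when it applies the row values the leader already computed. Physical (byte-level) logs are not modeled here.

```python
import random


def replay_statement_log(log: list, rng: random.Random) -> dict:
    """Statement-based: the node re-executes each statement, including its function calls."""
    db = {}
    for key, expression in log:
        db[key] = expression(rng)
    return db


def replay_row_log(log: list) -> dict:
    """Logical (row-based): the node applies the values the leader already computed."""
    db = {}
    for key, value in log:
        db[key] = value
    return db


# Each node has its own source of randomness, as it would for RAND() or NOW().
leader_rng = random.Random(1)
follower_rng = random.Random(2)

statement_log = [("session_token", lambda rng: rng.random())]

leader_db = replay_statement_log(statement_log, leader_rng)
follower_db_statement = replay_statement_log(statement_log, follower_rng)

row_log = list(leader_db.items())  # the leader ships the resulting rows instead
follower_db_row = replay_row_log(row_log)

print(leader_db == follower_db_statement)  # False: each node drew its own random value
print(leader_db == follower_db_row)        # True: the follower copied the leader's value
```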

Outputs

  • a replication-topology notebook: for at least 6 real or described systems, name the topology, the log format, the sync mode, and the expected failure modes
  • a partition-design log: for at least 6 workloads, record partition key, strategy (range/hash), rebalancing plan, and hotspot risks
  • at least 3 worked failover scenarios with a timeline showing writes before, during, and after the failover
  • at least 3 read-anomaly diagnoses naming the client-visible guarantee needed and the server-side mechanism that provides it
  • a written summary of at least one real Jepsen report, identifying the invariant tested, the violation found, and the root cause
  • a case-study comparison of PostgreSQL streaming replication, MongoDB replica sets, and Cassandra rings, aligned to this module's concept map
  • a mistake log naming at least 8 recurring replication/partitioning design mistakes (async replication to the leader's own datacenter only, partition key is timestamp, sync replication with no follower, no fencing on failover, etc.)
  • a short memo on at least two incidents (real or hypothetical) where the topology or partition strategy caused the outage

Completion Standard

You have completed Module 3 when all of these are true:

  • you can classify a replication setup by topology, log format, and sync mode without looking it up
  • you can state what partition-tolerance forces a system to choose during a network fault
  • you can pick a partition key for a new workload and defend the choice in one paragraph
  • you can walk through a leader failover and name every place split-brain or data loss can enter
  • you can diagnose a stale-read bug to a specific missing guarantee and prescribe the fix
  • you have read at least one real Jepsen report end-to-end and can restate the violation in this module's vocabulary
  • you can map PostgreSQL, MongoDB, and Cassandra to the same concept map and call out the places they differ

If the replication diagram "looks right" but you cannot say what it survives and what it does not, the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • Local book chunks (mostly DDIA Chapters 5 and 6) are selective reinforcement, not a second syllabus.
  • "Read This Only If Stuck" means try the concept page, self-check, and drill first.
  • External URLs (Jepsen, PostgreSQL docs, Cassandra docs) are validated and safe to open.
  • Because this module is the operational base for Modules 4 and 5, at least one failover walkthrough and one sharding-design writeup are required, not optional enrichment.

Suggested Weekly Flow

Day | Work
1 | Concepts 1-3, write a one-page "why replicate / why partition" memo
2 | Concepts 4-5, draw single-leader and multi-leader diagrams from memory
3 | Concept 6, simulate a quorum W + R > N by hand with a node failure
4 | Concepts 7-8, classify three real databases by log format and sync mode
5 | Concept 9 and Practice 2 (replication-lag diagnosis)
6 | Concepts 10-11, design partitioning for two contrasting workloads
7 | Concept 12 and Practice 3 (sharding scheme design)
8 | Concepts 13-14, map PostgreSQL, MongoDB, Cassandra to the concept map
9 | Concept 15 and Practice 1 (failover simulation)
10 | Practice 4 (Jepsen analysis), quiz, mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Build Your Own X — elective

The Kafka-like Distributed Log tutorial is the canonical project for this module: it exercises partitioning, replication, consumer offsets, and leader election. The Blockchain tutorial offers a contrasting replication model: every node holds the full log, and consensus comes from proof-of-work rather than leader election. See Build Your Own X overview.

Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread