Rebalancing, Shard Splitting, and Hotspots
What This Concept Is
A partitioned cluster is never in a permanent state. Nodes are added, nodes fail, data grows unevenly. Rebalancing is the process of moving partition ownership between nodes so that load and storage stay roughly even. The three canonical strategies:
- Fixed number of partitions (many small partitions per node): pick a large partition count up front (say 1000); assign a few dozen to each node. To add a node, move some partitions to it. To remove, re-assign its partitions. Used by Couchbase, Voldemort, Elasticsearch. Simple and predictable.
- Dynamic partitioning (split when too large): start with one (or a few) partitions. When a partition exceeds a threshold, split it in half. When it falls below, merge it. Used by HBase, MongoDB (chunk splits), CockroachDB (range splits), Bigtable.
- Partitioning proportional to nodes: one fixed partitions-per-node ratio. Adding a node creates new partitions by splitting existing ones. Cassandra's original design.
Hotspots are partitions that receive disproportionate traffic -- either because a single key is hot (celebrity user, admin tenant) or because the partitioning scheme clumps writes (time-series on a timestamp key). Rebalancing can't fix a single-key hotspot; the partitioning scheme itself must change.
Fixed partitions: Dynamic:
P001..P1000 (pre-allocated) one partition
-> splits when > threshold
Node A: P001..P250 -> merges when < threshold
Node B: P251..P500
Node C: P501..P750 Partition tree grows/shrinks
Node D: P751..P1000 with data volume
Why It Matters Here
Rebalancing determines:
- Whether adding capacity is a planned operation or a midnight firefight.
- How much data moves when one node dies (ideally
1/Nof the dataset; pathologically, all of it). - Whether hotspots are detectable before they take down the cluster.
- Whether the operation is "manual" (DBA initiates) or "automatic" (system decides) -- both have failure modes.
A design that cannot be rebalanced online is a design that cannot grow.
Concrete Example
A Cassandra cluster with 6 nodes using virtual nodes (vnodes).
- Initially: each node owns 256 vnodes; data is roughly uniform.
- Add a 7th node: Cassandra assigns 256 vnodes to it, stealing roughly
1/7of the vnodes from the existing 6 nodes. Data streams from the old owners to the new node over hours. - During streaming, reads and writes continue; the old owner and new owner each serve their portion.
- If a key suddenly becomes hot (a celebrity post), no amount of vnode redistribution helps -- every replica for that key now sees the same surge. The fix is application-level: cache the hot key, CDN it, or add per-key sharding (distinct fanout partitions).
Contrast with HBase (dynamic splits): a partition ("region") that grows past 10 GB is split at the median row key into two regions. The two halves are reassigned to different nodes.
Common Confusion / Misconception
"hash mod N is fine for rebalancing." It is terrible. Changing N moves almost every key. Consistent hashing or fixed-partition schemes move only 1/N of keys.
"Auto-rebalancing is always safer." Sometimes dangerous: a failed node that comes back briefly can trigger the system to re-replicate its data to new nodes, then trigger another rebalance when the node returns. Large systems often default to operator-confirmed rebalance.
"Hotspots only happen in poorly designed systems." They happen in every production system. The New York Times front page, Justin Bieber's Twitter account, Black Friday on Amazon -- the shape of human attention is power-law, and your database reflects it.
How To Use It
When designing for rebalancing, answer:
- What is my partition count? Is it fixed or dynamic?
- How much data moves when I add one node? One rack? One AZ?
- Is rebalancing automatic, semi-automatic, or manual?
- How do I detect a hotspot in production? (partition-level metrics on requests/sec, bytes written, CPU)
- What is my mitigation playbook when one partition becomes hot? Cache, salt the key, split the hot partition, throttle the hot client?
For hotspot detection, add per-partition dashboards:
writes/sec per partition:
P0 #### 100 w/s
P1 #### 100 w/s
P2 ################### 5000 w/s <-- hotspot
P3 #### 100 w/s
Check Yourself
- Why is
hash(key) mod Na bad rebalancing scheme compared to consistent hashing or fixed partitions? - When dynamic partitioning splits a partition in half at the median key, where does the data physically go?
- Why can't any rebalancing scheme fix a single-key hotspot?
- Give two reasons an operator might prefer manual to automatic rebalancing.
Mini Drill or Application
For each incident, diagnose whether rebalancing helps and propose a concrete fix:
- One shard of a 20-shard MongoDB cluster is 10x the size of the others. Data model: sharded on
tenant_id, one tenant dominates. - A Cassandra cluster is adding nodes but read latency is getting worse during the rebalance.
- A range-partitioned time-series store has all writes landing on the "latest" partition.
- A hash-partitioned key-value store sees one partition at 95% CPU; the rest are idle. The application tells you it's "just the homepage."
- A fixed-partitions cluster wants to go from 3 nodes to 5. What happens?