Range Partitioning vs Hash Partitioning
What This Concept Is
Once you have committed to partitioning, you must pick a scheme that maps every key to a partition. Two schemes cover the vast majority of real systems:
- Range partitioning: assign contiguous key ranges to partitions (
A-F -> P0, G-M -> P1, N-S -> P2, T-Z -> P3). Preserves ordering. Range scans are local. Used by HBase, Bigtable, CockroachDB, and for time-series data in many systems. - Hash partitioning: hash the key with a well-distributed function and assign to partitions by
hash(key) mod Nor by ranges in the hash space. Destroys ordering but guarantees uniform load. Used by Cassandra, DynamoDB (under the hood), MongoDB (hashed shard keys), Redis Cluster.
Both have a third sibling worth knowing:
- Consistent hashing: hash keys and nodes onto the same ring; each key is owned by the next node clockwise. Minimizes data movement on node add/remove. Used by Cassandra, Riak, Memcached.
Range: Hash:
P0 [A..F] P1 [G..M] P2 [N..S] hash('alice') % 4 = 1 -> P1
P3 [T..Z] hash('bob') % 4 = 3 -> P3
hash('carol') % 4 = 0 -> P0
scans ordered by key within keys appear scattered; scans
each partition; cross-partition by original key require fan-out
scans possible
Why It Matters Here
This choice dictates:
- Whether range scans are cheap (range wins) or a fan-out query across every shard (hash loses).
- Whether writes can avalanche onto a single partition (range loses if the key is monotonic) or spread evenly (hash wins).
- How rebalancing works: range splits partitions in half when they grow; hash requires a consistent-hashing scheme to avoid rehashing everything.
Almost every "my shard is on fire" incident is either "we hash-partitioned and it's a single hot key" or "we range-partitioned by timestamp and writes go to the last shard."
Concrete Example
A time-series metrics store stores rows keyed by (metric_name, timestamp).
- Range by timestamp: current writes all land on the newest partition. That partition is a hotspot 100% of the time; old partitions are cold. Terrible.
- Hash by
(metric_name, timestamp): writes spread uniformly. Great for ingest. Terrible for "show me the last 1 hour of CPU" because that query now fans out to every partition. - Range by
metric_name, then range bytimestampwithin each metric: a typical composite scheme. Per-metric scans are local. Write load spreads across metrics, but a single popular metric can still hotspot. - Salted key (range on
hash_prefix + metric_name + timestamp): keeps the range ordering within a metric but adds artificial sharding.
Cassandra PRIMARY KEY ((metric_name), timestamp) with a hashed partition key is the canonical composite: hash on metric, range on timestamp within the metric.
Common Confusion / Misconception
"Hash partitioning kills hotspots forever." Only if the key space itself has no hotspots. hash(celebrity_user_id) still sends every celebrity-related write to one node. Hashing spreads keys, not load. Celebrity hotspots need application-level salting.
"Range partitioning is old." It is the right answer for ordered data (time series, log indexes, ranges of IDs). HBase and CockroachDB are built on range partitioning precisely because their workloads are range-heavy.
"Consistent hashing eliminates rebalancing." Only minimizes it. Adding one node still moves 1/N of the data. Node failure still requires moving the dead node's keys to living nodes.
How To Use It
Decision tree:
- Does the application need range scans on the key (time-series, alphabetical listings, lexicographic prefixes)?
- Yes: range partitioning.
- No: consider hash.
- Is the natural write key monotonic (timestamp, auto-increment ID)?
- Yes: hash on it, or salt it, or range on a different key.
- Are some keys vastly hotter than others (celebrities, admin tenants, top SKUs)?
- Yes: add salt or use composite keys to spread that one key across partitions.
- Will the cluster scale dynamically (add/remove nodes)?
- Yes: prefer consistent hashing over
hash mod N.
- Yes: prefer consistent hashing over
Check Yourself
- Why is "range partition by timestamp" almost always wrong for high-ingest time-series?
- Why is "hash partition by user_id" insufficient for a social network with celebrity users?
- What problem does consistent hashing solve that
mod Ndoes not? - What query pattern does range partitioning make fast that hash partitioning makes slow?
Mini Drill or Application
For each workload, pick a partition key and scheme, and name one hotspot risk:
- A log-analytics system ingesting 200k events/sec from 50 services.
- A B2B CRM with 1,000 tenants, where the top 3 tenants are each 100x the median.
- A Twitter-like social feed keyed by
(user_id, post_id). - A financial ticker storing per-symbol quotes for 8,000 symbols, mostly reads are "last 5 minutes for symbol X."
- A public URL-shortening service where the key is the short code.