LSM-Trees: SSTables, Memtables, Compaction
What This Concept Is
A Log-Structured Merge tree (LSM-tree) is a write-optimized index. It treats writes as an ever-growing log and periodically reorganizes that log in the background. It has three pieces:
- Memtable: an in-memory sorted structure (usually a skiplist or balanced tree). New inserts land here. Writes are fast because they never touch disk in the foreground.
- Write-ahead log (WAL): every memtable write is first appended to a WAL so a crash does not lose data. The WAL is a pure sequential file.
- SSTable (Sorted String Table): when the memtable gets large enough, it is flushed to disk as an immutable, sorted file of key-value pairs. Once written, SSTables are never modified, only replaced.
Reads have to look in multiple places: the memtable first, then progressively older SSTables. To avoid the number of SSTables growing forever, the engine runs compaction in the background, merging overlapping SSTables into newer, larger ones and discarding overwritten or deleted entries.
Two common compaction strategies:
- Leveled (RocksDB default): SSTables are organized in levels
L0, L1, ..., Ln. Each level has a size budget, usually10xthe previous. When a level overflows, the engine picks a file and merges it into the next level. - Size-tiered (Cassandra default): SSTables of similar size are merged together, producing a smaller number of larger files.
Concrete Example -- Compaction Timeline
Each row below shows state after an event. M is the memtable; Li is level i. SST:<range> is a single SSTable.
T0 write batch 1 M=[a..d] L0=[] L1=[]
T1 write batch 2 M=[a..g] L0=[] L1=[]
T2 memtable full -> flush M=[] L0=[SST:a..g] L1=[]
T3 write batch 3 M=[h..k] L0=[SST:a..g] L1=[]
T4 flush M=[] L0=[SST:a..g,
SST:h..k] L1=[]
T5 L0 overflow -> compact to L1 M=[] L0=[] L1=[SST:a..k]
T6 more writes + flushes M=[...] L0=[SST:b..e,
SST:m..p] L1=[SST:a..k]
T7 L1 overflow -> compact L1+L0 M=[] L0=[] L1=[SST:a..p]
(rewrites SST:a..k merged with the new L0 files)
Notice two things. (1) A single logical key b may be rewritten many times across its life, once per level it moves through -- this is write amplification. (2) Between flushes and compactions, the same key can live in multiple SSTables at once, so reads must check each until a "freshest" version is found or all are exhausted.
Why It Matters Here
LSM-trees are the default index structure behind RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, and a growing list of key-value and NewSQL engines. They matter because:
- Writes are sequential and batched. A B+-tree insert does
O(log n)random writes per row; an LSM absorbs many writes into a memtable flush that writes one sequential SSTable. - Reads are slower on average than a B+-tree but can be made competitive with Bloom filters and caching.
- Space and write amplification can be tuned: leveled compaction lowers space amplification at the cost of more write amplification, and tiered compaction does the reverse.
This is the fundamental trade that divides modern storage engines.
The Three Amplifications
Every LSM design trades three numbers:
- Read amplification: reads needed per logical read (number of SSTables + memtable consulted). Bloom filters and per-SSTable caches reduce this.
- Write amplification: bytes written to disk per byte of logical write. Dominated by compaction.
- Space amplification: bytes on disk per byte of logical data. Caused by obsolete versions and tombstones not yet reclaimed.
A pure append-only log has 1x write amp but unbounded space and read amp. Leveled LSM typically runs at write amp 10-30x, space amp around 1.1-1.5x, read amp ~level count before Bloom filters.
Common Confusion / Misconception
"LSM-trees never have random I/O." They have random reads (you may hit any SSTable to find a key). The argument is that writes are sequential, and the read cost is softened by Bloom filters and block caches. Write amp, however, is very real, and sustained heavy writes can saturate compaction and cause write stalls.
"Compaction is just cleanup." It is also how deletes take effect. A delete in an LSM is a tombstone -- a marker that says "this key is gone." The tombstone does not free the old value until a compaction merges through every SSTable that still contains it. Until then the data is logically gone but physically present, which matters for disk usage and for secure-delete semantics.
How To Use It
When reasoning about an LSM workload:
- Classify writes as point writes, batched writes, or blind overwrites.
- Estimate flush frequency from memtable size and write rate.
- Sketch the compaction tree shape (leveled vs tiered, level sizes).
- Decide which of read / write / space amp you are willing to pay; they are a zero-sum triangle.
- Add Bloom filters to any read path that spans many SSTables (see next concept).
Check Yourself
- Why are sequential writes cheaper on real hardware, and how does an LSM exploit that?
- Why do deletes not immediately free space, and when does space actually come back?
- Why is write amplification sometimes a feature (durability, compression) and sometimes a pathology?
Mini Drill or Application
Given: memtable size 64 MB, sustained write rate 10 MB/s, L0 compaction trigger at 4 SSTables, L1 size budget 1 GB, 10x growth per level.
- How often does the memtable flush?
- How many L0 SSTables accumulate before a compaction to L1?
- Roughly what is the write amplification from a logical byte written to a byte finally resting in L3?
- What happens to sustained write throughput if compaction is CPU-bound and cannot keep up?