Cache Organization: Lines, Sets, Associativity, and Replacement

What This Concept Is

A hardware cache is an associative lookup over lines (fixed-size blocks of memory) keyed by a tag derived from the physical address. To check whether an address hits, the cache splits the address into three fields:

|  tag  |  set index  |  offset  |

offset -- which byte within the line (log2(line size) bits; usually 6 bits for a 64-byte line)
set index -- which set of lines to probe (log2(num_sets) bits)
tag -- everything left; compared against the stored tag in each way of the set

Key geometry parameters:

line size (typically 64 bytes)
number of sets
associativity -- how many lines live in one set (1 = direct-mapped; N = N-way set-associative; unlimited = fully associative)
total size = line size × sets × ways

When a set is full and we need to bring in a new line, a replacement policy chooses a victim. Real caches use approximations of LRU (pseudo-LRU, tree-PLRU, RRIP). On a miss the cache fetches the full line from the next level, not just the requested byte.

Why It Matters Here

You cannot reason about cache hits and misses without knowing the geometry. Two arrays that happen to land in the same set can repeatedly evict each other even though the total size is tiny -- this is conflict misses, and it is the reason power-of-two-strided walks over large buffers often fall off a cliff.

Knowing line size also drives false sharing on multi-core: two threads writing distinct variables that share one 64-byte line ping-pong the line between L1 caches. The variables look independent; the line is not.

Concrete Example

Consider an L1 cache with 32 KiB, 64-byte lines, 8-way associative:

sets = 32 KiB / (64 B × 8) = 64 sets
offset bits = 6 (for 64-byte line)
index bits = 6 (for 64 sets)
tag = remaining upper bits

Address 0x0000_0000_0001_2340:

binary:   ... 0001 0010 0011 0100 0000
offset:   000000        -> byte 0 of line
index:    001101        -> set 13
tag:      ...0001 0010 0       -> compared against 8 ways in set 13

If any of the 8 ways in set 13 holds this tag, hit. Otherwise, miss -- fetch 64 bytes starting at 0x0000_0000_0001_2340 & ~0x3F = 0x0000_0000_0001_2300 and evict the least-recently-used way.

A pathological access sequence: stride of 4 KiB (= 64 sets × 64 B) through a large array. Every access lands in set 0. An 8-way cache can only hold 8 of those lines; the 9th evicts the 1st, and if you later re-read the 1st you miss. Total cache is 32 KiB, but only 512 bytes are usable under this access pattern.

Common Confusion / Misconception

"Direct-mapped is obviously worse than fully associative, so everyone should be fully associative." Fully associative caches require comparing the tag against every line in parallel, which explodes power and area. Real caches settle at 4-16 ways at each level, which handles realistic access patterns without paying the fully-associative cost. Direct-mapped is faster per-access but loses to conflict misses, which is why L1 is usually 4-8 way, L2 8-16 way, L3 16-way or more.

Another trap: caches work on physical addresses at most levels. A virtual address translation (TLB lookup) usually happens in parallel with the L1 probe. When you think "same address hits," confirm the physical address is actually the same.

How To Use It

Look up your machine's cache geometry: getconf -a | grep CACHE, lscpu, or /sys/devices/system/cpu/cpu0/cache/index*/.
When a workload has mysterious drops, check the stride. Power-of-two strides larger than the cache can produce near-100% miss rates on tiny working sets.
Keep hot data on 64-byte lines. Split hot fields from cold ones with alignas(64) if needed.
Use perf stat -e cache-misses,cache-references and perf c2c to catch false sharing.

Check Yourself

How do you split a physical address into tag, set, and offset for a given cache geometry?
Why does a direct-mapped cache perform badly on strided access with stride = cache size?
What is false sharing and why does it survive single-threaded correctness tests?
Why does the cache fetch a whole line, not just the byte you requested?

Mini Drill or Application

Compute hit/miss for these sequences on a 16 KiB, 64-byte line, 4-way cache (64 sets):

Sequence A:  addresses 0x000, 0x040, 0x080, 0x0C0, ..., 0x3C0   (sequential lines)
Sequence B:  addresses 0x0000, 0x4000, 0x8000, 0xC000, 0x10000  (stride 16 KiB)
Sequence C:  addresses 0x000, 0x040, 0x000, 0x040                (hot pair)

For each, count hits, misses, and identify the miss type (compulsory, conflict, capacity).

Extra credit: repeat the analysis treating the cache as fully associative with LRU. Which missed accesses would now hit? The difference is the conflict-miss cost of the actual geometry.

The Three C's of Cache Misses

Every miss classifies as one of:

Compulsory (cold) -- the line has never been in this cache before. Unavoidable except via prefetch.
Capacity -- the working set exceeds the total cache size. Would miss even in a fully associative cache.
Conflict -- would have hit in a fully associative cache but did not hit here because too many lines mapped to the same set.

Diagnosing which category dominates tells you which optimization pays off: prefetching for compulsory, blocking for capacity, padding/stride-shifting for conflict.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

The Three C's of Cache Misses​

Read This Only If Stuck​