Cache-Aware Programming: Locality and Memory Access Patterns
What This Concept Is
Once you know the cache exists, writing code for it becomes a discipline in its own right. The techniques are small in number:
- row-major traversal -- walk memory in the order it is stored (inner loop strides by one element)
- blocking / tiling -- process a matrix one
B × Btile at a time so each tile's working set fits in L1 or L2 - structure-of-arrays (SoA) -- if you only touch two fields of a struct, store each field in its own array so a line holds N copies of the hot field
- alignment and padding -- align hot structures to
64so they start at a line boundary; pad independent fields that are written by different threads so they do not share a line (false sharing) - prefetching -- hint the cache about a future load (
__builtin_prefetch) so the line arrives before the load - streaming stores -- bypass the cache when you know the data will not be read back soon (
_mm_stream_si128)
All of these come from one idea: make the next access the cache is about to miss on happen far enough ahead that it does not stall the core.
Why It Matters Here
Cache-aware programming is where theory meets a stopwatch. The differences it produces are routinely 5x-20x on real code with the same asymptotic complexity. It is also the most underestimated source of performance in software engineering -- most developers never measure it.
Concrete Example
Row-major vs column-major traversal of a 2D matrix stored row-major (C convention):
// Row-major: inner stride = sizeof(double), hits cache line after line
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
sum += A[i][j];
// Column-major over row-major storage: inner stride = N * sizeof(double)
for (int j = 0; j < N; ++j)
for (int i = 0; i < N; ++i)
sum += A[i][j];
For N = 4096, double (so 32 MiB matrix, does not fit in L2):
- Row-major touches each 64-byte line once; ~8 elements per miss; effective bandwidth = peak DRAM read.
- Column-major touches a different line every iteration; ~1 element per miss; effective bandwidth = peak DRAM divided by 8 -- and that is before the TLB starts missing.
Measured on a typical x86_64 desktop: 0.4 seconds row-major vs 3-6 seconds column-major. Identical computation, identical output.
Blocking turns a naïve O(N^3) matrix multiply with O(N^3) misses into one with O(N^3 / B) misses, where B^2 fits in L1:
for (int ii = 0; ii < N; ii += B)
for (int jj = 0; jj < N; jj += B)
for (int kk = 0; kk < N; kk += B)
for (int i = ii; i < ii + B; ++i)
for (int j = jj; j < jj + B; ++j)
for (int k = kk; k < kk + B; ++k)
C[i][j] += A[i][k] * B[k][j];
Choose B so that three B × B blocks (one for each of A, B, C) fit in L1 or L2. With B = 32 and double, three blocks = 24 KiB ≈ L1. Empirically this yields 3x-10x speedups over the unblocked version.
Common Confusion / Misconception
"Big-O is big-O; locality is just micro-optimization." For matrix multiply this is dramatically false: the cost model changes which algorithm is fastest at realistic N. The same confusion explains why std::vector beats std::list for almost all workloads below a million elements -- the list's O(1) insert is outweighed by the pointer-chasing misses.
Another trap: "I added __builtin_prefetch and got no speedup." Prefetch only helps when (a) the hardware prefetcher was not already doing the job, and (b) the prefetch distance is tuned (too close = useless, too far = evicted before use). The hardware is usually smart enough to prefetch linear walks.
How To Use It
- Start with locality in the large: pick data layouts and traversal orders that stride by 1 element.
- For nested loops on big arrays, apply loop interchange when the innermost loop's stride through memory is larger than a line.
- When even a single pass of the innermost structure overflows L1, apply blocking.
- For reduce/update patterns, split accumulators and use SoA so independent work does not share lines.
- Only reach for explicit prefetches and streaming stores after the layout is right. Those are final-percent tools, not structural fixes.
Check Yourself
- Why does row-major traversal of a row-major matrix beat the column-major order by almost an order of magnitude?
- What does blocking do to the miss count for matrix multiply, and why is it dependent on L1 size?
- When is AoS (array of structs) worse than SoA (struct of arrays)?
- Why is explicit prefetching usually the last thing to try, not the first?
Mini Drill or Application
Implement and time three versions of a 2D sum:
double sumA(const double *A, int N) { /* inner loop over j */ }
double sumB(const double *A, int N) { /* inner loop over i */ }
double sumC(const double *A, int N) { /* blocked, B=64 */ }
For N ∈ {256, 1024, 4096}, print time-per-element and interpret. Then add perf stat -e L1-dcache-load-misses,LLC-load-misses to confirm which version misses most.
Read This Only If Stuck
- Computer Organization and Design: 5.3 Measuring and Improving Cache Performance
- Computer Organization and Design: 5.3 Measuring and Improving Cache Performance (Part 2)
- Computer Organization and Design: 5.8 Parallelism and Memory Hierarchies -- Cache Coherence
- Computer Organization and Design: 5.11 Fallacies and Pitfalls
- External: Ulrich Drepper, "What Every Programmer Should Know About Memory" (LWN)
- External: Agner Fog, Software Optimization Resources