Cache-Aware Programming: Locality and Memory Access Patterns

What This Concept Is

Once you know the cache exists, writing code for it becomes a discipline in its own right. The techniques are small in number:

row-major traversal -- walk memory in the order it is stored (inner loop strides by one element)
blocking / tiling -- process a matrix one B × B tile at a time so each tile's working set fits in L1 or L2
structure-of-arrays (SoA) -- if you only touch two fields of a struct, store each field in its own array so a line holds N copies of the hot field
alignment and padding -- align hot structures to 64 so they start at a line boundary; pad independent fields that are written by different threads so they do not share a line (false sharing)
prefetching -- hint the cache about a future load (__builtin_prefetch) so the line arrives before the load
streaming stores -- bypass the cache when you know the data will not be read back soon (_mm_stream_si128)

All of these come from one idea: make the next access the cache is about to miss on happen far enough ahead that it does not stall the core.

Why It Matters Here

Cache-aware programming is where theory meets a stopwatch. The differences it produces are routinely 5x-20x on real code with the same asymptotic complexity. It is also the most underestimated source of performance in software engineering -- most developers never measure it.

Concrete Example

Row-major vs column-major traversal of a 2D matrix stored row-major (C convention):

// Row-major: inner stride = sizeof(double), hits cache line after line
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        sum += A[i][j];

// Column-major over row-major storage: inner stride = N * sizeof(double)
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        sum += A[i][j];

For N = 4096, double (so 32 MiB matrix, does not fit in L2):

Row-major touches each 64-byte line once; ~8 elements per miss; effective bandwidth = peak DRAM read.
Column-major touches a different line every iteration; ~1 element per miss; effective bandwidth = peak DRAM divided by 8 -- and that is before the TLB starts missing.

Measured on a typical x86_64 desktop: 0.4 seconds row-major vs 3-6 seconds column-major. Identical computation, identical output.

Blocking turns a naïve O(N^3) matrix multiply with O(N^3) misses into one with O(N^3 / B) misses, where B^2 fits in L1:

for (int ii = 0; ii < N; ii += B)
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = ii; i < ii + B; ++i)
                for (int j = jj; j < jj + B; ++j)
                    for (int k = kk; k < kk + B; ++k)
                        C[i][j] += A[i][k] * B[k][j];

Choose B so that three B × B blocks (one for each of A, B, C) fit in L1 or L2. With B = 32 and double, three blocks = 24 KiB ≈ L1. Empirically this yields 3x-10x speedups over the unblocked version.

Common Confusion / Misconception

"Big-O is big-O; locality is just micro-optimization." For matrix multiply this is dramatically false: the cost model changes which algorithm is fastest at realistic N. The same confusion explains why std::vector beats std::list for almost all workloads below a million elements -- the list's O(1) insert is outweighed by the pointer-chasing misses.

Another trap: "I added __builtin_prefetch and got no speedup." Prefetch only helps when (a) the hardware prefetcher was not already doing the job, and (b) the prefetch distance is tuned (too close = useless, too far = evicted before use). The hardware is usually smart enough to prefetch linear walks.

How To Use It

Start with locality in the large: pick data layouts and traversal orders that stride by 1 element.
For nested loops on big arrays, apply loop interchange when the innermost loop's stride through memory is larger than a line.
When even a single pass of the innermost structure overflows L1, apply blocking.
For reduce/update patterns, split accumulators and use SoA so independent work does not share lines.
Only reach for explicit prefetches and streaming stores after the layout is right. Those are final-percent tools, not structural fixes.

Check Yourself

Why does row-major traversal of a row-major matrix beat the column-major order by almost an order of magnitude?
What does blocking do to the miss count for matrix multiply, and why is it dependent on L1 size?
When is AoS (array of structs) worse than SoA (struct of arrays)?
Why is explicit prefetching usually the last thing to try, not the first?

Mini Drill or Application

Implement and time three versions of a 2D sum:

double sumA(const double *A, int N) { /* inner loop over j */ }
double sumB(const double *A, int N) { /* inner loop over i */ }
double sumC(const double *A, int N) { /* blocked, B=64 */ }

For N ∈ {256, 1024, 4096}, print time-per-element and interpret. Then add perf stat -e L1-dcache-load-misses,LLC-load-misses to confirm which version misses most.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​