Memory Hierarchy and Cache Workshop

Retrieval Prompts

Give order-of-magnitude latencies for L1, L2, L3, DRAM, NVMe, and rotating disk.
Define temporal and spatial locality in one sentence each.
Decompose a physical address into tag, set, offset for a 32 KiB, 8-way, 64-byte line cache.
State the difference between compulsory, conflict, and capacity misses.
Define false sharing.

Compare and Distinguish

Separate these pairs clearly:

temporal vs spatial locality
direct-mapped vs fully associative
compulsory vs conflict vs capacity miss
AoS vs SoA
row-major vs column-major traversal

Common Mistake Check

For each statement, identify the error:

"An SSD's low latency means cache hierarchy is no longer important."
"Prefetching always helps because it brings data in earlier."
"If two threads write different variables they cannot interfere."
"std::list<T> is faster than std::vector<T> for insertion-heavy workloads regardless of size."
"Cache misses are rare for code that has good big-O complexity."

Measurable Evidence: Row-Major vs Column-Major

Write a C program matrix_traverse.c:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096
static double A[N][N];

double sum_row_major(void) {
    double s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) s += A[i][j];
    return s;
}

double sum_col_major(void) {
    double s = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) s += A[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) A[i][j] = (double)(i + j);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double a = sum_row_major();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double dt_r = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double b = sum_col_major();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double dt_c = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("row=%.3fs col=%.3fs ratio=%.2fx  (sums %.0f / %.0f)\n",
           dt_r, dt_c, dt_c / dt_r, a, b);
    return 0;
}

Compile with gcc -O2 -o matrix_traverse matrix_traverse.c and run multiple times. Record the ratio.

Target outcomes:

observe row-major is typically 5x-15x faster than column-major
explain the ratio in terms of 64-byte cache lines and DRAM bandwidth
confirm with perf stat -e L1-dcache-load-misses,LLC-load-misses that the column-major run has roughly 8x more L1 misses

Blocking Practice

Extend the experiment to a matrix multiply C = A * B for N = 1024:

Write a naïve ijk-loop version.
Write a blocked version with B = 32 and compare.
Report time and GFLOPS (2 × N^3 / time).

Target outcomes:

2x-5x speedup from blocking at N = 1024
larger speedup as N grows

Address Decomposition Drill

Given a 64 KiB, 8-way, 64-byte line cache (128 sets):

Decompose addresses 0x0001_4040, 0x0001_C040, 0x0002_4040 into tag/set/offset.
For sequence [0x0001_4040, 0x0001_4080, 0x0001_C040, 0x0002_4040, 0x0001_4040], predict hits and misses assuming the cache starts empty.
Name the miss type for each miss.

Evidence Check

This page is complete only if you produced at least one benchmark report comparing row-major and column-major, with measured times and a one-paragraph explanation citing cache line size and effective bandwidth.

Retrieval Prompts​

Compare and Distinguish​

Common Mistake Check​

Measurable Evidence: Row-Major vs Column-Major​

Blocking Practice​

Address Decomposition Drill​

Evidence Check​