Memory Hierarchy and Cache Workshop
Retrieval Prompts
- Give order-of-magnitude latencies for L1, L2, L3, DRAM, NVMe, and rotating disk.
- Define temporal and spatial locality in one sentence each.
- Decompose a physical address into tag, set, offset for a
32 KiB,8-way,64-byte linecache. - State the difference between compulsory, conflict, and capacity misses.
- Define false sharing.
Compare and Distinguish
Separate these pairs clearly:
- temporal vs spatial locality
- direct-mapped vs fully associative
- compulsory vs conflict vs capacity miss
- AoS vs SoA
- row-major vs column-major traversal
Common Mistake Check
For each statement, identify the error:
- "An SSD's low latency means cache hierarchy is no longer important."
- "Prefetching always helps because it brings data in earlier."
- "If two threads write different variables they cannot interfere."
- "
std::list<T>is faster thanstd::vector<T>for insertion-heavy workloads regardless of size." - "Cache misses are rare for code that has good big-O complexity."
Measurable Evidence: Row-Major vs Column-Major
Write a C program matrix_traverse.c:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 4096
static double A[N][N];
double sum_row_major(void) {
double s = 0;
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j) s += A[i][j];
return s;
}
double sum_col_major(void) {
double s = 0;
for (int j = 0; j < N; ++j)
for (int i = 0; i < N; ++i) s += A[i][j];
return s;
}
int main(void) {
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j) A[i][j] = (double)(i + j);
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
double a = sum_row_major();
clock_gettime(CLOCK_MONOTONIC, &t1);
double dt_r = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
clock_gettime(CLOCK_MONOTONIC, &t0);
double b = sum_col_major();
clock_gettime(CLOCK_MONOTONIC, &t1);
double dt_c = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("row=%.3fs col=%.3fs ratio=%.2fx (sums %.0f / %.0f)\n",
dt_r, dt_c, dt_c / dt_r, a, b);
return 0;
}
Compile with gcc -O2 -o matrix_traverse matrix_traverse.c and run multiple times. Record the ratio.
Target outcomes:
- observe row-major is typically 5x-15x faster than column-major
- explain the ratio in terms of 64-byte cache lines and DRAM bandwidth
- confirm with
perf stat -e L1-dcache-load-misses,LLC-load-missesthat the column-major run has roughly 8x more L1 misses
Blocking Practice
Extend the experiment to a matrix multiply C = A * B for N = 1024:
- Write a naïve
ijk-loop version. - Write a blocked version with
B = 32and compare. - Report time and GFLOPS (2 × N^3 / time).
Target outcomes:
- 2x-5x speedup from blocking at
N = 1024 - larger speedup as
Ngrows
Address Decomposition Drill
Given a 64 KiB, 8-way, 64-byte line cache (128 sets):
- Decompose addresses
0x0001_4040,0x0001_C040,0x0002_4040into tag/set/offset. - For sequence
[0x0001_4040, 0x0001_4080, 0x0001_C040, 0x0002_4040, 0x0001_4040], predict hits and misses assuming the cache starts empty. - Name the miss type for each miss.
Evidence Check
This page is complete only if you produced at least one benchmark report comparing row-major and column-major, with measured times and a one-paragraph explanation citing cache line size and effective bandwidth.