Code Katas
Short, repeated drills. 15-25 minutes each. Do them more than once; speed and fluency are the goal, not novelty.
Kata 1: Godbolt Sprint (20 minutes)
Open Compiler Explorer with gcc -O2 -std=c11 for x86_64.
For each of these 10 C snippets, write the function, read the assembly, and annotate every instruction out loud:
int id(int x);
int inc(int x);
int neg(int x);
int is_neg(int x);
int max2(int a, int b);
int min2(int a, int b);
long sum1(const long *a, int n);
long sum_even(const long *a, int n);
int list_len(struct node { int v; struct node *next; } *h);
int streq(const char *a, const char *b);
Target: under one minute per function once you are fluent.
Kata 2: Cache-Friendly Matrix Traversal (25 minutes)
Using the benchmark harness from Memory Hierarchy and Cache Workshop:
- Time row-major, column-major, and blocked (
B=32) sums of anN × Nmatrix forN ∈ {256, 1024, 4096}. - Record time-per-element in nanoseconds for each
(traversal, N)pair. - Produce a 9-cell table and write a one-paragraph interpretation.
Target outcomes:
- column-major slower by 5x-15x vs row-major at the largest
N - blocked version faster than plain row-major once the square matrix stops fitting in L2
- clear cache-size "kneepoints" visible in your numbers
Kata 3: Small Benchmark Harness (20 minutes)
Write a reusable microbenchmark in C:
// bench.c -- usage: bench("name", fn, repeats, warmup)
// - picks a clock with nanosecond resolution (clock_gettime CLOCK_MONOTONIC)
// - runs `warmup` times before timing
// - reports min, median, max over `repeats` runs
// - measures CPU-cycle counter via __rdtscp where available as a sanity check
Validate it by running memcpy on 1 MiB and checking that the measured bandwidth is within 30% of published DRAM bandwidth for your CPU.
Target outcomes:
- stable numbers: min, median, and max within 10% of each other for a well-behaved kernel
- you trust the numbers enough to reuse the harness for future experiments
Kata 4: Disassemble and Predict (20 minutes)
Pick any function you wrote in Semester 4 Module 1 or 2 and:
- Compile with
gcc -O2 -gand runobjdump -d --source. - For each inner loop, predict:
- instructions per iteration
- number of loads and stores per iteration
- whether the loop is compute-, memory-, or branch-bound
- Check your predictions with
perf stat.
Target: over time, your predictions should match measurements within ~20% for simple loops.
Kata 5: Flag Reading (15 minutes)
For each x86_64 snippet, compute the final state of flags Z, S, C, O by hand, then verify under gdb:
; initial rax=5, rbx=3
sub %rbx, %rax
; initial rax=3, rbx=5
sub %rbx, %rax
; initial rax=0x7FFFFFFF_FFFFFFFF (max signed), rbx=1
add %rbx, %rax
; initial rax=0xFFFFFFFF_FFFFFFFF, rbx=1
add %rbx, %rax
Target: correct flag state in under 30 seconds per instruction.
Kata 6: Branchless Rewrite (15 minutes)
Rewrite each of these in branchless form and time both versions on random input:
int clamp(int x, int lo, int hi);
int sign(int x);
int abs_int(int x);
int max3(int a, int b, int c);
Measure: perf stat -e branch-misses before and after. Confirm the branchless version drops mispredicts near zero.
Kata 7: TLB Probe (25 minutes)
Touch one byte every 4 KiB across progressively larger arrays:
for (size_t pages = 16; pages <= 1 << 20; pages *= 2) {
size_t bytes = pages * 4096;
char *p = mmap(NULL, bytes, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// touch first byte of each page, then time a repeated sweep
// measure perf dTLB-load-misses
}
Identify where dTLB-load-misses starts rising -- that is roughly the L2 TLB coverage. Compare with and without MAP_HUGETLB.
Evidence Check
Complete when you have:
- at least three katas with recorded measurements in a notebook
- at least one disassembly annotation you can narrate cold
- at least one microbenchmark whose numbers you trust within 10%