Skip to main content

Code Katas

Short, repeated drills. 15-25 minutes each. Do them more than once; speed and fluency are the goal, not novelty.

Kata 1: Godbolt Sprint (20 minutes)

Open Compiler Explorer with gcc -O2 -std=c11 for x86_64.

For each of these 10 C snippets, write the function, read the assembly, and annotate every instruction out loud:

int id(int x);
int inc(int x);
int neg(int x);
int is_neg(int x);
int max2(int a, int b);
int min2(int a, int b);
long sum1(const long *a, int n);
long sum_even(const long *a, int n);
int list_len(struct node { int v; struct node *next; } *h);
int streq(const char *a, const char *b);

Target: under one minute per function once you are fluent.

Kata 2: Cache-Friendly Matrix Traversal (25 minutes)

Using the benchmark harness from Memory Hierarchy and Cache Workshop:

  1. Time row-major, column-major, and blocked (B=32) sums of an N × N matrix for N ∈ {256, 1024, 4096}.
  2. Record time-per-element in nanoseconds for each (traversal, N) pair.
  3. Produce a 9-cell table and write a one-paragraph interpretation.

Target outcomes:

  • column-major slower by 5x-15x vs row-major at the largest N
  • blocked version faster than plain row-major once the square matrix stops fitting in L2
  • clear cache-size "kneepoints" visible in your numbers

Kata 3: Small Benchmark Harness (20 minutes)

Write a reusable microbenchmark in C:

// bench.c -- usage: bench("name", fn, repeats, warmup)
// - picks a clock with nanosecond resolution (clock_gettime CLOCK_MONOTONIC)
// - runs `warmup` times before timing
// - reports min, median, max over `repeats` runs
// - measures CPU-cycle counter via __rdtscp where available as a sanity check

Validate it by running memcpy on 1 MiB and checking that the measured bandwidth is within 30% of published DRAM bandwidth for your CPU.

Target outcomes:

  • stable numbers: min, median, and max within 10% of each other for a well-behaved kernel
  • you trust the numbers enough to reuse the harness for future experiments

Kata 4: Disassemble and Predict (20 minutes)

Pick any function you wrote in Semester 4 Module 1 or 2 and:

  1. Compile with gcc -O2 -g and run objdump -d --source.
  2. For each inner loop, predict:
    • instructions per iteration
    • number of loads and stores per iteration
    • whether the loop is compute-, memory-, or branch-bound
  3. Check your predictions with perf stat.

Target: over time, your predictions should match measurements within ~20% for simple loops.

Kata 5: Flag Reading (15 minutes)

For each x86_64 snippet, compute the final state of flags Z, S, C, O by hand, then verify under gdb:

; initial rax=5, rbx=3
sub %rbx, %rax
; initial rax=3, rbx=5
sub %rbx, %rax
; initial rax=0x7FFFFFFF_FFFFFFFF (max signed), rbx=1
add %rbx, %rax
; initial rax=0xFFFFFFFF_FFFFFFFF, rbx=1
add %rbx, %rax

Target: correct flag state in under 30 seconds per instruction.

Kata 6: Branchless Rewrite (15 minutes)

Rewrite each of these in branchless form and time both versions on random input:

int clamp(int x, int lo, int hi);
int sign(int x);
int abs_int(int x);
int max3(int a, int b, int c);

Measure: perf stat -e branch-misses before and after. Confirm the branchless version drops mispredicts near zero.

Kata 7: TLB Probe (25 minutes)

Touch one byte every 4 KiB across progressively larger arrays:

for (size_t pages = 16; pages <= 1 << 20; pages *= 2) {
size_t bytes = pages * 4096;
char *p = mmap(NULL, bytes, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// touch first byte of each page, then time a repeated sweep
// measure perf dTLB-load-misses
}

Identify where dTLB-load-misses starts rising -- that is roughly the L2 TLB coverage. Compare with and without MAP_HUGETLB.

Evidence Check

Complete when you have:

  • at least three katas with recorded measurements in a notebook
  • at least one disassembly annotation you can narrate cold
  • at least one microbenchmark whose numbers you trust within 10%