Module Quiz

Complete this quiz after finishing all concept and practice pages.

Current Module Questions

Question 1: Fetch-Decode-Execute

What are the four phases of the fetch-decode-execute cycle, and which piece of architectural state is updated at the end of each instruction?

Answer: Fetch (read instruction at PC from memory), Decode (produce control signals and read operand registers), Execute (ALU/address computation, possibly memory access and write-back), and update PC to the next instruction or the branch target.

Question 2: RISC vs CISC

Why is "x86 is CISC so it is slow" a misleading claim in 2025?

Answer: Modern x86_64 cores internally translate each CISC instruction into one or more RISC-like µops that then flow through a RISC-style out-of-order back-end. The ISA is a language; the microarchitecture is the hardware. Front-end decode cost is a real trade-off, but it does not make the back-end slow.

Question 3: Calling Convention

On x86_64 Linux, which register holds the first argument to a function, which holds the return value, and which register is the stack pointer? Name one caller-saved and one callee-saved register.

Answer: First argument %rdi, return value %rax, stack pointer %rsp. Caller-saved: any of %rax, %rcx, %rdx, %rsi, %rdi, %r8-%r11. Callee-saved: any of %rbx, %rbp, %r12-%r15.

Question 4: Disassembly Recognition

A compiled loop looks like: add %rdi,%rdi ... addq $1,(%rdi); add $4,%rdi; cmp %rdx,%rdi; jne .L1. What C construct did this likely come from?

Answer: A for loop incrementing every int in an array, rewritten by the compiler into a pointer walk with an end-pointer comparison (!= against end). Strength-reduced from an index-based loop.

Question 5: Two's Complement and Flags

Given the 8-bit addition 0x7F + 0x01, what is the result and what do the Z, S, C, O flags get set to?

Answer: Result 0x80. Z=0 (nonzero), S=1 (top bit set), C=0 (no carry-out), O=1 (signed overflow: positive + positive -> negative).

Question 6: Performance Prediction -- Row-Major vs Column-Major

For an N=4096 array of double (32 MiB) on an x86_64 desktop, predict roughly the ratio between row-major and column-major traversal time for a sum, and justify the number using 64-byte cache lines.

Answer: Expect a ratio of roughly 8x (plausibly 5x-15x in practice). Row-major touches each 64-byte line once; eight doubles per miss. Column-major strides by N * sizeof(double) = 32 KiB, missing on every access, so ~8x more misses -- and TLB pressure adds more on top.

Question 7: Performance Prediction -- Cache Miss Cost

If an inner loop takes 1.0 ns per iteration when its working set fits in L1, and 6.0 ns per iteration when it does not, roughly how many DRAM misses per iteration does the second regime imply? Assume a DRAM miss costs ~100 ns.

Answer: About (6.0 - 1.0) / 100 ≈ 0.05 misses per iteration -- roughly one miss per 20 iterations, which is consistent with loading one 64-byte line per 16 ints (if the work is one int per iteration).

Question 8: Pipeline Hazards

In a classic 5-stage pipeline with forwarding, why does a load followed immediately by a consuming ALU op still cost one bubble, while two dependent ALU ops do not?

Answer: Forwarding can route a result out of a producing stage into the next cycle's EX. For ALU->ALU the result is ready at the end of EX, one cycle before the consumer needs it -- forwarding closes the gap. For load->ALU the result only exists at the end of MEM, which is one cycle later; the consumer in EX has nothing to forward from yet, so a one-cycle bubble is required.

Question 9: Branch Misprediction

Given a loop whose back-edge is taken n-1 times and not taken once, how many mispredicts does an ideal branch predictor incur? How does the answer change for a random if (a[i] > k) inside the loop?

Answer: The back-edge mispredicts at most once (when the loop exits), so its amortized cost is ~0. A random data-dependent if mispredicts about 50% of the time regardless of predictor -- the cost scales with the pipeline depth (10-20 cycles × n/2 mispredicts).

Question 10: SIMD and Unrolling

A scalar FP reduction loop with one accumulator sustains one FMA per 4 cycles. Explain, in terms of dependency chains, why breaking it into 4 accumulators can give a 4x speedup even without using SIMD.

Answer: With one accumulator the loop's critical path is acc = acc + ..., whose latency (4 cycles on many microarchs) gates throughput. Four independent accumulators make four independent dependency chains; the out-of-order engine can overlap them, so total throughput is limited by FMA issue rate (typically 1/cycle) rather than latency.

Question 11: Cache Geometry

Decompose the address 0x0000_0000_0004_ABCD into tag, set, offset for a 32 KiB, 8-way, 64-byte line L1.

Answer: 64 sets = 6 index bits; 64-byte line = 6 offset bits. Low 6 bits (0xD -> 001101 -> offset 0x0D)? Actually offset is the low 6 bits of 0xABCD = ...1011 1100 1101; lowest 6 = 001101 = 0xD. Next 6 bits (set index) = 101111 = 47. Tag = everything above bit 12: upper bits of 0x0000_0000_0004_ABCD above bit 12 -> 0x4A right-shifted; the key point is that offset = 0x0D, set = 47, tag = the rest.

Question 12: Virtual Memory

Why can a TLB miss cost more than the memory access it is translating?

Answer: Because the page-table walk performs multiple (typically four on x86_64) memory accesses itself, each of which may hit or miss the data caches. If the walk bottoms out in DRAM, the translation alone can cost hundreds of cycles before the translated access can even start.

Question 13: Performance Prediction -- TLB

A program sweeps 2 GiB of memory one byte per 4 KiB page. What mechanism slows it down, and which OS-level knob most directly reduces that cost?

Answer: TLB misses -- there are ~524,288 pages, far more than any TLB. The fix is huge pages (MAP_HUGETLB or madvise(MADV_HUGEPAGE)), where a 2 MiB page covers 512 ordinary pages with a single TLB entry.

Question 14: Hardware-Software Contract

State Amdahl's law and use it to explain why a workload that is "95% parallel" cannot exceed 20x speedup.

Answer: speedup = 1 / ((1-p) + p/N). For p = 0.95 and N -> ∞, the denominator approaches 1 - p = 0.05, giving a maximum speedup of 1 / 0.05 = 20x. The serial fraction caps the total no matter how many cores you add.

Question 15: Roofline

A kernel has arithmetic intensity 0.25 flops/byte. Peak compute is 128 GFLOPS; DRAM bandwidth is 32 GB/s. What is the attainable performance, and is the kernel compute- or memory-bound?

Answer: Attainable = min(128, 0.25 × 32) = min(128, 8) = 8 GFLOPS. Memory-bound. To improve, raise arithmetic intensity (blocking, tiling, fusion) or hide some of the bandwidth cost with prefetch.

Interleaved Review Questions

Prior Module Question 1

Why is int *p = (int *)0x400000 not a safe way to access memory in a modern Linux process?

Answer: Because the address space is virtual and permission-checked. 0x400000 is typically the code segment (read-only, executable), not writable data. A store there would page-fault.

Prior Module Question 2

What is the difference between stack and heap allocation in C?

Answer: Stack is LIFO, auto-released when a function returns, sized at compile time; heap is explicit (malloc/free), outlives the function, may be fragmented.

Prior Module Question 3

What does const T *p promise compared to T * const p?

Answer: const T *p = pointer to constant data; you cannot modify *p. T * const p = constant pointer; you cannot reassign p, but *p is still mutable.

Prior Module Question 4

Why does endianness matter when serializing a uint32_t to bytes?

Answer: Different machines order the four bytes differently in memory. Writing the raw bytes on a little-endian machine and reading them on a big-endian machine reinterprets the integer unless a canonical byte order is chosen.

Prior Module Question 5

What does -O2 typically do beyond -O0?

Answer: Keeps locals in registers, inlines small functions, unrolls simple loops, performs common-subexpression elimination and strength reduction, reorders instructions to avoid stalls, and eliminates dead code.

Self-Assessment and Remediation

Mastery Level (90-100% correct):

Ready to advance with confidence.

Proficient Level (75-89% correct):

Review the missed concept pages and redo the relevant practice experiment with measured numbers.

Developing Level (60-74% correct):

Rework the Memory Hierarchy Workshop and the Pipelining Clinic. Rebuild the mental cost model for a cache miss and a mispredict before advancing.

Insufficient Level (<60% correct):

Return to Cluster 1 and walk through disassembly for five simple functions cold. Then repeat this quiz.

Current Module Questions​

Question 1: Fetch-Decode-Execute​

Question 2: RISC vs CISC​

Question 3: Calling Convention​

Question 4: Disassembly Recognition​

Question 5: Two's Complement and Flags​

Question 6: Performance Prediction -- Row-Major vs Column-Major​

Question 7: Performance Prediction -- Cache Miss Cost​

Question 8: Pipeline Hazards​

Question 9: Branch Misprediction​

Question 10: SIMD and Unrolling​

Question 11: Cache Geometry​

Question 12: Virtual Memory​

Question 13: Performance Prediction -- TLB​

Question 14: Hardware-Software Contract​

Question 15: Roofline​

Interleaved Review Questions​

Prior Module Question 1​

Prior Module Question 2​

Prior Module Question 3​

Prior Module Question 4​

Prior Module Question 5​

Self-Assessment and Remediation​