Module 3: Computer Organization & Architecture
Primary text: Computer Organization and Design (Patterson & Hennessy) Selective support: CODE for conceptual foundations; The C Programming Language as a cross-reference for recognizing compiled patterns; Agner Fog's microarchitecture manuals, Drepper's What Every Programmer Should Know About Memory, and Compiler Explorer for real hardware behavior
This guide is the primary teacher. You do not need to read Computer Organization and Design front-to-back to complete this module. You do need to become operationally strong at reading assembly, reasoning about the memory hierarchy, and predicting performance from a piece of code you can actually see.
Scope of This Module
This module is not a hardware tour. It is where code starts producing measurable behavior on a real machine.
What it covers in depth:
- instruction set architectures, RISC/CISC trade-offs, and the fetch-decode-execute cycle as an actual clocked loop
- the register file, program counter, stack pointer, and the calling convention they impose on C code
- reading disassembly from
objdumpor Compiler Explorer and recognizing compiled patterns for loops, branches, and function calls - the ALU, two's complement arithmetic, multiplication, and the shape of a simple datapath
- control flow at the machine level: branches, direct and indirect calls, returns, and how
retfinds its target - floating-point representation and the price of doubles versus ints
- the memory hierarchy from registers down to disk, in concrete latency units
- cache organization (lines, sets, associativity, replacement) and how a miss reshapes runtime
- writing code that the cache can serve: row-major traversal, blocking, alignment, and why "same algorithm, same O()" can differ by 10x
- the classic five-stage pipeline, hazards, and the hardware techniques that hide them
- superscalar issue, out-of-order execution, SIMD, and speculation as realistic models of modern cores
- memory-mapped I/O, interrupts, DMA, and how devices talk to the CPU without burning cycles
- virtual memory as an address-translation layer backed by page tables and the TLB
- the hardware-software contract: what the ISA promises, what the microarchitecture delivers, and where performance actually comes from
What it deliberately does not try to finish here:
- full OS-level virtual memory, scheduling, or driver work (that is Module 4 and later semesters)
- compiler backend internals beyond recognizing generated code
- digital logic design at the gate level (covered lightly; follow MIT 6.004 if you want depth)
- GPU architecture and distributed memory consistency models
This is an in-depth foundation module. If you can explain the C, but not how the C runs, you are not done.
Before You Start
Answer these closed-book before starting the main path:
- What does "an instruction is fetched, decoded, and executed" actually mean in terms of memory accesses?
- Why does a
structof twoints fit nicely in a register pair on many machines, but a 64-byte object never does? - If a cache line is 64 bytes and an
intis 4 bytes, how many integers share a line, and why does that matter for performance? - What is the difference between a branch being "taken" and a branch being "correctly predicted"?
- Why does virtual memory mean two processes can both use address
0x400000without colliding?
Diagnostic Interpretation
4-5 solid answers
- You are ready for the full path.
2-3 solid answers
- Continue, but expect extra time in Clusters 3 and 4 (memory hierarchy, pipelining).
0-1 solid answers
- Revisit Module 2: Memory, Pointers & Machine Representation first. This module assumes you can already read a pointer and explain stack layout.
What This Module Is For
Architecture is the language performance speaks. Later systems work repeatedly asks questions like:
- why is this loop 5x slower than that loop when they look the same?
- why does
std::vector<T>beatstd::list<T>even when asymptotic complexity favours the list? - what is the cost of a function call, a cache miss, a branch mispredict, a page fault?
- when is it worth writing SIMD by hand or using
__builtin_prefetch? - how do you read a flame graph, a
perfreport, or anobjdumplisting and decide what to change?
This module builds the performance-reasoning vocabulary needed for:
- writing cache-aware data structures in C and C++
- debugging performance regressions with
perf,vtune, orsamply - understanding operating systems, compilers, and linkers without treating them as magic
- making informed trade-offs between clarity, portability, and speed
You are learning to predict how code runs on real hardware, not just what it computes.
Concept Map
How To Use This Module
Work in order. Clusters 3-5 only make sense after you can read an instruction and trace it through a register file.
Cluster 1: The ISA and the Stored-Program Machine
| Order | Concept | Type | Focus |
|---|---|---|---|
| 1 | Instruction Set Architectures, RISC vs CISC, and the Fetch-Decode-Execute Cycle | PRIMARY | What an ISA is, what it promises, and how the CPU's main loop actually runs |
| 2 | Registers, the Program Counter, and the Stack Pointer | PRIMARY | The small, fast, named storage a program actually operates on |
| 3 | From C to Assembly: Reading Disassembly and Recognizing Patterns | PRIMARY | Mapping source constructs to the machine code a compiler generates |
Cluster mastery check: Given a short C function and its disassembly, can you identify the prologue, body, and epilogue and explain every register use?
Cluster 2: Arithmetic, Logic, and Control
| Order | Concept | Type | Focus |
|---|---|---|---|
| 4 | The ALU and How Numbers Are Added and Multiplied | PRIMARY | Two's complement, addition, multiplication, and flag-setting |
| 5 | Control Flow at the Machine Level: Branches, Calls, and Returns | PRIMARY | Compare, jump, call, return, and how the stack keeps track |
| 6 | Floating-Point Operations and Hardware Support | SUPPORTING | IEEE-754 basics, where FP hardware lives, and why doubles cost what they cost |
Cluster mastery check: Can you predict the flags set by sub %rbx, %rax and explain how the following je turns into a taken or not-taken branch?
Cluster 3: Memory Hierarchy and Cache
| Order | Concept | Type | Focus |
|---|---|---|---|
| 7 | The Memory Hierarchy: Registers, Cache, DRAM, and Disk | PRIMARY | Latencies, sizes, and why every level exists |
| 8 | Cache Organization: Lines, Sets, Associativity, and Replacement | PRIMARY | How an address turns into a set index, a tag, and a hit or miss |
| 9 | Cache-Aware Programming: Locality and Memory Access Patterns | PRIMARY | Row-major vs column-major, blocking, and measuring the difference |
Cluster mastery check: Can you explain why sum += A[i][j] with j in the inner loop can run ~10x faster than the same code with i in the inner loop, using a cache diagram?
Cluster 4: Pipelining and Parallel Execution
| Order | Concept | Type | Focus |
|---|---|---|---|
| 10 | The Classic Five-Stage Pipeline | PRIMARY | IF/ID/EX/MEM/WB as overlapping work |
| 11 | Hazards: Data, Control, Structural -- and How to Resolve Them | PRIMARY | Why pipelines stall and the tricks that hide the cost |
| 12 | Superscalar, Out-of-Order, SIMD, and Speculation | SUPPORTING | How real cores go beyond one-instruction-per-cycle |
Cluster mastery check: Given a short instruction stream with a load-use dependency, can you identify where a stall is needed and explain how forwarding or reordering removes it?
Cluster 5: I/O and Virtual Memory Basics
| Order | Concept | Type | Focus |
|---|---|---|---|
| 13 | Memory-Mapped I/O, Interrupts, and DMA | SUPPORTING | How the CPU talks to devices without polling forever |
| 14 | Virtual Memory: Pages, Page Tables, and the TLB | PRIMARY | The address-translation layer under every user process |
| 15 | How the Hardware-Software Contract Shapes Performance | PRIMARY | Amdahl, CPI, the roofline model, and why architecture decides what is fast |
Cluster mastery check: Can you explain, for a specific workload, which level of the stack (ISA, microarchitecture, cache, VM, or software) is the current bottleneck, and how you would confirm it?
Then work these practice pages:
| Order | Practice path | Focus |
|---|---|---|
| 1 | ISA and Disassembly Lab | Reading x86_64 and RISC-V assembly from real compilers |
| 2 | Memory Hierarchy and Cache Workshop | Measuring locality effects and interpreting cache counters |
| 3 | Pipelining and Optimization Clinic | Hazards, forwarding, and simple microbenchmark reasoning |
| 4 | Code Katas | Compiler Explorer drills, matrix traversal benchmarks, and disassembly sprints |
Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.
Learning Objectives
By the end of this module you should be able to:
- Explain what an ISA is, distinguish RISC from CISC, and walk through one fetch-decode-execute cycle at the register-transfer level.
- Identify the role of the program counter, stack pointer, and general-purpose registers in a running C program.
- Read disassembly from
gcc -Sor Compiler Explorer and match it back to source constructs. - Describe the ALU, two's complement arithmetic, and how multiplication is typically implemented.
- Trace control flow through compare, branch, call, and return instructions, including the use of the stack.
- Explain IEEE-754 floating-point and estimate the cost of FP versus integer work.
- Rank storage tiers by latency and size and predict where a working set lives.
- Decode an address into tag/set/offset for a given cache geometry and compute hit/miss for an access sequence.
- Measure and explain row-major vs column-major matrix traversal performance, and apply blocking for a larger working set.
- Diagram the five-stage pipeline and identify data, control, and structural hazards.
- Explain superscalar issue, out-of-order execution, SIMD, and speculation in terms of overlapping useful work.
- Describe memory-mapped I/O, interrupts, and DMA at the level of "what the CPU does when a device becomes ready."
- Walk through a virtual-address translation, including page-table levels and TLB hits/misses.
- Use Amdahl's law, CPI, and the roofline model to diagnose whether a program is compute-, memory-, or latency-bound.
Outputs
- a disassembly notebook with at least 10 annotated functions (C source next to generated x86_64 or RISC-V)
- one "cache experiment" lab report with measured timings for row-major vs column-major traversal across three matrix sizes
- one blocked matrix multiply implementation with a before/after microbenchmark
- one hazard-analysis sheet showing at least 6 short instruction streams with stall/forward reasoning
- one simple
perf stator equivalent run on a workload with cycles, instructions, cache-misses, and branch-mispredicts interpreted - one page translation diagram for a specific virtual address on a two-level page table
- one mistake log naming at least 10 reasoning errors such as
wrong cache geometry,confused taken with predicted,ignored the TLB, orcompared wall time without warm-up - one short memo connecting Module 3 tools to Module 4 (systems programming) and Semester 5 (operating systems)
Completion Standard
You have completed Module 3 when all of these are true:
- you can read a short disassembly listing and narrate every instruction
- you can predict, before measuring, whether a loop will be compute-bound or memory-bound
- you can explain why a cache miss costs hundreds of cycles and a branch mispredict costs tens
- you can describe virtual memory without collapsing pages and page tables into one thing
- you can write a short benchmark that produces believable numbers and explain the warm-up
If the code works but you cannot explain what the hardware did to make it work, the module is not complete.
Reading Policy
- Concept pages are the main path.
- Local book chunks are selective reinforcement, not a second syllabus.
Read only if stuckmeans try the concept page, self-check, and drill first.Optional deep divemeans additional nuance or exercise volume, not required progression.- Because this module is where performance becomes measurable, you should actually run the benchmarks you are asked to write; theory without numbers is not enough here.
Suggested Weekly Flow
| Day | Work |
|---|---|
| 1 | Concepts 1-3 and a Compiler Explorer session where you disassemble five small functions |
| 2 | Concepts 4-6 and one sheet of compare/branch/call tracing by hand |
| 3 | Concepts 7-9 and the row-major vs column-major benchmark |
| 4 | Concepts 10-12 and one hazard-analysis sheet |
| 5 | Concepts 13-15 and one perf stat run on a real program |
| 6 | Practice pages 1-2 and targeted local-book reinforcement |
| 7 | Practice pages 3-4, quiz, and mistake-log cleanup |
Reference
If you need exact links into the local chunked books, use Reference and Selective Reading.
Use three levels depending on depth: Toy Computer if you want to design the ISA and datapath yourself, Emulator if you want compatibility with an existing machine, and Source Code to Machine Code if you want to see how a compiler targets the machine. See Build Your Own X overview.
Rich Learning Pages
Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread