Module 3: Computer Organization & Architecture

Primary text: Computer Organization and Design (Patterson & Hennessy) Selective support: CODE for conceptual foundations; The C Programming Language as a cross-reference for recognizing compiled patterns; Agner Fog's microarchitecture manuals, Drepper's What Every Programmer Should Know About Memory, and Compiler Explorer for real hardware behavior

This guide is the primary teacher. You do not need to read Computer Organization and Design front-to-back to complete this module. You do need to become operationally strong at reading assembly, reasoning about the memory hierarchy, and predicting performance from a piece of code you can actually see.

Scope of This Module

This module is not a hardware tour. It is where code starts producing measurable behavior on a real machine.

What it covers in depth:

instruction set architectures, RISC/CISC trade-offs, and the fetch-decode-execute cycle as an actual clocked loop
the register file, program counter, stack pointer, and the calling convention they impose on C code
reading disassembly from objdump or Compiler Explorer and recognizing compiled patterns for loops, branches, and function calls
the ALU, two's complement arithmetic, multiplication, and the shape of a simple datapath
control flow at the machine level: branches, direct and indirect calls, returns, and how ret finds its target
floating-point representation and the price of doubles versus ints
the memory hierarchy from registers down to disk, in concrete latency units
cache organization (lines, sets, associativity, replacement) and how a miss reshapes runtime
writing code that the cache can serve: row-major traversal, blocking, alignment, and why "same algorithm, same O()" can differ by 10x
the classic five-stage pipeline, hazards, and the hardware techniques that hide them
superscalar issue, out-of-order execution, SIMD, and speculation as realistic models of modern cores
memory-mapped I/O, interrupts, DMA, and how devices talk to the CPU without burning cycles
virtual memory as an address-translation layer backed by page tables and the TLB
the hardware-software contract: what the ISA promises, what the microarchitecture delivers, and where performance actually comes from

What it deliberately does not try to finish here:

full OS-level virtual memory, scheduling, or driver work (that is Module 4 and later semesters)
compiler backend internals beyond recognizing generated code
digital logic design at the gate level (covered lightly; follow MIT 6.004 if you want depth)
GPU architecture and distributed memory consistency models

This is an in-depth foundation module. If you can explain the C, but not how the C runs, you are not done.

Before You Start

Answer these closed-book before starting the main path:

What does "an instruction is fetched, decoded, and executed" actually mean in terms of memory accesses?
Why does a struct of two ints fit nicely in a register pair on many machines, but a 64-byte object never does?
If a cache line is 64 bytes and an int is 4 bytes, how many integers share a line, and why does that matter for performance?
What is the difference between a branch being "taken" and a branch being "correctly predicted"?
Why does virtual memory mean two processes can both use address 0x400000 without colliding?

Diagnostic Interpretation

4-5 solid answers

You are ready for the full path.

2-3 solid answers

Continue, but expect extra time in Clusters 3 and 4 (memory hierarchy, pipelining).

0-1 solid answers

Revisit Module 2: Memory, Pointers & Machine Representation first. This module assumes you can already read a pointer and explain stack layout.

What This Module Is For

Architecture is the language performance speaks. Later systems work repeatedly asks questions like:

why is this loop 5x slower than that loop when they look the same?
why does std::vector<T> beat std::list<T> even when asymptotic complexity favours the list?
what is the cost of a function call, a cache miss, a branch mispredict, a page fault?
when is it worth writing SIMD by hand or using __builtin_prefetch?
how do you read a flame graph, a perf report, or an objdump listing and decide what to change?

This module builds the performance-reasoning vocabulary needed for:

writing cache-aware data structures in C and C++
debugging performance regressions with perf, vtune, or samply
understanding operating systems, compilers, and linkers without treating them as magic
making informed trade-offs between clarity, portability, and speed

You are learning to predict how code runs on real hardware, not just what it computes.

Concept Map

How To Use This Module

Work in order. Clusters 3-5 only make sense after you can read an instruction and trace it through a register file.

Cluster 1: The ISA and the Stored-Program Machine

Order	Concept	Type	Focus
1	Instruction Set Architectures, RISC vs CISC, and the Fetch-Decode-Execute Cycle	PRIMARY	What an ISA is, what it promises, and how the CPU's main loop actually runs
2	Registers, the Program Counter, and the Stack Pointer	PRIMARY	The small, fast, named storage a program actually operates on
3	From C to Assembly: Reading Disassembly and Recognizing Patterns	PRIMARY	Mapping source constructs to the machine code a compiler generates

Cluster mastery check: Given a short C function and its disassembly, can you identify the prologue, body, and epilogue and explain every register use?

Cluster 2: Arithmetic, Logic, and Control

Order	Concept	Type	Focus
4	The ALU and How Numbers Are Added and Multiplied	PRIMARY	Two's complement, addition, multiplication, and flag-setting
5	Control Flow at the Machine Level: Branches, Calls, and Returns	PRIMARY	Compare, jump, call, return, and how the stack keeps track
6	Floating-Point Operations and Hardware Support	SUPPORTING	IEEE-754 basics, where FP hardware lives, and why doubles cost what they cost

Cluster mastery check: Can you predict the flags set by sub %rbx, %rax and explain how the following je turns into a taken or not-taken branch?

Cluster 3: Memory Hierarchy and Cache

Order	Concept	Type	Focus
7	The Memory Hierarchy: Registers, Cache, DRAM, and Disk	PRIMARY	Latencies, sizes, and why every level exists
8	Cache Organization: Lines, Sets, Associativity, and Replacement	PRIMARY	How an address turns into a set index, a tag, and a hit or miss
9	Cache-Aware Programming: Locality and Memory Access Patterns	PRIMARY	Row-major vs column-major, blocking, and measuring the difference

Cluster mastery check: Can you explain why sum += A[i][j] with j in the inner loop can run ~10x faster than the same code with i in the inner loop, using a cache diagram?

Cluster 4: Pipelining and Parallel Execution

Order	Concept	Type	Focus
10	The Classic Five-Stage Pipeline	PRIMARY	IF/ID/EX/MEM/WB as overlapping work
11	Hazards: Data, Control, Structural -- and How to Resolve Them	PRIMARY	Why pipelines stall and the tricks that hide the cost
12	Superscalar, Out-of-Order, SIMD, and Speculation	SUPPORTING	How real cores go beyond one-instruction-per-cycle

Cluster mastery check: Given a short instruction stream with a load-use dependency, can you identify where a stall is needed and explain how forwarding or reordering removes it?

Cluster 5: I/O and Virtual Memory Basics

Order	Concept	Type	Focus
13	Memory-Mapped I/O, Interrupts, and DMA	SUPPORTING	How the CPU talks to devices without polling forever
14	Virtual Memory: Pages, Page Tables, and the TLB	PRIMARY	The address-translation layer under every user process
15	How the Hardware-Software Contract Shapes Performance	PRIMARY	Amdahl, CPI, the roofline model, and why architecture decides what is fast

Cluster mastery check: Can you explain, for a specific workload, which level of the stack (ISA, microarchitecture, cache, VM, or software) is the current bottleneck, and how you would confirm it?

Then work these practice pages:

Order	Practice path	Focus
1	ISA and Disassembly Lab	Reading x86_64 and RISC-V assembly from real compilers
2	Memory Hierarchy and Cache Workshop	Measuring locality effects and interpreting cache counters
3	Pipelining and Optimization Clinic	Hazards, forwarding, and simple microbenchmark reasoning
4	Code Katas	Compiler Explorer drills, matrix traversal benchmarks, and disassembly sprints

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.

Learning Objectives

By the end of this module you should be able to:

Explain what an ISA is, distinguish RISC from CISC, and walk through one fetch-decode-execute cycle at the register-transfer level.
Identify the role of the program counter, stack pointer, and general-purpose registers in a running C program.
Read disassembly from gcc -S or Compiler Explorer and match it back to source constructs.
Describe the ALU, two's complement arithmetic, and how multiplication is typically implemented.
Trace control flow through compare, branch, call, and return instructions, including the use of the stack.
Explain IEEE-754 floating-point and estimate the cost of FP versus integer work.
Rank storage tiers by latency and size and predict where a working set lives.
Decode an address into tag/set/offset for a given cache geometry and compute hit/miss for an access sequence.
Measure and explain row-major vs column-major matrix traversal performance, and apply blocking for a larger working set.
Diagram the five-stage pipeline and identify data, control, and structural hazards.
Explain superscalar issue, out-of-order execution, SIMD, and speculation in terms of overlapping useful work.
Describe memory-mapped I/O, interrupts, and DMA at the level of "what the CPU does when a device becomes ready."
Walk through a virtual-address translation, including page-table levels and TLB hits/misses.
Use Amdahl's law, CPI, and the roofline model to diagnose whether a program is compute-, memory-, or latency-bound.

Outputs

a disassembly notebook with at least 10 annotated functions (C source next to generated x86_64 or RISC-V)
one "cache experiment" lab report with measured timings for row-major vs column-major traversal across three matrix sizes
one blocked matrix multiply implementation with a before/after microbenchmark
one hazard-analysis sheet showing at least 6 short instruction streams with stall/forward reasoning
one simple perf stat or equivalent run on a workload with cycles, instructions, cache-misses, and branch-mispredicts interpreted
one page translation diagram for a specific virtual address on a two-level page table
one mistake log naming at least 10 reasoning errors such as wrong cache geometry, confused taken with predicted, ignored the TLB, or compared wall time without warm-up
one short memo connecting Module 3 tools to Module 4 (systems programming) and Semester 5 (operating systems)

Completion Standard

You have completed Module 3 when all of these are true:

you can read a short disassembly listing and narrate every instruction
you can predict, before measuring, whether a loop will be compute-bound or memory-bound
you can explain why a cache miss costs hundreds of cycles and a branch mispredict costs tens
you can describe virtual memory without collapsing pages and page tables into one thing
you can write a short benchmark that produces believable numbers and explain the warm-up

If the code works but you cannot explain what the hardware did to make it work, the module is not complete.

Reading Policy

Concept pages are the main path.
Local book chunks are selective reinforcement, not a second syllabus.
Read only if stuck means try the concept page, self-check, and drill first.
Optional deep dive means additional nuance or exercise volume, not required progression.
Because this module is where performance becomes measurable, you should actually run the benchmarks you are asked to write; theory without numbers is not enough here.

Suggested Weekly Flow

Day	Work
1	Concepts 1-3 and a Compiler Explorer session where you disassemble five small functions
2	Concepts 4-6 and one sheet of compare/branch/call tracing by hand
3	Concepts 7-9 and the row-major vs column-major benchmark
4	Concepts 10-12 and one hazard-analysis sheet
5	Concepts 13-15 and one `perf stat` run on a real program
6	Practice pages 1-2 and targeted local-book reinforcement
7	Practice pages 3-4, quiz, and mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.

Build Your Own X - elective

Use three levels depending on depth: Toy Computer if you want to design the ISA and datapath yourself, Emulator if you want compatibility with an existing machine, and Source Code to Machine Code if you want to see how a compiler targets the machine. See Build Your Own X overview.

Rich Learning Pages

Scope of This Module​

Before You Start​

Diagnostic Interpretation​

What This Module Is For​

Concept Map​

How To Use This Module​

Cluster 1: The ISA and the Stored-Program Machine​

Cluster 2: Arithmetic, Logic, and Control​

Cluster 3: Memory Hierarchy and Cache​

Cluster 4: Pipelining and Parallel Execution​

Cluster 5: I/O and Virtual Memory Basics​

Learning Objectives​

Outputs​

Completion Standard​

Reading Policy​

Suggested Weekly Flow​

Reference​

Rich Learning Pages​