Skip to main content

Module 3: Computer Organization & Architecture

Primary text: Computer Organization and Design (Patterson & Hennessy) Selective support: CODE for conceptual foundations; The C Programming Language as a cross-reference for recognizing compiled patterns; Agner Fog's microarchitecture manuals, Drepper's What Every Programmer Should Know About Memory, and Compiler Explorer for real hardware behavior

This guide is the primary teacher. You do not need to read Computer Organization and Design front-to-back to complete this module. You do need to become operationally strong at reading assembly, reasoning about the memory hierarchy, and predicting performance from a piece of code you can actually see.


Scope of This Module

This module is not a hardware tour. It is where code starts producing measurable behavior on a real machine.

What it covers in depth:

  • instruction set architectures, RISC/CISC trade-offs, and the fetch-decode-execute cycle as an actual clocked loop
  • the register file, program counter, stack pointer, and the calling convention they impose on C code
  • reading disassembly from objdump or Compiler Explorer and recognizing compiled patterns for loops, branches, and function calls
  • the ALU, two's complement arithmetic, multiplication, and the shape of a simple datapath
  • control flow at the machine level: branches, direct and indirect calls, returns, and how ret finds its target
  • floating-point representation and the price of doubles versus ints
  • the memory hierarchy from registers down to disk, in concrete latency units
  • cache organization (lines, sets, associativity, replacement) and how a miss reshapes runtime
  • writing code that the cache can serve: row-major traversal, blocking, alignment, and why "same algorithm, same O()" can differ by 10x
  • the classic five-stage pipeline, hazards, and the hardware techniques that hide them
  • superscalar issue, out-of-order execution, SIMD, and speculation as realistic models of modern cores
  • memory-mapped I/O, interrupts, DMA, and how devices talk to the CPU without burning cycles
  • virtual memory as an address-translation layer backed by page tables and the TLB
  • the hardware-software contract: what the ISA promises, what the microarchitecture delivers, and where performance actually comes from

What it deliberately does not try to finish here:

  • full OS-level virtual memory, scheduling, or driver work (that is Module 4 and later semesters)
  • compiler backend internals beyond recognizing generated code
  • digital logic design at the gate level (covered lightly; follow MIT 6.004 if you want depth)
  • GPU architecture and distributed memory consistency models

This is an in-depth foundation module. If you can explain the C, but not how the C runs, you are not done.


Before You Start

Answer these closed-book before starting the main path:

  1. What does "an instruction is fetched, decoded, and executed" actually mean in terms of memory accesses?
  2. Why does a struct of two ints fit nicely in a register pair on many machines, but a 64-byte object never does?
  3. If a cache line is 64 bytes and an int is 4 bytes, how many integers share a line, and why does that matter for performance?
  4. What is the difference between a branch being "taken" and a branch being "correctly predicted"?
  5. Why does virtual memory mean two processes can both use address 0x400000 without colliding?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path.

2-3 solid answers

  • Continue, but expect extra time in Clusters 3 and 4 (memory hierarchy, pipelining).

0-1 solid answers


What This Module Is For

Architecture is the language performance speaks. Later systems work repeatedly asks questions like:

  • why is this loop 5x slower than that loop when they look the same?
  • why does std::vector<T> beat std::list<T> even when asymptotic complexity favours the list?
  • what is the cost of a function call, a cache miss, a branch mispredict, a page fault?
  • when is it worth writing SIMD by hand or using __builtin_prefetch?
  • how do you read a flame graph, a perf report, or an objdump listing and decide what to change?

This module builds the performance-reasoning vocabulary needed for:

  • writing cache-aware data structures in C and C++
  • debugging performance regressions with perf, vtune, or samply
  • understanding operating systems, compilers, and linkers without treating them as magic
  • making informed trade-offs between clarity, portability, and speed

You are learning to predict how code runs on real hardware, not just what it computes.


Concept Map


How To Use This Module

Work in order. Clusters 3-5 only make sense after you can read an instruction and trace it through a register file.

Cluster 1: The ISA and the Stored-Program Machine

OrderConceptTypeFocus
1Instruction Set Architectures, RISC vs CISC, and the Fetch-Decode-Execute CyclePRIMARYWhat an ISA is, what it promises, and how the CPU's main loop actually runs
2Registers, the Program Counter, and the Stack PointerPRIMARYThe small, fast, named storage a program actually operates on
3From C to Assembly: Reading Disassembly and Recognizing PatternsPRIMARYMapping source constructs to the machine code a compiler generates

Cluster mastery check: Given a short C function and its disassembly, can you identify the prologue, body, and epilogue and explain every register use?

Cluster 2: Arithmetic, Logic, and Control

OrderConceptTypeFocus
4The ALU and How Numbers Are Added and MultipliedPRIMARYTwo's complement, addition, multiplication, and flag-setting
5Control Flow at the Machine Level: Branches, Calls, and ReturnsPRIMARYCompare, jump, call, return, and how the stack keeps track
6Floating-Point Operations and Hardware SupportSUPPORTINGIEEE-754 basics, where FP hardware lives, and why doubles cost what they cost

Cluster mastery check: Can you predict the flags set by sub %rbx, %rax and explain how the following je turns into a taken or not-taken branch?

Cluster 3: Memory Hierarchy and Cache

OrderConceptTypeFocus
7The Memory Hierarchy: Registers, Cache, DRAM, and DiskPRIMARYLatencies, sizes, and why every level exists
8Cache Organization: Lines, Sets, Associativity, and ReplacementPRIMARYHow an address turns into a set index, a tag, and a hit or miss
9Cache-Aware Programming: Locality and Memory Access PatternsPRIMARYRow-major vs column-major, blocking, and measuring the difference

Cluster mastery check: Can you explain why sum += A[i][j] with j in the inner loop can run ~10x faster than the same code with i in the inner loop, using a cache diagram?

Cluster 4: Pipelining and Parallel Execution

OrderConceptTypeFocus
10The Classic Five-Stage PipelinePRIMARYIF/ID/EX/MEM/WB as overlapping work
11Hazards: Data, Control, Structural -- and How to Resolve ThemPRIMARYWhy pipelines stall and the tricks that hide the cost
12Superscalar, Out-of-Order, SIMD, and SpeculationSUPPORTINGHow real cores go beyond one-instruction-per-cycle

Cluster mastery check: Given a short instruction stream with a load-use dependency, can you identify where a stall is needed and explain how forwarding or reordering removes it?

Cluster 5: I/O and Virtual Memory Basics

OrderConceptTypeFocus
13Memory-Mapped I/O, Interrupts, and DMASUPPORTINGHow the CPU talks to devices without polling forever
14Virtual Memory: Pages, Page Tables, and the TLBPRIMARYThe address-translation layer under every user process
15How the Hardware-Software Contract Shapes PerformancePRIMARYAmdahl, CPI, the roofline model, and why architecture decides what is fast

Cluster mastery check: Can you explain, for a specific workload, which level of the stack (ISA, microarchitecture, cache, VM, or software) is the current bottleneck, and how you would confirm it?

Then work these practice pages:

OrderPractice pathFocus
1ISA and Disassembly LabReading x86_64 and RISC-V assembly from real compilers
2Memory Hierarchy and Cache WorkshopMeasuring locality effects and interpreting cache counters
3Pipelining and Optimization ClinicHazards, forwarding, and simple microbenchmark reasoning
4Code KatasCompiler Explorer drills, matrix traversal benchmarks, and disassembly sprints

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Explain what an ISA is, distinguish RISC from CISC, and walk through one fetch-decode-execute cycle at the register-transfer level.
  2. Identify the role of the program counter, stack pointer, and general-purpose registers in a running C program.
  3. Read disassembly from gcc -S or Compiler Explorer and match it back to source constructs.
  4. Describe the ALU, two's complement arithmetic, and how multiplication is typically implemented.
  5. Trace control flow through compare, branch, call, and return instructions, including the use of the stack.
  6. Explain IEEE-754 floating-point and estimate the cost of FP versus integer work.
  7. Rank storage tiers by latency and size and predict where a working set lives.
  8. Decode an address into tag/set/offset for a given cache geometry and compute hit/miss for an access sequence.
  9. Measure and explain row-major vs column-major matrix traversal performance, and apply blocking for a larger working set.
  10. Diagram the five-stage pipeline and identify data, control, and structural hazards.
  11. Explain superscalar issue, out-of-order execution, SIMD, and speculation in terms of overlapping useful work.
  12. Describe memory-mapped I/O, interrupts, and DMA at the level of "what the CPU does when a device becomes ready."
  13. Walk through a virtual-address translation, including page-table levels and TLB hits/misses.
  14. Use Amdahl's law, CPI, and the roofline model to diagnose whether a program is compute-, memory-, or latency-bound.

Outputs

  • a disassembly notebook with at least 10 annotated functions (C source next to generated x86_64 or RISC-V)
  • one "cache experiment" lab report with measured timings for row-major vs column-major traversal across three matrix sizes
  • one blocked matrix multiply implementation with a before/after microbenchmark
  • one hazard-analysis sheet showing at least 6 short instruction streams with stall/forward reasoning
  • one simple perf stat or equivalent run on a workload with cycles, instructions, cache-misses, and branch-mispredicts interpreted
  • one page translation diagram for a specific virtual address on a two-level page table
  • one mistake log naming at least 10 reasoning errors such as wrong cache geometry, confused taken with predicted, ignored the TLB, or compared wall time without warm-up
  • one short memo connecting Module 3 tools to Module 4 (systems programming) and Semester 5 (operating systems)

Completion Standard

You have completed Module 3 when all of these are true:

  • you can read a short disassembly listing and narrate every instruction
  • you can predict, before measuring, whether a loop will be compute-bound or memory-bound
  • you can explain why a cache miss costs hundreds of cycles and a branch mispredict costs tens
  • you can describe virtual memory without collapsing pages and page tables into one thing
  • you can write a short benchmark that produces believable numbers and explain the warm-up

If the code works but you cannot explain what the hardware did to make it work, the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • Local book chunks are selective reinforcement, not a second syllabus.
  • Read only if stuck means try the concept page, self-check, and drill first.
  • Optional deep dive means additional nuance or exercise volume, not required progression.
  • Because this module is where performance becomes measurable, you should actually run the benchmarks you are asked to write; theory without numbers is not enough here.

Suggested Weekly Flow

DayWork
1Concepts 1-3 and a Compiler Explorer session where you disassemble five small functions
2Concepts 4-6 and one sheet of compare/branch/call tracing by hand
3Concepts 7-9 and the row-major vs column-major benchmark
4Concepts 10-12 and one hazard-analysis sheet
5Concepts 13-15 and one perf stat run on a real program
6Practice pages 1-2 and targeted local-book reinforcement
7Practice pages 3-4, quiz, and mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Build Your Own X - elective

Use three levels depending on depth: Toy Computer if you want to design the ISA and datapath yourself, Emulator if you want compatibility with an existing machine, and Source Code to Machine Code if you want to see how a compiler targets the machine. See Build Your Own X overview.

Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread