Module 3: Computer Organization & Architecture: Case Studies

These case studies connect C code to instructions, registers, cache lines, branch predictors, SIMD, and memory hierarchy behavior.


Case Study 1: Same Big-O, Different Cache Behavior

Scenario: Two matrix traversal loops both visit every element. On a row-major array, row-major traversal is typically much faster than column-major traversal once the matrix no longer fits in cache.

Source anchor: Ulrich Drepper's "What Every Programmer Should Know About Memory" explains cache locality and memory hierarchy effects that make access order visible in runtime.

Module concepts: cache line, locality, row-major order, memory hierarchy.

Wrong Approach

"Same O(n), same performance."

Better Approach

Walk memory in layout order:

for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
        sum += a[i][j];
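
A minimal sketch of the two traversals side by side, assuming the matrix is stored as one flat row-major array of int. The function names and the flat-array layout are illustrative assumptions, not part of the case study's own code.

```c
#include <stddef.h>

/* Row-major traversal: the inner loop walks consecutive addresses,
   so each fetched cache line is fully used before moving on. */
long sum_row_major(const int *a, size_t rows, size_t cols) {
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Column-major traversal of the same row-major array: the inner loop
   jumps cols elements per step, so neighbors in time are far apart
   in memory. */
long sum_col_major(const int *a, size_t rows, size_t cols) {
    long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}
```

Both functions return the same sum and do the same number of additions; only the order of memory accesses differs, which is why any timing gap between them points at the memory hierarchy rather than the algorithm.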

Tradeoff Table

Choice | Gain | Cost
Ignore layout and use any traversal | Simple reasoning from algorithm shape | Can waste cache bandwidth badly
Match row-major layout | Better locality and throughput | Requires awareness of representation
Block/tile traversal | Even stronger cache reuse | More code and tuning effort
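
The block/tile row can be illustrated with a matrix transpose, a case where neither plain loop order suits both arrays at once. This is a sketch under assumptions: TILE = 16 and the name transpose_tiled are hypothetical, and a real tile size would need tuning against the target cache.

```c
#include <stddef.h>

#define TILE 16  /* hypothetical tile edge; tune for the target cache */

/* Transpose an n x n row-major matrix tile by tile, so the lines
   touched in src and dst within one tile can stay cached together. */
void transpose_tiled(const float *src, float *dst, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            /* The "&& i < n" guards handle n not divisible by TILE. */
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The extra two loop levels are the "more code and tuning effort" cost named in the table.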

Failure Mode

A loop that looks equivalent in asymptotic analysis runs far slower because consecutive accesses land on different cache lines, so most of each fetched line goes unused before it is evicted.

Required Artifact

Draw cache lines for row-major and column-major traversal and benchmark both.
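
Before drawing, it can help to compute which cache line each access lands on. The sketch below is a simplified model, not a simulator: it assumes the array starts on a line boundary, ignores cache capacity and prefetching, and the function name and parameters are made up for illustration.

```c
#include <stddef.h>

/* Count how often consecutive accesses move to a different cache line
   while walking a rows x cols row-major array.
   by_rows != 0 walks in row-major order, by_rows == 0 in column-major. */
size_t line_switches(size_t rows, size_t cols, size_t elem_size,
                     size_t line_size, int by_rows) {
    size_t switches = 0;
    size_t prev_line = (size_t)-1;  /* sentinel: no line touched yet */
    size_t outer = by_rows ? rows : cols;
    size_t inner = by_rows ? cols : rows;

    for (size_t p = 0; p < outer; p++) {
        for (size_t q = 0; q < inner; q++) {
            size_t i = by_rows ? p : q;  /* row index */
            size_t j = by_rows ? q : p;  /* column index */
            size_t line = (i * cols + j) * elem_size / line_size;
            if (line != prev_line) {
                switches++;
                prev_line = line;
            }
        }
    }
    return switches;
}
```

For a 4 x 16 array of 4-byte elements with an assumed 64-byte line, each row is exactly one line: the row-major walk changes line 4 times, while the column-major walk changes line on all 64 accesses.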

Project / Capstone Connection

Use this reasoning when optimizing image, matrix, or buffer-heavy code in later performance work.


Case Study 2: Compiler Explorer Reveals A Branch

Scenario: A tight loop with an unpredictable if runs slower than a branchless version.

Source anchor: Compiler Explorer exposes the generated assembly so learners can inspect whether the compiler emitted a branch, conditional move, or another form.

Module concepts: assembly, branch, prediction, generated code.

Wrong Approach

Guess from C source alone.

Better Approach

Inspect assembly:

C branch:
compare + conditional jump

branchless version:
conditional move or arithmetic mask
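
One possible pair of snippets to paste into Compiler Explorer, assuming a loop that sums only the elements at or above a threshold. The function names are hypothetical, and which instructions appear depends on the compiler, flags, and target: an optimizer may well emit the same code for both versions, which is exactly what the inspection is meant to reveal.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy source form: the if may become compare + conditional jump. */
int64_t sum_ge_branch(const int32_t *v, size_t n, int32_t t) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= t)
            s += v[i];
    }
    return s;
}

/* Arithmetic-mask source form: the comparison yields 0 or 1, and
   negating it gives 0 or all one bits, so the AND keeps or drops v[i]. */
int64_t sum_ge_mask(const int32_t *v, size_t n, int32_t t) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t mask = -(int32_t)(v[i] >= t);
        s += v[i] & mask;
    }
    return s;
}
```

Both functions compute the same result, so any difference in the emitted assembly is purely about how the condition is lowered.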

Tradeoff Table

Choice | Gain | Cost
Reason only from source | Fastest first pass | Hides actual machine behavior
Inspect emitted assembly | Grounded evidence | Requires ABI and instruction literacy
Force branchless code everywhere | May help hot unpredictable paths | Can hurt readability or other workloads

Failure Mode

An "obvious" micro-optimization changes source shape but not the generated branch pattern, so performance assumptions stay wrong.

Required Artifact

Paste two C snippets into Compiler Explorer and annotate the branch instruction.

Project / Capstone Connection

Use this workflow whenever you claim a hot-path optimization in systems benchmarks or writeups.


Case Study 3: False Sharing In Counters

Scenario: Four threads update separate counters in the same cache line. Performance collapses because cache lines bounce between cores.

Source anchor: Drepper's memory paper and cache coherence concepts explain why independent variables can still interfere when they share a cache line.

Module concepts: cache line, coherence, false sharing, padding.

Wrong Approach

"Different variables cannot contend."

Better Approach

Separate hot counters by cache line:

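#include <stdalign.h>  /* provides alignas before C23 */

struct Counter {
    alignas(64) long value;  /* 64 = assumed cache-line size in bytes */
};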
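
To make the before/after layout concrete, here is a sketch that contrasts four tightly packed counters with four padded ones. It assumes a 64-byte cache line, which is common but not universal; CACHE_LINE and the struct names are illustrative.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; check the target CPU */

/* Packed: four counters sit next to each other and can share one line,
   so writes from different cores keep invalidating each other. */
struct packed_counters {
    long value[4];
};

/* Padded: alignas forces each counter onto its own line. The struct's
   size is rounded up to its alignment, so array elements do not overlap
   a line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

struct padded_counters {
    struct padded_counter slot[4];
};
```

The padded version spends CACHE_LINE bytes per counter instead of sizeof(long), which is the space cost this case study trades for independent cache lines.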

Tradeoff Table

Choice | Gain | Cost
Pack counters tightly | Less memory use | Coherence traffic can dominate runtime
Pad to cache-line size | Removes false sharing | Wastes space
Aggregate locally then merge | Limits contention further | Adds merge logic and latency
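
The "aggregate locally then merge" row can be sketched as a worker body that counts in a local variable and writes to its shared slot once. The function name and parameters are hypothetical, and the sketch assumes each slot is written by exactly one thread, with totals read only after the workers have been joined.

```c
#include <stddef.h>

/* Count occurrences of wanted in events. The running count lives in a
   local (likely a register), so the shared slot is written only once
   at the end instead of once per event. */
void tally_then_merge(long *shared_slot, const int *events,
                      size_t n, int wanted) {
    long local = 0;
    for (size_t i = 0; i < n; i++)
        local += (events[i] == wanted);
    *shared_slot += local;  /* single write to the shared cache line */
}
```

The merge step delays visibility of the count until the worker finishes its batch, which is the latency cost listed in the table.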

Failure Mode

Each thread updates its own field, but cache coherence invalidations make throughput collapse under multicore load.

Required Artifact

Draw the cache line before/after padding and write a benchmark plan.

Project / Capstone Connection

Apply this when designing per-thread metrics, queues, or worker-state structures in concurrent code.


Case Study 4: Function Call ABI Misread

Scenario: A learner writes inline assembly or reads disassembly and cannot explain where arguments and return values live.

Source anchor: ABI and calling-convention documents are platform-specific; Compiler Explorer and disassembly make the active convention visible on the target toolchain.

Module concepts: register file, stack pointer, calling convention, return address.

Wrong Approach

Assume function calls are abstract jumps with no machine contract.

Better Approach

Trace:

arguments:
registers/stack by ABI

call:
pushes or records return address

return:
value in return register

Tradeoff Table

Choice | Gain | Cost
Ignore ABI details | Less initial complexity | Hard to read disassembly or debug low-level issues
Learn active calling convention | Better debugging and interop | Platform-specific material to absorb
Inline assembly without ABI care | Quick experiments | Easy register clobber and stack bugs

Failure Mode

Inline assembly or FFI code appears correct in source but corrupts arguments, return values, or caller state because ABI rules were guessed.

Required Artifact

Annotate disassembly for a function with six integer arguments and one return value.
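
A possible starting function for this artifact. The name sum6 and the weights are made up; the weights only exist so each argument shows up as a distinct operand in the disassembly. The register names in the comments assume the x86-64 System V ABI (Linux, macOS); other ABIs differ, for example the Microsoft x64 convention passes only the first four integer arguments in registers.

```c
/* Under the x86-64 System V ABI (an assumption about the target):
     a -> rdi, b -> rsi, c -> rdx, d -> rcx, e -> r8, f -> r9
   and the long return value comes back in rax.
   Verify against the actual disassembly for your toolchain. */
long sum6(long a, long b, long c, long d, long e, long f) {
    return a + 2 * b + 3 * c + 4 * d + 5 * e + 6 * f;
}
```

Adding a seventh integer parameter is a quick follow-up experiment: on this ABI it no longer fits in a register and is passed on the stack.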

Project / Capstone Connection

Use this foundation for debugger sessions, syscall wrappers, and any low-level interop in later modules.


Case Study 5: SIMD Opportunity Hidden In Scalar Loop

Scenario: A loop sums arrays element-by-element. The compiler can vectorize only after aliasing and alignment assumptions are clarified.

Source anchor: Compiler diagnostics and Compiler Explorer reveal vectorization decisions. See GCC optimization options.

Module concepts: SIMD, aliasing, alignment, compiler optimization.

Wrong Approach

"The compiler always optimizes obvious loops."

Better Approach

Make assumptions explicit:

#include <stddef.h>  /* size_t */

void add(size_t n, float *restrict out,
         const float *restrict a,
         const float *restrict b);
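
A minimal definition matching that declaration, as a sketch. The restrict qualifiers are a promise by the caller that out, a, and b do not overlap; breaking that promise is undefined behavior.

```c
#include <stddef.h>

/* Element-wise sum. With restrict, the compiler may assume that a
   store to out[i] never changes a later a[j] or b[j], which removes
   one common obstacle to vectorizing this loop. */
void add(size_t n, float *restrict out,
         const float *restrict a,
         const float *restrict b) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

To see what the compiler decided, GCC can report vectorization with -O3 -fopt-info-vec-optimized -fopt-info-vec-missed, and Clang with -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize.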

Tradeoff Table

Choice | Gain | Cost
Leave aliasing ambiguous | Minimal API claims | Blocks vectorization opportunities
Add restrict and alignment facts | Enables stronger optimization | Incorrect promises create undefined behavior
Hand-write SIMD | Maximum control | Larger maintenance and portability burden

Failure Mode

The compiler cannot prove that the pointers do not alias, so it either keeps the hot numeric loop scalar or guards the vector path with a runtime overlap check, despite suitable hardware.

Required Artifact

Compare assembly/vectorization report before and after restrict or alignment changes.
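
For the alignment half of the comparison, one way to state alignment facts is sketched below. It assumes GCC or Clang, since __builtin_assume_aligned is a compiler extension, and a 32-byte alignment chosen to match a 256-bit vector register. VEC_ALIGN and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 32  /* assumed: width of one 256-bit vector in bytes */

/* Allocate n floats on a VEC_ALIGN boundary. aligned_alloc (C11) wants
   the size to be a multiple of the alignment, so round up.
   Release the result with free(). */
float *alloc_aligned_floats(size_t n) {
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, rounded ? rounded : VEC_ALIGN);
}

/* Helper to check the promise before making it. */
int is_vec_aligned(const void *p) {
    return (uintptr_t)p % VEC_ALIGN == 0;
}

/* Multiply each element by k. The builtin tells the optimizer that both
   pointers are VEC_ALIGN-aligned; passing a misaligned pointer here is
   undefined behavior. */
void scale(size_t n, float *restrict out, const float *restrict in, float k) {
    float *o = __builtin_assume_aligned(out, VEC_ALIGN);
    const float *p = __builtin_assume_aligned(in, VEC_ALIGN);
    for (size_t i = 0; i < n; i++)
        o[i] = p[i] * k;
}
```

Whether the alignment promise changes the emitted instructions depends on the target and flags, so the before/after assembly is the evidence, not the source.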

Project / Capstone Connection

Use this evidence pattern when you justify performance claims for numeric or media-processing kernels.


Source Map

Source | Use it for
What Every Programmer Should Know About Memory | cache and memory hierarchy
Compiler Explorer | assembly inspection
GCC optimization options | optimization/vectorization evidence

Completion Standard

  • At least three artifacts are completed.
  • At least one artifact includes disassembly.
  • At least one artifact explains cache-line behavior.