Build From Source Code to Machine Code

This project is the native-code path through compiler construction. Instead of stopping at a tree-walking interpreter or bytecode VM, you lower a tiny language all the way to executable machine behavior, passing through virtual instructions, real assembly, calling conventions, stack frames, executable memory, and eventually object or binary layout.

Use this after Interpreter, or as the advanced path after Compiler.


1. Overview & motivation

Build a small compiler pipeline:

source -> parser -> AST -> semantic checks -> virtual instructions
-> assembly or machine code -> executable function/program

The language can be small: integer expressions, variables, conditionals, loops, and functions. The value is in the lowering path.


2. Where this fits in the degree

  • Phase: Systems
  • Semester: 4 (Systems Programming)
  • Modules deepened: Module 2 (machine representation), Module 3 (computer organization), Module 5 (abstraction and interpretation)

Cross-phase relevance:

  • Builds on Compiler but changes the target from bytecode to machine code.
  • Connects to Toy Computer: a target machine is a contract.
  • Helps explain how language runtimes, JITs, FFI, and executable formats work.

3. Local source backbone

Primary local source:

  • From Source Code to Machine Code (build-your-own/source-to-machine-code-james-smith)

Supporting local source:

  • Writing a C Compiler (build-your-own/c-compiler-nora-sandler)

| Local chunks | Use them for | Project milestone |
| --- | --- | --- |
| Smith 001-003 | S-expressions, starting code, variables/scopes, testing, computer basics | Parser, AST, scopes, and initial tests |
| Smith 004-008 | Data stack, virtual instructions, non-local variables, labels, functions | VM-like IR and function representation |
| Smith 009-013 | x64 jumps, flags, syscalls, ModR/M, calls, returns, stack frames | Native code generation and calling convention |
| Smith 014-017 | mmap/ctypes, file layout, entry point, pointers, strings | Executable memory, binary shape, runtime helpers |
| Sandler 002-013 | C subset parsing, AST, expression and operator codegen | Native compiler comparison path |
| Sandler 014-024 | Control flow, comparisons, condition codes | Branching and comparison test suite |

4. Implementation milestones

Milestone 1: Tiny language front end

Parse constants, arithmetic, variables, if, while, and function definitions.

Evidence: AST snapshot tests for valid and invalid programs.
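
As a starting point for the front end, here is a minimal sketch of an S-expression reader, assuming the surface syntax follows the S-expression style of the Smith chunks. The function names (`tokenize`, `parse`, `_parse_at`) and the nested-list AST shape are illustrative choices, not part of either source.

```python
def tokenize(src):
    """Split source text into parentheses and atoms."""
    return src.replace("(", " ( ").replace(")", " ) ").split()


def _parse_at(tokens, pos):
    """Parse one expression starting at tokens[pos]; return (node, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        items = []
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            item, pos = _parse_at(tokens, pos)
            items.append(item)
        if pos >= len(tokens):
            raise SyntaxError("missing ')'")
        return items, pos + 1  # skip the closing parenthesis
    if tok == ")":
        raise SyntaxError(f"unexpected ')' at token index {pos}")
    try:
        return int(tok), pos + 1  # integer literal
    except ValueError:
        return tok, pos + 1  # symbol or operator name


def parse(src):
    """Parse exactly one expression; reject trailing tokens."""
    tokens = tokenize(src)
    node, pos = _parse_at(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at token index {pos}")
    return node
```

An AST snapshot test can then compare `parse("(* (+ 1 2) 3)")` against the stored value `["*", ["+", 1, 2], 3]`, and an invalid-program test can assert that `parse("(+ 1")` raises `SyntaxError`. A real front end would report source spans rather than token indices.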

Milestone 2: Virtual instruction IR

Lower AST into a simple stack or three-address instruction list.

Evidence: IR dump for each sample program.
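
One possible shape for this lowering step, sketched under the assumption that the AST is a nested list such as `["*", ["+", 1, 2], 3]` and that the IR is three-address code with numbered temporaries. The tuple encoding and the names `lower`, `dump`, and `OPCODES` are hypothetical.

```python
OPCODES = {"+": "add", "-": "sub", "*": "mul"}


def lower(ast):
    """Lower an arithmetic expression AST to a list of three-address tuples."""
    code = []
    count = 0

    def new_temp():
        nonlocal count
        count += 1
        return f"t{count}"

    def emit(node):
        if isinstance(node, int):
            dst = new_temp()
            code.append((dst, "const", node))
            return dst
        if not (isinstance(node, list) and len(node) == 3 and node[0] in OPCODES):
            raise ValueError(f"unsupported expression: {node!r}")
        op, lhs, rhs = node
        a = emit(lhs)  # operands are lowered before the operation itself
        b = emit(rhs)
        dst = new_temp()
        code.append((dst, OPCODES[op], a, b))
        return dst

    result = emit(ast)
    code.append(("ret", result))
    return code


def dump(code):
    """Render IR tuples in a line-oriented text form suitable for golden files."""
    lines = []
    for instr in code:
        if instr[0] == "ret":
            lines.append(f"ret {instr[1]}")
        else:
            dst, op, *args = instr
            lines.append(f"{dst} = {op} " + ", ".join(str(a) for a in args))
    return "\n".join(lines)


print(dump(lower(["*", ["+", 1, 2], 3])))
# t1 = const 1
# t2 = const 2
# t3 = add t1, t2
# t4 = const 3
# t5 = mul t3, t4
# ret t5
```

Keeping `dump` separate from `lower` means the IR dump file is produced by the same code path the tests compare against.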

Milestone 3: x64 assembly path

Lower virtual instructions to x64 assembly. Start with integer arithmetic and returns.

Evidence: compile and run return (1 + 2) * 3, and check that the result is 9.
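
A deliberately naive sketch of this lowering: every temporary gets its own stack slot and every operation goes through `rax`. It assumes a three-address IR encoded as tuples like `("t3", "add", "t1", "t2")`, GNU assembler Intel syntax, and a Linux x64 target whose entry symbol is `main`; all of these are assumptions for illustration, and `lower_to_x64` is a hypothetical name.

```python
def lower_to_x64(ir):
    """Translate three-address IR tuples into GNU-as Intel-syntax assembly text."""
    slots = {}  # temp name -> offset from rbp (negative)

    def slot(temp):
        if temp not in slots:
            slots[temp] = -8 * (len(slots) + 1)
        return f"qword ptr [rbp{slots[temp]}]"

    mnemonics = {"add": "add", "sub": "sub", "mul": "imul"}
    body = []
    for instr in ir:
        if instr[0] == "ret":
            body.append(f"    mov rax, {slot(instr[1])}")  # return value in rax
            body += ["    leave", "    ret"]
            continue
        dst, op, *args = instr
        if op == "const":
            # Assumes the constant fits in a sign-extended 32-bit immediate.
            body.append(f"    mov {slot(dst)}, {args[0]}")
        else:
            body.append(f"    mov rax, {slot(args[0])}")
            body.append(f"    {mnemonics[op]} rax, {slot(args[1])}")
            body.append(f"    mov {slot(dst)}, rax")

    # Round the frame up to a multiple of 16 so rsp stays 16-byte aligned.
    frame = (8 * len(slots) + 15) // 16 * 16
    header = [
        "    .intel_syntax noprefix",
        "    .globl main",
        "main:",
        "    push rbp",
        "    mov rbp, rsp",
        f"    sub rsp, {frame}",
    ]
    return "\n".join(header + body) + "\n"


ir = [
    ("t1", "const", 1),
    ("t2", "const", 2),
    ("t3", "add", "t1", "t2"),
    ("t4", "const", 3),
    ("t5", "mul", "t3", "t4"),
    ("ret", "t5"),
]
print(lower_to_x64(ir))
```

On a Linux x64 machine with GCC installed, saving the output as `out.s` and building it with `gcc out.s -o out` should give a program whose exit status is the returned value. The one-slot-per-temporary strategy wastes stack space, but it keeps the mapping from IR to assembly easy to inspect before any register allocation exists.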

Milestone 4: Branches and flags

Implement comparisons, conditional jumps, labels, and loops.

Evidence: generated assembly for if, while, <, <=, ==, and !=.
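
The core of this milestone is the mapping from a source comparison to a `cmp` followed by a conditional jump that is taken when the comparison is false. The sketch below assumes signed integers, Intel-syntax operands passed in as strings, and hypothetical helper names (`LabelGen`, `emit_if`, `emit_while`).

```python
# Jump taken when the signed comparison is FALSE, so control skips the guarded block.
JUMP_IF_FALSE = {"<": "jge", "<=": "jg", "==": "jne", "!=": "je"}


class LabelGen:
    """Hands out unique local labels such as .Lelse0, .Lend1."""

    def __init__(self):
        self._next = 0

    def new(self, prefix):
        name = f".L{prefix}{self._next}"
        self._next += 1
        return name


def _compare(op, lhs, rhs, target):
    if op not in JUMP_IF_FALSE:
        raise ValueError(f"unsupported comparison: {op!r}")
    return [
        f"    mov rax, {lhs}",
        f"    cmp rax, {rhs}",  # sets the flags from rax - rhs
        f"    {JUMP_IF_FALSE[op]} {target}",
    ]


def emit_if(labels, op, lhs, rhs, then_body, else_body):
    else_label = labels.new("else")
    end_label = labels.new("end")
    return [
        *_compare(op, lhs, rhs, else_label),
        *then_body,
        f"    jmp {end_label}",
        f"{else_label}:",
        *else_body,
        f"{end_label}:",
    ]


def emit_while(labels, op, lhs, rhs, body):
    loop_label = labels.new("loop")
    end_label = labels.new("end")
    return [
        f"{loop_label}:",
        *_compare(op, lhs, rhs, end_label),
        *body,
        f"    jmp {loop_label}",
        f"{end_label}:",
    ]
```

For example, `emit_if(LabelGen(), "<", "qword ptr [rbp-8]", "qword ptr [rbp-16]", then_lines, else_lines)` produces a `cmp` followed by `jge .Lelse0`. Unsigned comparisons would need `jae`/`ja` instead of `jge`/`jg`, which is one reason the signedness rules belong in the design notes.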

Milestone 5: Calls and stack frames

Implement function calls, parameters, locals, return values, and recursive calls.

Evidence: factorial or Fibonacci runs correctly; stack-frame layout is documented.
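
A stack-frame layout can be computed and documented by the compiler itself. The sketch below assumes the System V AMD64 convention (first six integer arguments in `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9`), 8-byte slots addressed relative to `rbp`, and parameters spilled to the stack on entry. The function names are hypothetical.

```python
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]  # System V AMD64 integer args


def layout_frame(params, local_names):
    """Assign each parameter and local an rbp-relative slot; return (slots, frame_size)."""
    if len(params) > len(ARG_REGS):
        raise NotImplementedError("stack-passed arguments are not handled in this sketch")
    slots = {}
    offset = 0
    for name in [*params, *local_names]:
        if name in slots:
            raise ValueError(f"duplicate name in frame: {name}")
        offset += 8
        slots[name] = -offset
    # After `push rbp`, rsp is 16-byte aligned; a frame size that is a
    # multiple of 16 keeps it aligned at every call inside the function.
    frame_size = (offset + 15) // 16 * 16
    return slots, frame_size


def prologue(params, slots, frame_size):
    """Emit the function prologue and spill register arguments to their slots."""
    lines = ["    push rbp", "    mov rbp, rsp"]
    if frame_size:
        lines.append(f"    sub rsp, {frame_size}")
    for reg, name in zip(ARG_REGS, params):
        lines.append(f"    mov qword ptr [rbp{slots[name]}], {reg}")
    return lines


slots, size = layout_frame(["n"], ["acc", "i"])
print(slots, size)  # {'n': -8, 'acc': -16, 'i': -24} 32
print("\n".join(prologue(["n"], slots, size)))
```

Printing `slots` and `size` for each function is a cheap way to produce the documented stack-frame layout this milestone asks for.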

Milestone 6: Runtime helpers

Add memory access, strings, printing, or a tiny standard library.

Evidence: program calls one runtime helper and returns safely.

Milestone 7: Executable-memory or binary output

Either execute generated bytes via a safe harness or emit assembly/object files that the system toolchain links.

Evidence: clean build script from source program to runnable artifact.
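
For the executable-memory route, a minimal harness can be sketched with Python's `mmap` and `ctypes`, the pairing named in the Smith chunks. The sketch assumes an x86-64 Unix-like host that permits read/write/execute mappings; `encode_return_const` and `run_native` are hypothetical names, and the only code generated is `mov eax, imm32` followed by `ret`.

```python
import ctypes
import mmap
import platform
import struct


def encode_return_const(value):
    """Machine code for: mov eax, imm32 ; ret (value must fit in a signed 32-bit int)."""
    return b"\xb8" + struct.pack("<i", value) + b"\xc3"


def run_native(code):
    """Copy code into an executable mapping and call it as int32_t f(void)."""
    if platform.machine() not in ("x86_64", "AMD64") or not hasattr(mmap, "PROT_EXEC"):
        raise RuntimeError("this sketch only targets x86-64 Unix-like hosts")
    buf = mmap.mmap(
        -1,
        max(len(code), 1),
        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC,
    )
    buf.write(code)
    # Take the mapping's address and wrap it in a C function pointer type.
    address = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    func = ctypes.CFUNCTYPE(ctypes.c_int32)(address)
    return func()


print(encode_return_const(9).hex())  # b809000000c3
```

Calling `run_native(encode_return_const(9))` on a suitable host is expected to return 9. Running generated bytes in-process is unsafe by nature: a wrong encoding crashes the Python process, so a safer harness would run the call in a child process and treat a crash as a test failure.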


5. Tests & evidence

| Test | Evidence |
| --- | --- |
| Front-end correctness | AST snapshots and parse errors |
| IR correctness | source and IR side-by-side |
| Assembly correctness | expected instruction patterns for small programs |
| Runtime behavior | compiled program output matches interpreter output |
| Calling convention | documented stack frame and register use |
| Crash boundaries | bad source fails at compile time, not during execution |
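
The runtime-behavior row can be automated as a differential test. The sketch below stands in for the real setup: `interpret` plays the interpreter oracle, while `compile_stack` plus `run_stack` play the compiled path. In the actual project the compiled side would run the native artifact, for instance through a subprocess; every name here is hypothetical.

```python
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def interpret(ast):
    """Oracle: evaluate the nested-list AST directly."""
    if isinstance(ast, int):
        return ast
    op, lhs, rhs = ast
    return OPS[op](interpret(lhs), interpret(rhs))


def compile_stack(ast, out=None):
    """Stand-in compiler: emit postfix stack instructions."""
    out = [] if out is None else out
    if isinstance(ast, int):
        out.append(("push", ast))
    else:
        op, lhs, rhs = ast
        compile_stack(lhs, out)
        compile_stack(rhs, out)
        out.append((op,))
    return out


def run_stack(code):
    """Stand-in for executing the compiled artifact."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b = stack.pop()  # right operand was pushed last
            a = stack.pop()
            stack.append(OPS[instr[0]](a, b))
    (result,) = stack
    return result


def differential(cases):
    """Return a list of (ast, expected, actual) for every mismatch."""
    failures = []
    for ast in cases:
        expected = interpret(ast)
        actual = run_stack(compile_stack(ast))
        if expected != actual:
            failures.append((ast, expected, actual))
    return failures


CASES = [7, ["+", 1, 2], ["*", ["+", 1, 2], 3], ["-", 10, ["*", 2, 3]]]
print(differential(CASES))  # []
```

Non-commutative operators such as `-` belong in the case list because they expose operand-order bugs that `+` and `*` hide.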

6. Compiler design contract

This project needs a stricter boundary between phases than the interpreter project. A useful structure:

frontend/    lexer, parser, AST
semantic/    symbols, scopes, type or arity checks
ir/          virtual instructions, labels, temporaries
codegen/     target-specific lowering
runtime/     printing, allocation, strings, system boundary
tests/       source fixtures, expected IR, expected output

Every phase should have a dump format. If the generated program is wrong, you need to know whether the bug is in parsing, lowering, register/stack placement, branch emission, or runtime linkage.

Target-machine decisions

Document these before codegen:

  • integer width
  • signedness rules
  • caller-saved and callee-saved registers
  • stack alignment
  • argument passing
  • return-value location
  • local-variable layout
  • how labels become addresses
  • how runtime helpers are called

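For x64, even a tiny subset must respect stack alignment around calls. Both the System V and Windows x64 ABIs expect rsp to be 16-byte aligned at each call instruction, and a misaligned call into a C or runtime helper can crash inside the callee, far from the code that caused it.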

IR requirements

The IR should be boring and explicit:

t1 = const 1
t2 = const 2
t3 = add t1, t2
br_if_zero t3, L_else
call print_int, t3
label L_else
ret t3

Avoid lowering directly from AST to assembly until the project is already working. IR gives you testable checkpoints and makes optimization possible later.
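
Because the IR is explicit, it is cheap to execute directly, which gives a checkpoint between the front end and codegen. The sketch below parses the textual form shown above and runs it. The instruction set is limited to the operations in that example, Python integers stand in for the target's fixed-width integers, and `parse_ir` and `run_ir` are hypothetical names.

```python
NO_DST = {"br_if_zero", "call", "label", "ret"}


def parse_ir(text):
    """Parse lines such as 't3 = add t1, t2' into (op, dst, args) tuples."""
    program = []
    for line in text.strip().splitlines():
        words = line.replace(",", " ").replace("=", " ").split()
        if not words:
            continue
        if words[0] in NO_DST:
            program.append((words[0], None, words[1:]))
        else:
            dst, op, *args = words
            program.append((op, dst, args))
    return program


def run_ir(program, helpers):
    """Execute the IR; helpers maps runtime helper names to Python callables."""
    labels = {args[0]: i for i, (op, _, args) in enumerate(program) if op == "label"}
    temps = {}
    pc = 0
    while pc < len(program):
        op, dst, args = program[pc]
        pc += 1
        if op == "const":
            temps[dst] = int(args[0])
        elif op == "add":
            temps[dst] = temps[args[0]] + temps[args[1]]
        elif op == "br_if_zero":
            if temps[args[0]] == 0:
                pc = labels[args[1]]
        elif op == "call":
            helpers[args[0]](*(temps[a] for a in args[1:]))
        elif op == "label":
            pass  # labels are only jump targets
        elif op == "ret":
            return temps[args[0]]
        else:
            raise ValueError(f"unknown IR op: {op}")
    raise RuntimeError("fell off the end of the program without ret")


SAMPLE = """
t1 = const 1
t2 = const 2
t3 = add t1, t2
br_if_zero t3, L_else
call print_int, t3
label L_else
ret t3
"""

printed = []
result = run_ir(parse_ir(SAMPLE), {"print_int": printed.append})
print(result, printed)  # 3 [3]
```

Passing the helpers in as a dictionary keeps the runtime boundary visible: the IR names `print_int`, and the test decides what that helper does.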


7. Required design notes

| Design note | Must answer |
| --- | --- |
| Language subset | Which expressions, statements, and functions are supported? |
| IR shape | Is it stack-based, three-address, SSA-like, or another form? |
| Calling convention | Who owns registers and stack cleanup? |
| Runtime boundary | Which features are compiled directly and which call helpers? |
| Error model | Which errors are compile-time versus runtime? |
| Portability | Is the target Linux x64, Windows x64, WASM, or a toy VM? |

8. Common failure modes

  • Skipping IR. Direct AST-to-assembly works for constants, then collapses under branches and calls.
  • No golden output tests. Generated assembly needs both text inspection and runtime behavior checks.
  • Wrong stack alignment. Calls into C/runtime helpers may crash only on some platforms.
  • Confusing lexical scope with storage location. A name maps to a symbol first, then a stack slot/register.
  • Branch labels patched too early. Emit symbolic labels first; resolve later.
  • No interpreter oracle. A simple interpreter for the same language is the best correctness reference.
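
The label advice above can be made concrete with a two-pass assembler sketch: the first pass assigns an address to every label, and the second pass encodes each jump once all targets are known. It assumes only the x86 `jmp rel32` form (opcode `E9`, displacement measured from the end of the jump instruction); the item format and the name `assemble` are hypothetical.

```python
import struct

JMP_SIZE = 5  # E9 opcode byte plus a 4-byte signed displacement


def assemble(items):
    """items: ('label', name) | ('bytes', raw_bytes) | ('jmp', label_name)."""
    # Pass 1: compute the address of every label without emitting anything.
    addr = 0
    labels = {}
    for kind, arg in items:
        if kind == "label":
            if arg in labels:
                raise ValueError(f"duplicate label: {arg}")
            labels[arg] = addr
        elif kind == "jmp":
            addr += JMP_SIZE
        else:
            addr += len(arg)

    # Pass 2: emit bytes, resolving each symbolic jump target.
    out = bytearray()
    for kind, arg in items:
        if kind == "label":
            continue
        if kind == "jmp":
            if arg not in labels:
                raise KeyError(f"unresolved label: {arg}")
            rel = labels[arg] - (len(out) + JMP_SIZE)
            out += b"\xe9" + struct.pack("<i", rel)
        else:
            out += arg
    return bytes(out)


program = [
    ("label", "top"),
    ("bytes", b"\x90"),      # nop
    ("jmp", "end"),          # forward jump: target not yet seen in pass 1 order
    ("bytes", b"\x90\x90"),
    ("jmp", "top"),          # backward jump
    ("label", "end"),
    ("bytes", b"\xc3"),      # ret
]
print(assemble(program).hex())  # 90e90700000090 90e9f3ffffffc3 without the space
```

The forward jump to `end` is the case that breaks single-pass emitters. Choosing shorter jump encodings for nearby targets makes the sizes depend on the addresses, which is why this sketch sticks to one fixed-size form.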

9. Portfolio framing

Publish this as an advanced compiler artifact: language grammar, IR reference, codegen design note, calling-convention note, sample source programs, generated assembly, and comparison against the bytecode compiler path.

Reviewer entry point: one script that compiles a tiny program, prints the IR, prints the assembly, runs it, and verifies the result.


10. Deep project spec

Project contract

Build a compiler path from a tiny source language to executable machine behavior. The minimum target is x64 assembly linked by the system toolchain. The advanced target is executable bytes or object/binary output. The project must define language subset, IR, target platform, ABI assumptions, runtime helpers, and compile-time vs runtime error boundaries.

Source-backed reading map

| Source ID | Use for | Required output |
| --- | --- | --- |
| build-your-own/source-to-machine-code-james-smith | IR, virtual instructions, x64 jumps, calls, stack frames, executable memory | IR reference and compile-run pipeline |
| build-your-own/c-compiler-nora-sandler | native compiler staging and C-like control flow | assembly fixture suite and ABI note |

Milestone map

| Milestone | Deliverable | Tests | Failure case |
| --- | --- | --- | --- |
| Front end | parser and AST | AST snapshots | syntax error with source span |
| IR | stack or three-address instructions | IR golden fixtures | invalid symbol use |
| Assembly | arithmetic and return | compile-run fixtures | wrong register/stack use |
| Branches | labels, comparisons, loops | if/while fixtures | unresolved label |
| Functions | calls, locals, returns | recursion fixture | arity/stack-frame mismatch |
| Runtime helpers | print/string/memory helper | linked helper test | ABI violation |
| Binary path | executable bytes or object output | clean build command | unsafe executable-memory note |

Test matrix

| Test type | Required examples |
| --- | --- |
| Golden | AST, IR, generated assembly |
| Differential | compare compiled program to interpreter result |
| Integration | source file to runnable artifact |
| Negative | malformed source, bad arity, unsupported feature |
| Benchmark | interpreter vs compiled arithmetic/loop workload |

Design notes required

  • language.md: supported syntax and deliberately unsupported features.
  • ir.md: instruction format, temporaries, labels, stack effect.
  • abi.md: target OS/architecture, registers, stack alignment, calls.
  • runtime.md: helpers and ownership of memory/string values.

Portfolio evidence

Publish source, IR, assembly, executable output, compile script, ABI note, and one regression test for a codegen bug.


Source

This project is based on the local James Smith from-source-code-to-machine-code chunks and the Nora Sandler C compiler chunks.