Build From Source Code to Machine Code
This project is the native-code path through compiler construction. Instead of stopping at a tree-walking interpreter or bytecode VM, you lower a tiny language into executable machine behavior: virtual instructions, real assembly, calling conventions, stack frames, executable memory, and eventually object or binary layout.
Use this after Interpreter, or as the advanced path after Compiler.
1. Overview & motivation
Build a small compiler pipeline:
source -> parser -> AST -> semantic checks -> virtual instructions
-> assembly or machine code -> executable function/program
The language can be small: integer expressions, variables, conditionals, loops, and functions. The value is in the lowering path.
2. Where this fits in the degree
- Phase: Systems
- Semester: 4 (Systems Programming)
- Modules deepened: Module 2 (machine representation), Module 3 (computer organization), Module 5 (abstraction and interpretation)
Cross-phase relevance:
- Builds on Compiler but changes the target from bytecode to machine code.
- Connects to Toy Computer: a target machine is a contract.
- Helps explain how language runtimes, JITs, FFI, and executable formats work.
3. Local source backbone
Primary local source:
- From Source Code to Machine Code (build-your-own/source-to-machine-code-james-smith)
Supporting local source:
- Writing a C Compiler (build-your-own/c-compiler-nora-sandler)
| Local chunks | Use them for | Project milestone |
|---|---|---|
| Smith 001-003 | S-expressions, starting code, variables/scopes, testing, computer basics | Parser, AST, scopes, and initial tests |
| Smith 004-008 | Data stack, virtual instructions, non-local variables, labels, functions | VM-like IR and function representation |
| Smith 009-013 | x64 jumps, flags, syscalls, ModR/M, calls, returns, stack frames | Native code generation and calling convention |
| Smith 014-017 | mmap/ctypes, file layout, entry point, pointers, strings | Executable memory, binary shape, runtime helpers |
| Sandler 002-013 | C subset parsing, AST, expression and operator codegen | Native compiler comparison path |
| Sandler 014-024 | Control flow, comparisons, condition codes | Branching and comparison test suite |
4. Implementation milestones
Milestone 1: Tiny language front end
Parse constants, arithmetic, variables, if, while, and function definitions.
Evidence: AST snapshot tests for valid and invalid programs.
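As a starting point for this milestone, here is a minimal sketch of a front end, assuming an S-expression surface syntax like the one the Smith chunks start from. The names `tokenize` and `parse` and the nested-list AST shape are illustrative choices, not taken from either source.

```python
def tokenize(src: str) -> list[str]:
    # Pad parentheses with spaces so a plain split() yields the tokens.
    return src.replace("(", " ( ").replace(")", " ) ").split()


def parse(src: str):
    """Parse one S-expression into nested Python lists, ints, and strings."""
    tokens = tokenize(src)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            pos += 1  # consume the closing ')'
            return items
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        try:
            return int(tok)   # integer constant
        except ValueError:
            return tok        # operator or variable name

    node = read()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return node
```

The nested-list result doubles as a snapshot format: dumping it with `repr` gives a stable text fixture for both valid programs and expected parse errors.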
Milestone 2: Virtual instruction IR
Lower AST into a simple stack or three-address instruction list.
Evidence: IR dump for each sample program.
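One possible shape for the three-address variant is sketched below. It assumes a nested-list AST such as `["*", ["+", 1, 2], 3]` and emits IR lines in the `tN = op a, b` style; the function name `lower` and the opcode table are hypothetical.

```python
import itertools

# Source operator -> IR opcode (illustrative subset).
OPS = {"+": "add", "-": "sub", "*": "mul"}


def lower(ast):
    """Lower a nested-list expression into a list of three-address IR lines."""
    code = []
    counter = itertools.count(1)

    def emit(node):
        if isinstance(node, int):
            t = f"t{next(counter)}"
            code.append(f"{t} = const {node}")
            return t
        op, lhs, rhs = node
        a = emit(lhs)          # operands are lowered left to right
        b = emit(rhs)
        t = f"t{next(counter)}"
        code.append(f"{t} = {OPS[op]} {a}, {b}")
        return t

    result = emit(ast)
    code.append(f"ret {result}")
    return code
```

Joining the returned lines with newlines gives the IR dump that this milestone asks for, and the same text can be stored as a golden fixture.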
Milestone 3: x64 assembly path
Lower virtual instructions to x64 assembly. Start with integer arithmetic and returns.
Evidence: compile and run return (1 + 2) * 3; the program must produce 9.
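A deliberately naive lowering for this step might look like the sketch below. It assumes a stack-style IR of tuples such as `("push", 1)` and `("add",)`, keeps every intermediate value on the hardware stack, and prints Intel-syntax text; the IR tuple format and the name `to_asm` are assumptions made for illustration.

```python
def to_asm(ir):
    """Translate stack IR tuples into x64 assembly lines (Intel syntax)."""
    binops = {
        "add": "add rax, rcx",
        "sub": "sub rax, rcx",
        "mul": "imul rax, rcx",
    }
    lines = []
    for ins in ir:
        op = ins[0]
        if op == "push":
            # push takes a sign-extended 32-bit immediate in 64-bit mode
            lines.append(f"    push {ins[1]}")
        elif op in binops:
            lines += [
                "    pop rcx",          # right operand
                "    pop rax",          # left operand
                f"    {binops[op]}",
                "    push rax",         # result back on the stack
            ]
        elif op == "ret":
            lines += ["    pop rax", "    ret"]  # return value in rax
        else:
            raise ValueError(f"unknown IR op: {op}")
    return lines
```

The entry label and assembler directives are left out because they depend on the assembler and platform you pick. Keeping everything on the stack is slow but makes the generated text easy to check against the IR one instruction at a time.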
Milestone 4: Branches and flags
Implement comparisons, conditional jumps, labels, and loops.
Evidence: generated assembly for if, while, <, <=, ==, and !=.
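The link between comparisons, flags, and jumps can be sketched as follows. The code assumes signed integers with the left operand already in `rax` and the right operand in `rcx`; `emit_if`, the label naming scheme, and the `labels` counter argument are hypothetical.

```python
import itertools

# Jump taken when the comparison is FALSE (signed integers), so the
# fall-through path is the "then" branch.
INVERTED_JUMP = {"<": "jge", "<=": "jg", "==": "jne", "!=": "je"}


def emit_if(op, then_body, else_body, labels):
    """Emit assembly lines for: if (rax op rcx) then_body else else_body."""
    n = next(labels)  # fresh number keeps labels unique per construct
    else_label, end_label = f".L_else{n}", f".L_end{n}"
    return [
        "    cmp rax, rcx",                        # sets the flags
        f"    {INVERTED_JUMP[op]} {else_label}",   # skip "then" when false
        *then_body,
        f"    jmp {end_label}",
        f"{else_label}:",
        *else_body,
        f"{end_label}:",
    ]
```

For example, `emit_if("<", ["    mov rax, 1"], ["    mov rax, 0"], itertools.count())` produces a `jge` to the else label, because `jge` is the signed negation of `<`. A `while` loop can reuse the same table with a backward `jmp` to a label placed before the comparison.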
Milestone 5: Calls and stack frames
Implement function calls, parameters, locals, return values, and recursive calls.
Evidence: factorial or Fibonacci runs correctly; stack-frame layout is documented.
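For the stack-frame documentation, a small layout calculator can generate the table instead of maintaining it by hand. This sketch assumes an `rbp`-based frame, one 8-byte slot per parameter and local, and parameters spilled from registers into the frame on entry; `frame_layout` is a hypothetical helper and names are assumed unique within a function.

```python
def frame_layout(params, local_vars, slot=8, align=16):
    """Return ({name: rbp-relative offset}, frame size in bytes)."""
    offsets = {}
    for i, name in enumerate([*params, *local_vars], start=1):
        offsets[name] = -slot * i        # slots grow downward from rbp
    raw = slot * len(offsets)
    # Round up so rsp stays 16-byte aligned after "sub rsp, frame_size".
    frame_size = (raw + align - 1) // align * align
    return offsets, frame_size
```

With a conventional `push rbp` / `mov rbp, rsp` prologue, the stack pointer is 16-byte aligned right after the push, so a frame size that is a multiple of 16 keeps later call sites aligned. For `frame_layout(["n"], ["acc", "i"])` the three slots need 24 bytes and the frame is padded to 32.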
Milestone 6: Runtime helpers
Add memory access, strings, printing, or a tiny standard library.
Evidence: program calls one runtime helper and returns safely.
Milestone 7: Executable-memory or binary output
Either execute generated bytes via a safe harness or emit assembly/object files that the system toolchain links.
Evidence: clean build script from source program to runnable artifact.
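For the executable-memory route, the sketch below follows the mmap/ctypes approach listed for Smith 014-017. It assumes Linux on x64 and a system that permits writable and executable anonymous mappings, which some hardened configurations do not. The function names are hypothetical, and the only encoding used is `mov eax, imm32` followed by `ret`.

```python
import ctypes
import mmap
import struct


def encode_return_const(value: int) -> bytes:
    """Machine code for: mov eax, imm32 ; ret"""
    return b"\xb8" + struct.pack("<i", value) + b"\xc3"


def run_native(code: bytes) -> int:
    """Copy code into an executable mapping and call it (Linux x64 only)."""
    buf = mmap.mmap(
        -1,
        len(code),
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC,
    )
    buf.write(code)
    # Take the address of the mapping and treat it as int32 (*)(void).
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    func = ctypes.CFUNCTYPE(ctypes.c_int32)(addr)
    return func()
```

On a matching platform, `run_native(encode_return_const(9))` is intended to return 9. Treat this as a harness for bytes your own compiler produced: a wrong encoding crashes the whole Python process, so run it in a subprocess when it is part of a test suite.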
5. Tests & evidence
| Test | Evidence |
|---|---|
| Front-end correctness | AST snapshots and parse errors |
| IR correctness | source and IR side-by-side |
| Assembly correctness | expected instruction patterns for small programs |
| Runtime behavior | compiled program output matches interpreter output |
| Calling convention | documented stack frame and register use |
| Crash boundaries | bad source fails at compile time, not during execution |
6. Compiler design contract
This project needs a stricter boundary between phases than the interpreter project. A useful structure:
frontend/
  lexer, parser, AST
semantic/
  symbols, scopes, type or arity checks
ir/
  virtual instructions, labels, temporaries
codegen/
  target-specific lowering
runtime/
  printing, allocation, strings, system boundary
tests/
  source fixtures, expected IR, expected output
Every phase should have a dump format. If the generated program is wrong, you need to know whether the bug is in parsing, lowering, register/stack placement, branch emission, or runtime linkage.
Target-machine decisions
Document these before codegen:
- integer width
- signedness rules
- caller-saved and callee-saved registers
- stack alignment
- argument passing
- return-value location
- local-variable layout
- how labels become addresses
- how runtime helpers are called
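For x64, even a tiny subset must respect stack alignment around calls. Both the System V AMD64 and Windows x64 calling conventions expect the stack pointer to be 16-byte aligned at each call instruction, and a misaligned call into a C or runtime helper can crash far from the actual mistake, typically inside an aligned SSE load or store.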
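One way to make these decisions explicit is to record them as data that codegen reads, rather than scattering them through the emitter. The sketch below fills in the integer-argument rules of the System V AMD64 convention used on Linux x64; the `TargetABI` class, its field names, and the two helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetABI:
    name: str
    int_bits: int
    arg_regs: tuple       # integer argument registers, in order
    return_reg: str
    callee_saved: tuple   # must be preserved across a call
    stack_align: int      # required rsp alignment at a call instruction


SYSV_X64 = TargetABI(
    name="linux-x64-sysv",
    int_bits=64,
    arg_regs=("rdi", "rsi", "rdx", "rcx", "r8", "r9"),
    return_reg="rax",
    callee_saved=("rbx", "rbp", "r12", "r13", "r14", "r15"),
    stack_align=16,
)


def arg_location(abi, index):
    """Register for integer argument `index`, or None if it goes on the stack."""
    return abi.arg_regs[index] if index < len(abi.arg_regs) else None


def padding_before_call(abi, pushed_bytes):
    """Extra bytes to subtract from rsp so the call site is aligned again.

    `pushed_bytes` counts what was pushed since rsp was last aligned.
    """
    return -pushed_bytes % abi.stack_align
```

With one 8-byte temporary on the stack, `padding_before_call(SYSV_X64, 8)` returns 8, which is the adjustment a stack-machine code generator needs before calling a helper. A Windows x64 target would be a second `TargetABI` value with different registers, which keeps the portability decision in one place.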
IR requirements
The IR should be boring and explicit:
t1 = const 1
t2 = const 2
t3 = add t1, t2
br_if_zero t3, L_else
call print_int, t3
label L_else
ret t3
Avoid lowering directly from AST to assembly until the project is already working. IR gives you testable checkpoints and makes optimization possible later.
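One way to turn the IR into a checkpoint is to make it directly executable. The sketch below is a small evaluator for the textual form shown above, covering only `const`, `add`, `br_if_zero`, `call`, `label`, and `ret`; the `run_ir` name and the `helpers` dictionary are assumptions for illustration.

```python
def run_ir(lines, helpers=None):
    """Execute textual three-address IR and return the value given to ret."""
    helpers = helpers or {}
    prog = [ln.replace(",", " ").split() for ln in lines]
    # Resolve labels up front so forward branches work.
    labels = {p[1]: i for i, p in enumerate(prog) if p[0] == "label"}
    regs, pc = {}, 0
    while pc < len(prog):
        p = prog[pc]
        pc += 1
        if p[0] == "label":
            continue
        if p[0] == "ret":
            return regs[p[1]]
        if p[0] == "br_if_zero":
            if regs[p[1]] == 0:
                pc = labels[p[2]]
        elif p[0] == "call":
            helpers[p[1]](*(regs[a] for a in p[2:]))
        elif p[2] == "const":                 # tN = const K
            regs[p[0]] = int(p[3])
        elif p[2] == "add":                   # tN = add a, b
            regs[p[0]] = regs[p[3]] + regs[p[4]]
        else:
            raise ValueError(f"unknown instruction: {' '.join(p)}")
    raise ValueError("fell off the end without ret")
```

Running the sample above with `helpers={"print_int": print}` prints 3 and returns 3, since `t3` is non-zero and the branch is not taken. If this evaluator and the native output disagree on a program, the bug is in codegen or the runtime rather than in the front end or lowering.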
7. Required design notes
| Design note | Must answer |
|---|---|
| Language subset | Which expressions, statements, and functions are supported? |
| IR shape | Is it stack-based, three-address, SSA-like, or another form? |
| Calling convention | Who owns registers and stack cleanup? |
| Runtime boundary | Which features are compiled directly and which call helpers? |
| Error model | Which errors are compile-time versus runtime? |
| Portability | Is the target Linux x64, Windows x64, WASM, or a toy VM? |
8. Common failure modes
- Skipping IR. Direct AST-to-assembly works for constants, then collapses under branches and calls.
- No golden output tests. Generated assembly needs both text inspection and runtime behavior checks.
- Wrong stack alignment. Calls into C/runtime helpers may crash only on some platforms.
- Confusing lexical scope with storage location. A name maps to a symbol first, then a stack slot/register.
- Branch labels patched too early. Emit symbolic labels first; resolve later.
- No interpreter oracle. A simple interpreter for the same language is the best correctness reference.
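The label advice above can be sketched as a two-pass resolver: the emitter only ever writes symbolic names, and a separate pass turns them into positions once the whole instruction list exists. The tuple format, the set of branch opcodes, and the name `resolve_labels` are assumptions; positions here are instruction indices, whereas a machine-code emitter would track byte offsets instead.

```python
BRANCH_OPS = ("jmp", "br_if_zero")


def resolve_labels(instrs):
    """Replace symbolic branch targets with instruction indices."""
    addresses, pc = {}, 0
    for ins in instrs:                  # pass 1: record label positions
        if ins[0] == "label":
            if ins[1] in addresses:
                raise ValueError(f"duplicate label: {ins[1]}")
            addresses[ins[1]] = pc      # index of the next real instruction
        else:
            pc += 1
    resolved = []
    for ins in instrs:                  # pass 2: patch branch targets
        if ins[0] == "label":
            continue
        if ins[0] in BRANCH_OPS:
            target = ins[-1]
            if target not in addresses:
                raise ValueError(f"unresolved label: {target}")
            ins = (*ins[:-1], addresses[target])
        resolved.append(ins)
    return resolved
```

Because labels are collected before any branch is patched, forward references need no special handling, and an unresolved label becomes a compile-time error instead of a jump to a wrong address.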
9. Portfolio framing
Publish this as an advanced compiler artifact: language grammar, IR reference, codegen design note, calling-convention note, sample source programs, generated assembly, and comparison against the bytecode compiler path.
Reviewer entry point: one script that compiles a tiny program, prints the IR, prints the assembly, runs it, and verifies the result.
10. Deep project spec
Project contract
Build a compiler path from a tiny source language to executable machine behavior. The minimum target is x64 assembly linked by the system toolchain. The advanced target is executable bytes or object/binary output. The project must define language subset, IR, target platform, ABI assumptions, runtime helpers, and compile-time vs runtime error boundaries.
Source-backed reading map
| Source ID | Use for | Required output |
|---|---|---|
| build-your-own/source-to-machine-code-james-smith | IR, virtual instructions, x64 jumps, calls, stack frames, executable memory | IR reference and compile-run pipeline |
| build-your-own/c-compiler-nora-sandler | native compiler staging and C-like control flow | assembly fixture suite and ABI note |
Milestone map
| Milestone | Deliverable | Tests | Failure case |
|---|---|---|---|
| Front end | parser and AST | AST snapshots | syntax error with source span |
| IR | stack or three-address instructions | IR golden fixtures | invalid symbol use |
| Assembly | arithmetic and return | compile-run fixtures | wrong register/stack use |
| Branches | labels, comparisons, loops | if/while fixtures | unresolved label |
| Functions | calls, locals, returns | recursion fixture | arity/stack-frame mismatch |
| Runtime helpers | print/string/memory helper | linked helper test | ABI violation |
| Binary path | executable bytes or object output | clean build command | unsafe executable-memory note |
Test matrix
| Test type | Required examples |
|---|---|
| Golden | AST, IR, generated assembly |
| Differential | compare compiled program to interpreter result |
| Integration | source file to runnable artifact |
| Negative | malformed source, bad arity, unsupported feature |
| Benchmark | interpreter vs compiled arithmetic/loop workload |
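The differential row can be sketched as below. To stay self-contained, the "compiled" side here is a tiny stack-code compiler and runner standing in for the real native pipeline; in the actual project that side would build and run the generated artifact. All function names and the nested-list AST format are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def interpret(ast):
    """Reference oracle: evaluate the nested-list AST directly."""
    if isinstance(ast, int):
        return ast
    op, lhs, rhs = ast
    return OPS[op](interpret(lhs), interpret(rhs))


def compile_to_stack(ast, out=None):
    """Stand-in for the compiler: emit postfix stack code."""
    out = [] if out is None else out
    if isinstance(ast, int):
        out.append(("push", ast))
    else:
        op, lhs, rhs = ast
        compile_to_stack(lhs, out)
        compile_to_stack(rhs, out)
        out.append((op,))
    return out


def run_stack(code):
    """Stand-in for running the compiled artifact."""
    stack = []
    for ins in code:
        if ins[0] == "push":
            stack.append(ins[1])
        else:
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(OPS[ins[0]](lhs, rhs))
    return stack.pop()


def differential(fixtures):
    """Return the fixtures on which the two paths disagree."""
    return [f for f in fixtures
            if interpret(f) != run_stack(compile_to_stack(f))]
```

An empty result from `differential` means both paths agree on every fixture. Non-commutative operators such as `-` are worth including in the fixture set, since swapped operand order is a common codegen bug that `+` and `*` hide.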
Design notes required
- language.md: supported syntax and deliberately unsupported features.
- ir.md: instruction format, temporaries, labels, stack effect.
- abi.md: target OS/architecture, registers, stack alignment, calls.
- runtime.md: helpers and ownership of memory/string values.
Portfolio evidence
Publish source, IR, assembly, executable output, compile script, ABI note, and one regression test for a codegen bug.
Source
This project is based on the local James Smith from-source-code-to-machine-code chunks and the Nora Sandler C compiler chunks.