Build From Source Code to Machine Code

This project is the native-code path through compiler construction. Instead of stopping at a tree-walking interpreter or bytecode VM, you lower a tiny language into executable machine behavior: virtual instructions, real assembly, calling conventions, stack frames, executable memory, and eventually object or binary layout.

Use this after the Interpreter project, or as the advanced path after the Compiler project.

1. Overview & motivation

Build a small compiler pipeline:

source -> parser -> AST -> semantic checks -> virtual instructions
       -> assembly or machine code -> executable function/program

The language can be small: integer expressions, variables, conditionals, loops, and functions. The value is in the lowering path.

2. Where this fits in the degree

Phase: Systems
Semester: 4 (Systems Programming)
Modules deepened: Module 2 (machine representation), Module 3 (computer organization), Module 5 (abstraction and interpretation)

Cross-phase relevance:

Builds on Compiler but changes the target from bytecode to machine code.
Connects to Toy Computer: a target machine is a contract.
Helps explain how language runtimes, JITs, FFI, and executable formats work.

3. Local source backbone

Primary local source:

From Source Code to Machine Code (build-your-own/source-to-machine-code-james-smith)

Supporting local source:

Writing a C Compiler (build-your-own/c-compiler-nora-sandler)

Local chunks	Use them for	Project milestone
Smith `001`-`003`	S-expressions, starting code, variables/scopes, testing, computer basics	Parser, AST, scopes, and initial tests
Smith `004`-`008`	Data stack, virtual instructions, non-local variables, labels, functions	VM-like IR and function representation
Smith `009`-`013`	x64 jumps, flags, syscalls, ModR/M, calls, returns, stack frames	Native code generation and calling convention
Smith `014`-`017`	mmap/ctypes, file layout, entry point, pointers, strings	Executable memory, binary shape, runtime helpers
Sandler `002`-`013`	C subset parsing, AST, expression and operator codegen	Native compiler comparison path
Sandler `014`-`024`	Control flow, comparisons, condition codes	Branching and comparison test suite

4. Implementation milestones

Milestone 1: Tiny language front end

Parse constants, arithmetic, variables, if, while, and function definitions.

Evidence: AST snapshot tests for valid and invalid programs.
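As a starting point for the front end, here is a minimal sketch of an S-expression reader, assuming the tiny language uses S-expression syntax as in the Smith chunks. The function names `tokenize` and `parse`, and the choice of nested Python lists as the AST, are illustrative assumptions rather than anything the sources prescribe.

```python
def tokenize(src: str) -> list[str]:
    """Split source text into parentheses and atoms."""
    return src.replace("(", " ( ").replace(")", " ) ").split()


def parse(src: str):
    """Parse one S-expression into nested lists of ints and symbol strings."""
    tokens = tokenize(src)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            pos += 1  # consume the closing parenthesis
            return items
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        try:
            return int(tok)  # integer constant
        except ValueError:
            return tok  # anything else is a symbol

    node = read()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return node
```

With this sketch, `parse("(* (+ 1 2) 3)")` yields `["*", ["+", 1, 2], 3]`, which can be stored directly as a snapshot fixture. A real front end would also attach source spans to each node so that errors can point at a location.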

Milestone 2: Virtual instruction IR

Lower AST into a simple stack or three-address instruction list.

Evidence: IR dump for each sample program.
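One possible shape for this lowering step, sketched under the assumption that the AST is a nested list such as `["*", ["+", 1, 2], 3]` and that the IR is stack-based. The instruction names (`push`, `add`, `sub`, `mul`) and the helpers `lower` and `dump` are hypothetical, chosen only for illustration.

```python
OPS = {"+": "add", "-": "sub", "*": "mul"}


def lower(node, out=None):
    """Lower an arithmetic AST into a flat list of stack instructions."""
    if out is None:
        out = []
    if isinstance(node, int):
        out.append(("push", node))
    elif isinstance(node, list) and len(node) == 3 and node[0] in OPS:
        lower(node[1], out)  # left operand is pushed first
        lower(node[2], out)  # right operand ends up on top of the stack
        out.append((OPS[node[0]],))
    else:
        raise ValueError(f"unsupported node: {node!r}")
    return out


def dump(ir) -> str:
    """Render the IR one instruction per line, suitable for golden files."""
    return "\n".join(" ".join(str(part) for part in ins) for ins in ir)
```

For `(1 + 2) * 3` the dump is five lines: `push 1`, `push 2`, `add`, `push 3`, `mul`. Keeping the dump format this plain makes IR fixtures easy to diff.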

Milestone 3: x64 assembly path

Lower virtual instructions to x64 assembly. Start with integer arithmetic and returns.

Evidence: compile and run a program that returns (1 + 2) * 3, and check that the result is 9.
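A deliberately naive sketch of this step, assuming a stack-style IR made of `("push", n)`, `("add",)`, `("sub",)`, and `("mul",)` tuples, a Linux x64 target, and the GNU assembler in Intel-syntax mode. The function name `emit_x64` is hypothetical. The sketch uses the hardware stack as the operand stack, which is slow but easy to verify; `push` with an immediate only covers constants that fit in 32 bits.

```python
BINOPS = {
    "add": "add rax, rcx",
    "sub": "sub rax, rcx",
    "mul": "imul rax, rcx",
}


def emit_x64(ir, name="main") -> str:
    """Translate stack IR into assembly text for one function."""
    lines = [".intel_syntax noprefix", f".globl {name}", f"{name}:"]
    for ins in ir:
        op = ins[0]
        if op == "push":
            lines.append(f"    push {ins[1]}")
        elif op in BINOPS:
            lines.append("    pop rcx")  # right operand
            lines.append("    pop rax")  # left operand
            lines.append(f"    {BINOPS[op]}")
            lines.append("    push rax")  # result back on the stack
        else:
            raise ValueError(f"unsupported instruction: {ins!r}")
    lines.append("    pop rax")  # final value becomes the return value
    lines.append("    ret")
    return "\n".join(lines) + "\n"
```

Writing the result to a `.s` file and building it with the system C compiler driver should give a program whose exit status is the computed value, 9 for the IR of `(1 + 2) * 3`. The exact build command depends on your toolchain.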

Milestone 4: Branches and flags

Implement comparisons, conditional jumps, labels, and loops.

Evidence: generated assembly for if, while, <, <=, ==, and !=.
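The comparison operators listed above map onto x64 condition codes. A small sketch of that mapping, assuming signed integers, Intel syntax, and the same push/pop operand discipline as a stack IR. The IR names `lt`, `le`, `eq`, `ne` and both helper functions are hypothetical.

```python
# Signed comparisons: IR operation -> setcc mnemonic.
SETCC = {"lt": "setl", "le": "setle", "eq": "sete", "ne": "setne"}


def emit_compare(op: str) -> list[str]:
    """Pop two operands, push 1 if `left op right` holds, else 0."""
    return [
        "pop rcx",            # right operand
        "pop rax",            # left operand
        "cmp rax, rcx",       # set flags from left - right
        f"{SETCC[op]} al",    # materialize the flag as a byte
        "movzx eax, al",      # zero-extend; also clears the upper half of rax
        "push rax",
    ]


def emit_branch_if_zero(label: str) -> list[str]:
    """Pop a value and jump to `label` when it is zero."""
    return ["pop rax", "test rax, rax", f"jz {label}"]
```

An `if` then lowers to the condition code, `emit_branch_if_zero("L_else")`, the then-branch, a jump over the else-branch, and the two labels. Unsigned comparisons would need `setb`/`setbe` instead of `setl`/`setle`, which is one reason to document signedness before writing codegen.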

Milestone 5: Calls and stack frames

Implement function calls, parameters, locals, return values, and recursive calls.

Evidence: factorial or Fibonacci runs correctly; stack-frame layout is documented.

Milestone 6: Runtime helpers

Add memory access, strings, printing, or a tiny standard library.

Evidence: program calls one runtime helper and returns safely.

Milestone 7: Executable-memory or binary output

Either execute generated bytes via a safe harness or emit assembly/object files that the system toolchain links.

Evidence: clean build script from source program to runnable artifact.
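For the executable-memory route, the following is a minimal sketch using Python's `mmap` and `ctypes` modules. It assumes Linux on x64 and a system that permits writable-and-executable anonymous mappings (some hardened configurations refuse them). The names `encode_return_const` and `run_native` are hypothetical, and the generated function does nothing except return a constant.

```python
import ctypes
import mmap
import platform
import struct
import sys


def encode_return_const(value: int) -> bytes:
    """Machine code for: mov eax, imm32 ; ret."""
    return b"\xb8" + struct.pack("<i", value) + b"\xc3"


def run_native(code: bytes) -> int:
    """Copy code into an executable page and call it as int f(void)."""
    on_linux_x64 = sys.platform.startswith("linux") and platform.machine() in (
        "x86_64",
        "AMD64",
    )
    if not on_linux_x64:
        raise RuntimeError("this sketch only targets Linux x64")
    if len(code) > mmap.PAGESIZE:
        raise ValueError("code does not fit in one page")
    buf = mmap.mmap(
        -1,
        mmap.PAGESIZE,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC,
    )
    buf.write(code)
    # Keep `anchor` alive while the function pointer is in use; the mapping
    # is released when these objects are garbage collected.
    anchor = ctypes.c_char.from_buffer(buf)
    func = ctypes.CFUNCTYPE(ctypes.c_int)(ctypes.addressof(anchor))
    return func()
```

On a supported machine, `run_native(encode_return_const(9))` should return 9. Treat this harness as unsafe by construction: any encoding bug executes as real machine code inside the Python process, so run it only on bytes your own compiler produced.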

5. Tests & evidence

Test	Evidence
Front-end correctness	AST snapshots and parse errors
IR correctness	source and IR side-by-side
Assembly correctness	expected instruction patterns for small programs
Runtime behavior	compiled program output matches interpreter output
Calling convention	documented stack frame and register use
Crash boundaries	bad source fails at compile time, not during execution
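The runtime-behavior row is a differential test: run the same program through a reference interpreter and through the compiled path, then compare. The sketch below shows the shape of such a harness. As a stand-in for the native backend it uses a tiny stack machine, which is an assumption made to keep the example self-contained; in the real project `run_stack` would be replaced by building and executing the generated artifact. All function names here are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def interpret(node):
    """Reference oracle: evaluate the AST directly."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))


def compile_to_stack(node, out):
    """Stand-in compiler: AST -> stack code."""
    if isinstance(node, int):
        out.append(("push", node))
    else:
        op, left, right = node
        compile_to_stack(left, out)
        compile_to_stack(right, out)
        out.append(("apply", op))
    return out


def run_stack(code):
    """Stand-in for running the compiled program."""
    stack = []
    for kind, arg in code:
        if kind == "push":
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()


def differential_check(programs):
    """Return (ast, expected, actual) for every program where the paths disagree."""
    failures = []
    for ast in programs:
        expected = interpret(ast)
        actual = run_stack(compile_to_stack(ast, []))
        if expected != actual:
            failures.append((ast, expected, actual))
    return failures
```

An empty list from `differential_check` means the two paths agree on every fixture. Operand order bugs in subtraction are a typical thing this catches, so include at least one non-commutative case.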

6. Compiler design contract

This project needs a stricter boundary between phases than the interpreter project. A useful structure:

frontend/
  lexer, parser, AST
semantic/
  symbols, scopes, type or arity checks
ir/
  virtual instructions, labels, temporaries
codegen/
  target-specific lowering
runtime/
  printing, allocation, strings, system boundary
tests/
  source fixtures, expected IR, expected output

Every phase should have a dump format. If the generated program is wrong, you need to know whether the bug is in parsing, lowering, register/stack placement, branch emission, or runtime linkage.

Target-machine decisions

Document these before codegen:

integer width
signedness rules
caller-saved and callee-saved registers
stack alignment
argument passing
return-value location
local-variable layout
how labels become addresses
how runtime helpers are called

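For x64, even a tiny subset must keep the stack pointer 16-byte aligned at every call instruction. Both the System V and Windows x64 calling conventions require this, and a misaligned call into C or a runtime helper can crash in confusing ways, for example on an aligned SSE load or store deep inside the callee.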
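Local-variable layout and stack alignment can be decided together. A minimal sketch, assuming 8-byte slots addressed relative to `rbp` and a conventional `push rbp` / `mov rbp, rsp` prologue, after which `rsp` is 16-byte aligned under the System V convention. The helper names `layout_frame` and `prologue` are hypothetical.

```python
def layout_frame(local_names, slot_size=8, alignment=16):
    """Assign each local an rbp-relative offset and round the frame size up."""
    offsets = {
        name: -slot_size * (index + 1) for index, name in enumerate(local_names)
    }
    raw_size = slot_size * len(local_names)
    frame_size = (raw_size + alignment - 1) // alignment * alignment
    return offsets, frame_size


def prologue(frame_size: int) -> list[str]:
    """Emit the function prologue; reserving a multiple of 16 keeps rsp aligned."""
    lines = ["push rbp", "mov rbp, rsp"]
    if frame_size:
        lines.append(f"sub rsp, {frame_size}")
    return lines
```

Three locals need 24 bytes, which rounds up to a 32-byte frame, with the first local at `[rbp - 8]`. This only holds if the body does not leave an odd number of 8-byte pushes outstanding when it reaches a call, which is worth stating explicitly in the calling-convention note.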

IR requirements

The IR should be boring and explicit:

t1 = const 1
t2 = const 2
t3 = add t1, t2
br_if_zero t3, L_else
call print_int, t3
label L_else
ret t3

Avoid lowering directly from AST to assembly until the project is already working. IR gives you testable checkpoints and makes optimization possible later.
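One way to make the IR a testable checkpoint is to give it a direct evaluator. The sketch below encodes the instructions from the listing above as tuples and runs them. The tuple encoding and the `run_ir` function are illustrative assumptions, and only the operations that appear in that listing are handled.

```python
def run_ir(program, helpers):
    """Execute three-address IR and return the value passed to `ret`."""
    # Resolve label names to instruction indices before running.
    labels = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "label"}
    temps = {}
    pc = 0
    while pc < len(program):
        ins = program[pc]
        op = ins[0]
        if op == "const":
            temps[ins[1]] = ins[2]
        elif op == "add":
            temps[ins[1]] = temps[ins[2]] + temps[ins[3]]
        elif op == "br_if_zero":
            if temps[ins[1]] == 0:
                pc = labels[ins[2]]
                continue
        elif op == "call":
            helpers[ins[1]](*(temps[arg] for arg in ins[2:]))
        elif op == "label":
            pass  # labels are markers, not actions
        elif op == "ret":
            return temps[ins[1]]
        else:
            raise ValueError(f"unknown instruction: {ins!r}")
        pc += 1
    raise RuntimeError("program ended without ret")


program = [
    ("const", "t1", 1),
    ("const", "t2", 2),
    ("add", "t3", "t1", "t2"),
    ("br_if_zero", "t3", "L_else"),
    ("call", "print_int", "t3"),
    ("label", "L_else"),
    ("ret", "t3"),
]
```

Calling `run_ir(program, {"print_int": print})` prints 3 and returns 3. The same evaluator can serve as a second oracle between the interpreter and the native backend, which narrows down whether a bug sits in lowering or in codegen.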

7. Required design notes

Design note	Must answer
Language subset	Which expressions, statements, and functions are supported?
IR shape	Is it stack-based, three-address, SSA-like, or another form?
Calling convention	Who owns registers and stack cleanup?
Runtime boundary	Which features are compiled directly and which call helpers?
Error model	Which errors are compile-time versus runtime?
Portability	Is the target Linux x64, Windows x64, WASM, or a toy VM?

8. Common failure modes

Skipping IR. Direct AST-to-assembly works for constants, then collapses under branches and calls.
No golden output tests. Generated assembly needs both text inspection and runtime behavior checks.
Wrong stack alignment. Calls into C/runtime helpers may crash only on some platforms.
Confusing lexical scope with storage location. A name maps to a symbol first, then a stack slot/register.
Branch labels patched too early. Emit symbolic labels first; resolve later.
No interpreter oracle. A simple interpreter for the same language is the best correctness reference.
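The label failure mode is usually avoided with two passes: first record where each label lands, then encode jumps once every target is known. A minimal sketch for raw x64 bytes, assuming every jump uses the 32-bit relative form (`E9` for `jmp`, `0F 84` for `jz`). The item format and the `assemble` function are hypothetical.

```python
import struct

# Opcode bytes for the rel32 forms; each is followed by a 4-byte displacement.
JUMPS = {"jmp": b"\xe9", "jz": b"\x0f\x84"}


def assemble(items) -> bytes:
    """Items are ("label", name), ("bytes", raw), or (jump_kind, label_name)."""
    # Pass 1: compute the byte offset of every label.
    offsets = {}
    pos = 0
    for kind, arg in items:
        if kind == "label":
            if arg in offsets:
                raise ValueError(f"duplicate label: {arg}")
            offsets[arg] = pos
        elif kind == "bytes":
            pos += len(arg)
        else:
            pos += len(JUMPS[kind]) + 4

    # Pass 2: emit bytes, resolving each jump against the label table.
    out = bytearray()
    for kind, arg in items:
        if kind == "label":
            continue
        if kind == "bytes":
            out += arg
            continue
        if arg not in offsets:
            raise KeyError(f"unresolved label: {arg}")
        end_of_jump = len(out) + len(JUMPS[kind]) + 4
        # The displacement is relative to the end of the jump instruction.
        out += JUMPS[kind] + struct.pack("<i", offsets[arg] - end_of_jump)
    return bytes(out)
```

A backward jump over a single `nop`, `[("label", "top"), ("bytes", b"\x90"), ("jmp", "top")]`, assembles to `90 E9 FA FF FF FF`, a displacement of -6. Choosing shorter 8-bit jump encodings makes instruction sizes depend on label distances, which is why this sketch sticks to one fixed-size form.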

9. Portfolio framing

Publish this as an advanced compiler artifact: language grammar, IR reference, codegen design note, calling-convention note, sample source programs, generated assembly, and comparison against the bytecode compiler path.

Reviewer entry point: one script that compiles a tiny program, prints the IR, prints the assembly, runs it, and verifies the result.

10. Deep project spec

Project contract

Build a compiler path from a tiny source language to executable machine behavior. The minimum target is x64 assembly linked by the system toolchain. The advanced target is executable bytes or object/binary output. The project must define language subset, IR, target platform, ABI assumptions, runtime helpers, and compile-time vs runtime error boundaries.

Source-backed reading map

Source ID	Use for	Required output
`build-your-own/source-to-machine-code-james-smith`	IR, virtual instructions, x64 jumps, calls, stack frames, executable memory	IR reference and compile-run pipeline
`build-your-own/c-compiler-nora-sandler`	native compiler staging and C-like control flow	assembly fixture suite and ABI note

Milestone map

Milestone	Deliverable	Tests	Failure case
Front end	parser and AST	AST snapshots	syntax error with source span
IR	stack or three-address instructions	IR golden fixtures	invalid symbol use
Assembly	arithmetic and return	compile-run fixtures	wrong register/stack use
Branches	labels, comparisons, loops	if/while fixtures	unresolved label
Functions	calls, locals, returns	recursion fixture	arity/stack-frame mismatch
Runtime helpers	print/string/memory helper	linked helper test	ABI violation
Binary path	executable bytes or object output	clean build command	unsafe executable-memory note

Test matrix

Test type	Required examples
Golden	AST, IR, generated assembly
Differential	compare compiled program to interpreter result
Integration	source file to runnable artifact
Negative	malformed source, bad arity, unsupported feature
Benchmark	interpreter vs compiled arithmetic/loop workload

Design notes required

language.md: supported syntax and deliberately unsupported features.
ir.md: instruction format, temporaries, labels, stack effect.
abi.md: target OS/architecture, registers, stack alignment, calls.
runtime.md: helpers and ownership of memory/string values.

Portfolio evidence

Publish source, IR, assembly, executable output, compile script, ABI note, and one regression test for a codegen bug.

Source

This project is based on the local James Smith from-source-code-to-machine-code chunks and the Nora Sandler C compiler chunks.
