Build Your Own Regex Engine

"The regular expression (a|b)*abb is one of the most studied strings in computer science."

A regex engine is a tiny compiler in disguise. The recommended NFA-based approach is lexer -> parser -> AST -> NFA -> matcher -- exactly the pipeline of the Interpreter and Compiler tutorials, applied to a smaller language. Doing this project alongside Crafting Interpreters is the cleanest possible introduction to "compile a small DSL down to an executable representation."

1. Overview & motivation

A regex engine takes a pattern like a*b(c|d) and a text, and decides whether the text matches the pattern (and optionally, where).

Two completely different algorithmic approaches exist:

Backtracking (PCRE, Perl, Python's re, JavaScript): recursive search with exponential worst case (a?a?a?a?aaaa is the classic pathological case). Easy to extend with capture groups, backreferences, lookaround.
Automata-based (Go, RE2, awk, grep): compile pattern to NFA, optionally subset-construct DFA, run in linear time O(n x m). Cannot do backreferences but cannot blow up.

What you can only learn by building one:

Why a* and (a|b)* have wildly different cost profiles
Why "regex" in academic CS is a strict subset of what Perl calls a regex
How Thompson's construction mechanically turns a pattern into an NFA
The subset construction that turns an NFA into a DFA
Why ReDoS (regex denial-of-service) is a real attack class

2. Where this fits in the degree

Phase: Systems
Semester: 4 (Systems Programming)
Modules deepened: Module 5 (abstraction & interpretation) -- the regex engine is a small compiler. Module 1 (C/Go/Python fundamentals) -- substantial multi-file project.

Cross-phase relevance:

Directly pairs with the Interpreter tutorial and Compiler tutorial -- same multi-pass pipeline, smaller surface area.
Sets up the Search Engine tutorial -- Russ Cox's regexp4 article connects regex matching to trigram indexes.
The DP / subset-construction angle ties back to Foundations Sem 2 Module 4.

3. Prerequisites

Comfort with recursion and recursive data structures
Familiarity with stacks and graph traversal (BFS/DFS)
Some experience writing a small parser (the Interpreter tutorial is the perfect prep, but not required)

You do not need formal language theory before starting. You will pick up the relevant parts while building.

4. Theory & research

Required reading (one of these)

Russ Cox, "Regular Expression Matching Can Be Simple And Fast" (swtch.com/~rsc/regexp/regexp1.html) -- the most important article on regex engines. Read this first; it is the single article that motivates the entire project.
Russ Cox, "Regular Expression Matching: the Virtual Machine Approach" (swtch.com/~rsc/regexp/regexp2.html) -- the bytecode-VM model that RE2 and Go's regexp use.

Optional but excellent

Brian Kernighan & Rob Pike, The Practice of Programming, Chapter 9 -- the famous 30-line C regex matcher from the original Beautiful Code book.
Thompson, "Regular Expression Search Algorithm," CACM 1968 -- the original paper. Short, dense, and worth reading once you understand the modern explanations.
Aho, Sethi, Ullman, Compilers: Principles, Techniques, and Tools ("Dragon Book") -- Chapter 3 covers regex -> NFA -> DFA in textbook depth.
Cox, "Regular Expression Matching with a Trigram Index" (swtch.com/~rsc/regexp/regexp4.html) -- for the search-engine connection later.

Background on regex theory

Kleene's theorem: regular expressions and finite automata recognize the same class of languages.
Pumping lemma: explains why a^n b^n is not regular and why backreferences make regex non-regular.

5. Curated tutorial list (from BYO-X)

Preserved verbatim with language tags. Choose one as your primary path.

C: A Regular Expression Matcher -- Kernighan & Pike's classic 30-line matcher
C: Regular Expression Matching Can Be Simple And Fast -- Russ Cox
Go: How to build a regex engine from scratch -- Rhaeguard's tutorial
JavaScript: Build a Regex Engine in Less than 40 Lines of Code -- Nick Drane
JavaScript: How to implement regular expressions in functional javascript using derivatives -- Matt Might
JavaScript: Implementing a Regular Expression Engine -- Deniz KÄ±lÄ±Ã§
Perl: How Regexes Work -- Mark Dominus
Python: Build Your Own Regular Expression Engines: Backtracking, NFA, DFA -- Hsin-Yuan Huang
Scala: No Magic: Regular Expressions -- Rafael Bagmanov

6. Recommended primary path

Beginner: Nick Drane's "Build a Regex Engine in Less than 40 Lines of Code" (JavaScript). Backtracking-based. Twelve readable functions. You can finish it in a weekend.

Intermediate (recommended for this degree): Russ Cox's "Regular Expression Matching Can Be Simple And Fast" + a Go or Python implementation. This gives you the NFA-based linear-time approach that real engines (RE2, Go's regexp) use.

Advanced: Implement Cox's bytecode VM approach (regexp2.html) -- this is the structure of Go's regexp package and is the cleanest mental model for a production engine.

For this degree, the NFA path is the right default. It is more conceptually rewarding than backtracking and it shares its entire pipeline with the Interpreter and Compiler tutorials.

7. Implementation milestones

Milestone 1: Backtracking matcher for a tiny subset (Day 1, ~50 lines)

Support exactly: literal characters, ., *, ^, $. This is the Kernighan & Pike matcher.

def match(pattern: str, text: str) -> bool:
    if pattern.startswith('^'):
        return match_here(pattern[1:], text)
    while True:
        if match_here(pattern, text):
            return True
        if not text:
            return False
        text = text[1:]

def match_here(pattern: str, text: str) -> bool:
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == '*':
        return match_star(pattern[0], pattern[2:], text)
    if pattern == '$':
        return text == ''
    if text and (pattern[0] == '.' or pattern[0] == text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c: str, pattern: str, text: str) -> bool:
    while True:
        if match_here(pattern, text):
            return True
        if not text or (text[0] != c and c != '.'):
            return False
        text = text[1:]

Evidence: Tests for ^a$, a.b, a*, ^a*b$ against the textbook examples.

Milestone 2: Extend pattern syntax (Day 2)

Add: +, ?, character classes [abc], [a-z], negated [^abc], alternation a|b, grouping (ab)c.

This is where the simple backtracking matcher starts to creak. Two implementation choices:

Hack + and ? into the existing structure (works but accumulates special cases)
Rewrite as a proper parser -> AST before matching

Choose the second. You will need it for Milestone 3 anyway.

Grammar:
  regex     := alternation
  alternation := concat ('|' concat)*
  concat    := repeat+
  repeat    := atom ('*' | '+' | '?')?
  atom      := char | '.' | '(' regex ')' | char_class
  char_class := '[' '^'? class_item+ ']'

Evidence: Parse tree for (a|b)*c+ rendered as a tree diagram.

Milestone 3: Compile pattern to NFA (Thompson's construction)

Each AST node becomes a small NFA. Concatenation chains them; alternation forks; * adds an epsilon loop.

@dataclass
class State:
    is_match: bool = False
    # transitions: char -> List[State], plus epsilons
    transitions: dict = field(default_factory=dict)
    epsilons: list = field(default_factory=list)

def compile_char(c):
    out = State(is_match=True)
    s = State(transitions={c: [out]})
    return Fragment(start=s, accepts=[out])

def compile_concat(a, b):
    for accept in a.accepts:
        accept.is_match = False
        accept.epsilons.append(b.start)
    return Fragment(start=a.start, accepts=b.accepts)

def compile_alt(a, b):
    start = State(epsilons=[a.start, b.start])
    return Fragment(start=start, accepts=a.accepts + b.accepts)

def compile_star(a):
    new_accept = State(is_match=True)
    start = State(epsilons=[a.start, new_accept])
    for accept in a.accepts:
        accept.is_match = False
        accept.epsilons.append(start)
    return Fragment(start=start, accepts=[new_accept])

Read Cox's article for the canonical version. The pictures in that article are the single best way to internalize this.

Evidence: Print the NFA as a Graphviz .dot file for two patterns: a*b and (a|b)+c.

Milestone 4: Run the NFA in linear time

Maintain a set of currently-active states. On each input character, compute the next set. Don't recurse -- that's how you get exponential blowup.

def epsilon_closure(states):
    result = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for e in s.epsilons:
            if e not in result:
                result.add(e)
                stack.append(e)
    return result

def step(current, char):
    next_states = set()
    for s in current:
        if char in s.transitions:
            next_states.update(s.transitions[char])
    return epsilon_closure(next_states)

def match(nfa, text):
    current = epsilon_closure({nfa.start})
    for c in text:
        current = step(current, c)
    return any(s.is_match for s in current)

Evidence: Show that a?a?a?aaaa (matched against aaaa) runs in milliseconds on your NFA matcher and minutes on Python's stdlib re for a?^n a^n.

Milestone 5: NFA -> DFA via subset construction (optional)

Pre-compute every reachable subset of NFA states; cache transitions. Match becomes a single state-machine walk per character.

This is where regex matching becomes truly linear time. Worth doing once to internalize the construction.

Evidence: Benchmark NFA vs DFA on patterns of varying size. Note the state explosion blowup on certain patterns (the classic DFA worst case is 2^n).

Milestone 6: Capture groups (only if doing NFA-VM approach)

Read Cox's regexp2.html. The bytecode-VM model handles captures by tracking them as side state during the match.

This is the cleanest way to add captures without falling back to backtracking. Cox spells it out completely.

Evidence: Match (\d+)-(\d+) against 2024-11 and recover both groups.

8. Tests & evidence

Test category	Examples
Correctness -- basics	`a`, `abc`, `^abc$`, `.b.`, empty pattern
Correctness -- quantifiers	`a`, `a+`, `a?`, `(ab)`, `a{3}`
Correctness -- alternation/groups	`a\|b`, `(a\|b)c`, `((a\|b)c)+`
Correctness -- anchors	`^a`, `a$`, `^$`
Pathological inputs	`a?^n a^n` for n=10, 20, 30 -- should run in microseconds with NFA, blow up with naive backtracking
ReDoS demonstration	`(a+)+$` on `aaaaaaaaaaaaaaaaaaa!` -- show backtracking blowup
Edge cases	Empty text, empty pattern, pattern with only quantifiers (invalid)

A real submission must include the ReDoS demonstration with measured times.

9. Common pitfalls

Treating * and + as one-off special cases. You will paint yourself into a corner. Build the parser/AST first.
Confusing NFA states with DFA states. An NFA is in a set of states at once; a DFA is in exactly one.
Forgetting epsilon closures. The single most common bug. The NFA can move on Îµ without consuming input -- at every step.
Using recursion for the NFA simulation. It works for small inputs and exposes you to stack overflow on large ones. Use explicit queue/set.
Confusing NFA "accept" with "match done." Match is done when the input is consumed; accept depends on whether you're in an accepting state at that point.
Implementing backreferences in the NFA path. You cannot. Backreferences break Kleene-regularity and are why PCRE has exponential worst-case behavior.

10. Extensions

Unicode support -- characters are not bytes. Build the NFA over code points.
Lookahead and lookbehind -- surprisingly hard. Read about derivatives of regular expressions (Brzozowski).
JIT compile to native code -- see RE2's Prog and the bytecode VM. Russ Cox's articles cover this directly.
Anchored vs unanchored search -- left-most longest match (POSIX) vs left-most first match (Perl/PCRE).
Regex differentiation (Brzozowski) -- Matt Might's article. A wildly different algorithm that produces clean code.

11. Module integration

Module	What the regex engine deepens
Sem 4 Module 1 -- C/Go/Python fundamentals	Multi-file project. Parser/AST/compiler separation.
Sem 4 Module 5 -- Abstraction & interpretation	The regex engine is a small compiler. The pipeline matches Interpreter and Compiler exactly.
Sem 2 Module 4 -- Dynamic programming (Foundations carry-over)	Subset construction is essentially memoized state-set computation.
Sem 2 Module 5 -- Advanced structures (Foundations carry-over)	Efficient DFA state-set representation (bitsets, hash sets).
Search Engine tutorial	Cox's "regexp4" article shows how trigram indexes accelerate regex queries -- directly relevant.

12. Portfolio framing

What to publish:

The engine source, well-organized into lexer/, parser/, compiler/ (pattern -> NFA), vm/ (matcher).
A README that compares your engine to re (Python) or regexp (Go) on the ReDoS test.
One short post-mortem: which features you implemented, which you skipped, and why.

What to keep private:

Failed early attempts (keep them in notes/scratch/ but do not publish).
Raw notes from Cox's articles.

Reviewer entry points:

Start at the match() function (the simulator).
Then compile_to_nfa() (Thompson's construction).
Then the parser.
The README should call out the ReDoS test as the single thing a reviewer should run.

13. Local source backbone

Use Build Your Own Programming Language: DFA-focused chunks (build-your-own/programming-language-affan) as a DFA-heavy companion for the regex and lexer portions of this project.

Local chunks	Use them for	Add to this project
`003`-`008`	Language, grammar, abstraction, lexer/parser split	Add a short lexer-versus-parser design note before regex parsing.
`009`-`022`	Deterministic finite automata examples and construction practice	Expand Milestone 5 with hand-built DFA tables before generated subset construction.

Extra checkpoints from the book chunks

DFA table checkpoint: draw transition tables for identifiers, integers, whitespace, and operators.
Lexer checkpoint: turn the DFA tables into a scanner for a tiny expression language.
Regex checkpoint: compare hand-built DFA behavior with generated NFA-to-DFA output.
Failure checkpoint: document one pattern where DFA construction explodes in state count.

14. Deep project spec

Project contract

Build a regex engine for a documented subset: literals, concatenation, alternation, grouping, *, +, ?, character classes, anchors if included, and escaping rules. The engine must state whether it uses backtracking, Thompson NFA simulation, DFA construction, or multiple implementations for comparison.

Source-backed reading map

Source ID	Use for	Required output
`build-your-own/programming-language-affan`	DFA/NFA thinking and lexer construction	automata diagrams and DFA table
`build-your-own/programming-language-jeffery`	tokenization context and grammar limits	note contrasting regex and parser grammars

Milestone map

Milestone	Deliverable	Tests	Failure case
Parser	regex AST	AST snapshots	unbalanced group
NFA builder	Thompson fragments	graph/dump fixtures	invalid quantifier placement
Matcher	NFA simulation	full-match and search fixtures	empty-string loop
DFA converter	subset construction	state-table fixtures	exponential-state warning
Character classes	ranges and negation	class membership tests	malformed range
Benchmark	compare engines	timing report	catastrophic backtracking pattern

Test matrix

Test type	Required examples
Golden	AST, NFA, and DFA text dumps
Differential	compare accepted subset against a host-language regex engine
Property	generated strings match equivalence across NFA and DFA engines
Performance	backtracking vs NFA vs DFA on worst-case patterns

Design notes required

syntax.md: supported regex grammar and precedence.
automata.md: AST to NFA to DFA pipeline and state representation.
matching.md: full-match vs search semantics, Unicode/ASCII decision, empty-match behavior.

Portfolio evidence

Publish the regex grammar, automata dumps for three patterns, benchmark results for a catastrophic-backtracking case, and one note explaining why the chosen engine avoids or accepts that risk.

Source

This tutorial cites the BYO-X catalog entry for "Regex Engine". Russ Cox's articles are the primary technical source.

1. Overview & motivation​

2. Where this fits in the degree​

3. Prerequisites​

4. Theory & research​

Required reading (one of these)​

Optional but excellent​

Background on regex theory​

5. Curated tutorial list (from BYO-X)​

6. Recommended primary path​

7. Implementation milestones​

Milestone 1: Backtracking matcher for a tiny subset (Day 1, ~50 lines)​

Milestone 2: Extend pattern syntax (Day 2)​

Milestone 3: Compile pattern to NFA (Thompson's construction)​

Milestone 4: Run the NFA in linear time​

Milestone 5: NFA -> DFA via subset construction (optional)​

Milestone 6: Capture groups (only if doing NFA-VM approach)​

8. Tests & evidence​

9. Common pitfalls​

10. Extensions​

11. Module integration​

12. Portfolio framing​

13. Local source backbone​

Extra checkpoints from the book chunks​

14. Deep project spec​

Project contract​

Source-backed reading map​

Milestone map​

Test matrix​

Design notes required​

Portfolio evidence​

Source​