Build Your Own Regex Engine
"The regular expression
(a|b)*abbis one of the most studied strings in computer science."
A regex engine is a tiny compiler in disguise. The recommended NFA-based approach is lexer -> parser -> AST -> NFA -> matcher -- exactly the pipeline of the Interpreter and Compiler tutorials, applied to a smaller language. Doing this project alongside Crafting Interpreters is the cleanest possible introduction to "compile a small DSL down to an executable representation."
1. Overview & motivation
A regex engine takes a pattern like a*b(c|d) and a text, and decides whether the text matches the pattern (and optionally, where).
Two completely different algorithmic approaches exist:
- Backtracking (PCRE, Perl, Python's
re, JavaScript): recursive search with exponential worst case (a?a?a?a?aaaais the classic pathological case). Easy to extend with capture groups, backreferences, lookaround. - Automata-based (Go, RE2, awk, grep): compile pattern to NFA, optionally subset-construct DFA, run in linear time
O(n x m). Cannot do backreferences but cannot blow up.
What you can only learn by building one:
- Why
a*and(a|b)*have wildly different cost profiles - Why "regex" in academic CS is a strict subset of what Perl calls a regex
- How Thompson's construction mechanically turns a pattern into an NFA
- The subset construction that turns an NFA into a DFA
- Why ReDoS (regex denial-of-service) is a real attack class
2. Where this fits in the degree
- Phase: Systems
- Semester: 4 (Systems Programming)
- Modules deepened: Module 5 (abstraction & interpretation) -- the regex engine is a small compiler. Module 1 (C/Go/Python fundamentals) -- substantial multi-file project.
Cross-phase relevance:
- Directly pairs with the Interpreter tutorial and Compiler tutorial -- same multi-pass pipeline, smaller surface area.
- Sets up the Search Engine tutorial -- Russ Cox's regexp4 article connects regex matching to trigram indexes.
- The DP / subset-construction angle ties back to Foundations Sem 2 Module 4.
3. Prerequisites
- Comfort with recursion and recursive data structures
- Familiarity with stacks and graph traversal (BFS/DFS)
- Some experience writing a small parser (the Interpreter tutorial is the perfect prep, but not required)
You do not need formal language theory before starting. You will pick up the relevant parts while building.
4. Theory & research
Required reading (one of these)
- Russ Cox, "Regular Expression Matching Can Be Simple And Fast" (swtch.com/~rsc/regexp/regexp1.html) -- the most important article on regex engines. Read this first; it is the single article that motivates the entire project.
- Russ Cox, "Regular Expression Matching: the Virtual Machine Approach" (swtch.com/~rsc/regexp/regexp2.html) -- the bytecode-VM model that RE2 and Go's
regexpuse.
Optional but excellent
- Brian Kernighan & Rob Pike, The Practice of Programming, Chapter 9 -- the famous 30-line C regex matcher from the original Beautiful Code book.
- Thompson, "Regular Expression Search Algorithm," CACM 1968 -- the original paper. Short, dense, and worth reading once you understand the modern explanations.
- Aho, Sethi, Ullman, Compilers: Principles, Techniques, and Tools ("Dragon Book") -- Chapter 3 covers regex -> NFA -> DFA in textbook depth.
- Cox, "Regular Expression Matching with a Trigram Index" (swtch.com/~rsc/regexp/regexp4.html) -- for the search-engine connection later.
Background on regex theory
- Kleene's theorem: regular expressions and finite automata recognize the same class of languages.
- Pumping lemma: explains why
a^n b^nis not regular and why backreferences make regex non-regular.
5. Curated tutorial list (from BYO-X)
Preserved verbatim with language tags. Choose one as your primary path.
- C: A Regular Expression Matcher -- Kernighan & Pike's classic 30-line matcher
- C: Regular Expression Matching Can Be Simple And Fast -- Russ Cox
- Go: How to build a regex engine from scratch -- Rhaeguard's tutorial
- JavaScript: Build a Regex Engine in Less than 40 Lines of Code -- Nick Drane
- JavaScript: How to implement regular expressions in functional javascript using derivatives -- Matt Might
- JavaScript: Implementing a Regular Expression Engine -- Deniz Kılıç
- Perl: How Regexes Work -- Mark Dominus
- Python: Build Your Own Regular Expression Engines: Backtracking, NFA, DFA -- Hsin-Yuan Huang
- Scala: No Magic: Regular Expressions -- Rafael Bagmanov
6. Recommended primary path
Beginner: Nick Drane's "Build a Regex Engine in Less than 40 Lines of Code" (JavaScript). Backtracking-based. Twelve readable functions. You can finish it in a weekend.
Intermediate (recommended for this degree): Russ Cox's "Regular Expression Matching Can Be Simple And Fast" + a Go or Python implementation. This gives you the NFA-based linear-time approach that real engines (RE2, Go's regexp) use.
Advanced: Implement Cox's bytecode VM approach (regexp2.html) -- this is the structure of Go's regexp package and is the cleanest mental model for a production engine.
For this degree, the NFA path is the right default. It is more conceptually rewarding than backtracking and it shares its entire pipeline with the Interpreter and Compiler tutorials.
7. Implementation milestones
Milestone 1: Backtracking matcher for a tiny subset (Day 1, ~50 lines)
Support exactly: literal characters, ., *, ^, $. This is the Kernighan & Pike matcher.
def match(pattern: str, text: str) -> bool:
if pattern.startswith('^'):
return match_here(pattern[1:], text)
while True:
if match_here(pattern, text):
return True
if not text:
return False
text = text[1:]
def match_here(pattern: str, text: str) -> bool:
if not pattern:
return True
if len(pattern) >= 2 and pattern[1] == '*':
return match_star(pattern[0], pattern[2:], text)
if pattern == '$':
return text == ''
if text and (pattern[0] == '.' or pattern[0] == text[0]):
return match_here(pattern[1:], text[1:])
return False
def match_star(c: str, pattern: str, text: str) -> bool:
while True:
if match_here(pattern, text):
return True
if not text or (text[0] != c and c != '.'):
return False
text = text[1:]
Evidence: Tests for ^a$, a.b, a*, ^a*b$ against the textbook examples.
Milestone 2: Extend pattern syntax (Day 2)
Add: +, ?, character classes [abc], [a-z], negated [^abc], alternation a|b, grouping (ab)c.
This is where the simple backtracking matcher starts to creak. Two implementation choices:
- Hack
+and?into the existing structure (works but accumulates special cases) - Rewrite as a proper parser -> AST before matching
Choose the second. You will need it for Milestone 3 anyway.
Grammar:
regex := alternation
alternation := concat ('|' concat)*
concat := repeat+
repeat := atom ('*' | '+' | '?')?
atom := char | '.' | '(' regex ')' | char_class
char_class := '[' '^'? class_item+ ']'
Evidence: Parse tree for (a|b)*c+ rendered as a tree diagram.
Milestone 3: Compile pattern to NFA (Thompson's construction)
Each AST node becomes a small NFA. Concatenation chains them; alternation forks; * adds an epsilon loop.
@dataclass
class State:
is_match: bool = False
# transitions: char -> List[State], plus epsilons
transitions: dict = field(default_factory=dict)
epsilons: list = field(default_factory=list)
def compile_char(c):
out = State(is_match=True)
s = State(transitions={c: [out]})
return Fragment(start=s, accepts=[out])
def compile_concat(a, b):
for accept in a.accepts:
accept.is_match = False
accept.epsilons.append(b.start)
return Fragment(start=a.start, accepts=b.accepts)
def compile_alt(a, b):
start = State(epsilons=[a.start, b.start])
return Fragment(start=start, accepts=a.accepts + b.accepts)
def compile_star(a):
new_accept = State(is_match=True)
start = State(epsilons=[a.start, new_accept])
for accept in a.accepts:
accept.is_match = False
accept.epsilons.append(start)
return Fragment(start=start, accepts=[new_accept])
Read Cox's article for the canonical version. The pictures in that article are the single best way to internalize this.
Evidence: Print the NFA as a Graphviz .dot file for two patterns: a*b and (a|b)+c.
Milestone 4: Run the NFA in linear time
Maintain a set of currently-active states. On each input character, compute the next set. Don't recurse -- that's how you get exponential blowup.
def epsilon_closure(states):
result = set(states)
stack = list(states)
while stack:
s = stack.pop()
for e in s.epsilons:
if e not in result:
result.add(e)
stack.append(e)
return result
def step(current, char):
next_states = set()
for s in current:
if char in s.transitions:
next_states.update(s.transitions[char])
return epsilon_closure(next_states)
def match(nfa, text):
current = epsilon_closure({nfa.start})
for c in text:
current = step(current, c)
return any(s.is_match for s in current)
Evidence: Show that a?a?a?aaaa (matched against aaaa) runs in milliseconds on your NFA matcher and minutes on Python's stdlib re for a?^n a^n.
Milestone 5: NFA -> DFA via subset construction (optional)
Pre-compute every reachable subset of NFA states; cache transitions. Match becomes a single state-machine walk per character.
This is where regex matching becomes truly linear time. Worth doing once to internalize the construction.
Evidence: Benchmark NFA vs DFA on patterns of varying size. Note the state explosion blowup on certain patterns (the classic DFA worst case is 2^n).
Milestone 6: Capture groups (only if doing NFA-VM approach)
Read Cox's regexp2.html. The bytecode-VM model handles captures by tracking them as side state during the match.
This is the cleanest way to add captures without falling back to backtracking. Cox spells it out completely.
Evidence: Match (\d+)-(\d+) against 2024-11 and recover both groups.
8. Tests & evidence
| Test category | Examples |
|---|---|
| Correctness -- basics | a, abc, ^abc$, .b., empty pattern |
| Correctness -- quantifiers | a*, a+, a?, (ab)*, a{3} |
| Correctness -- alternation/groups | a|b, (a|b)c, ((a|b)c)+ |
| Correctness -- anchors | ^a, a$, ^$ |
| Pathological inputs | a?^n a^n for n=10, 20, 30 -- should run in microseconds with NFA, blow up with naive backtracking |
| ReDoS demonstration | (a+)+$ on aaaaaaaaaaaaaaaaaaa! -- show backtracking blowup |
| Edge cases | Empty text, empty pattern, pattern with only quantifiers (invalid) |
A real submission must include the ReDoS demonstration with measured times.
9. Common pitfalls
- Treating
*and+as one-off special cases. You will paint yourself into a corner. Build the parser/AST first. - Confusing NFA states with DFA states. An NFA is in a set of states at once; a DFA is in exactly one.
- Forgetting epsilon closures. The single most common bug. The NFA can move on
εwithout consuming input -- at every step. - Using recursion for the NFA simulation. It works for small inputs and exposes you to stack overflow on large ones. Use explicit queue/set.
- Confusing NFA "accept" with "match done." Match is done when the input is consumed; accept depends on whether you're in an accepting state at that point.
- Implementing backreferences in the NFA path. You cannot. Backreferences break Kleene-regularity and are why PCRE has exponential worst-case behavior.
10. Extensions
- Unicode support -- characters are not bytes. Build the NFA over code points.
- Lookahead and lookbehind -- surprisingly hard. Read about derivatives of regular expressions (Brzozowski).
- JIT compile to native code -- see RE2's
Progand the bytecode VM. Russ Cox's articles cover this directly. - Anchored vs unanchored search -- left-most longest match (POSIX) vs left-most first match (Perl/PCRE).
- Regex differentiation (Brzozowski) -- Matt Might's article. A wildly different algorithm that produces clean code.
11. Module integration
| Module | What the regex engine deepens |
|---|---|
| Sem 4 Module 1 -- C/Go/Python fundamentals | Multi-file project. Parser/AST/compiler separation. |
| Sem 4 Module 5 -- Abstraction & interpretation | The regex engine is a small compiler. The pipeline matches Interpreter and Compiler exactly. |
| Sem 2 Module 4 -- Dynamic programming (Foundations carry-over) | Subset construction is essentially memoized state-set computation. |
| Sem 2 Module 5 -- Advanced structures (Foundations carry-over) | Efficient DFA state-set representation (bitsets, hash sets). |
| Search Engine tutorial | Cox's "regexp4" article shows how trigram indexes accelerate regex queries -- directly relevant. |
12. Portfolio framing
What to publish:
- The engine source, well-organized into
lexer/,parser/,compiler/(pattern -> NFA),vm/(matcher). - A README that compares your engine to
re(Python) orregexp(Go) on the ReDoS test. - One short post-mortem: which features you implemented, which you skipped, and why.
What to keep private:
- Failed early attempts (keep them in
notes/scratch/but do not publish). - Raw notes from Cox's articles.
Reviewer entry points:
- Start at the
match()function (the simulator). - Then
compile_to_nfa()(Thompson's construction). - Then the parser.
- The README should call out the ReDoS test as the single thing a reviewer should run.
13. Local source backbone
Use Build Your Own Programming Language: DFA-focused chunks (build-your-own/programming-language-affan) as a DFA-heavy companion for the regex and lexer portions of this project.
| Local chunks | Use them for | Add to this project |
|---|---|---|
003-008 | Language, grammar, abstraction, lexer/parser split | Add a short lexer-versus-parser design note before regex parsing. |
009-022 | Deterministic finite automata examples and construction practice | Expand Milestone 5 with hand-built DFA tables before generated subset construction. |
Extra checkpoints from the book chunks
- DFA table checkpoint: draw transition tables for identifiers, integers, whitespace, and operators.
- Lexer checkpoint: turn the DFA tables into a scanner for a tiny expression language.
- Regex checkpoint: compare hand-built DFA behavior with generated NFA-to-DFA output.
- Failure checkpoint: document one pattern where DFA construction explodes in state count.
14. Deep project spec
Project contract
Build a regex engine for a documented subset: literals, concatenation, alternation, grouping, *, +, ?, character classes, anchors if included, and escaping rules. The engine must state whether it uses backtracking, Thompson NFA simulation, DFA construction, or multiple implementations for comparison.
Source-backed reading map
| Source ID | Use for | Required output |
|---|---|---|
build-your-own/programming-language-affan | DFA/NFA thinking and lexer construction | automata diagrams and DFA table |
build-your-own/programming-language-jeffery | tokenization context and grammar limits | note contrasting regex and parser grammars |
Milestone map
| Milestone | Deliverable | Tests | Failure case |
|---|---|---|---|
| Parser | regex AST | AST snapshots | unbalanced group |
| NFA builder | Thompson fragments | graph/dump fixtures | invalid quantifier placement |
| Matcher | NFA simulation | full-match and search fixtures | empty-string loop |
| DFA converter | subset construction | state-table fixtures | exponential-state warning |
| Character classes | ranges and negation | class membership tests | malformed range |
| Benchmark | compare engines | timing report | catastrophic backtracking pattern |
Test matrix
| Test type | Required examples |
|---|---|
| Golden | AST, NFA, and DFA text dumps |
| Differential | compare accepted subset against a host-language regex engine |
| Property | generated strings match equivalence across NFA and DFA engines |
| Performance | backtracking vs NFA vs DFA on worst-case patterns |
Design notes required
syntax.md: supported regex grammar and precedence.automata.md: AST to NFA to DFA pipeline and state representation.matching.md: full-match vs search semantics, Unicode/ASCII decision, empty-match behavior.
Portfolio evidence
Publish the regex grammar, automata dumps for three patterns, benchmark results for a catastrophic-backtracking case, and one note explaining why the chosen engine avoids or accepts that risk.
Source
This tutorial cites the BYO-X catalog entry for "Regex Engine". Russ Cox's articles are the primary technical source.