Edit Distance and Sequence Alignment
What This Concept Is
Given strings $s$ of length $m$ and $t$ of length $n$, the edit distance (Levenshtein distance) is the minimum number of single-character operations -- insert, delete, substitute -- that transform $s$ into $t$. Let $dp[i][j]$ be the edit distance between the prefix $s[1..i]$ and $t[1..j]$. The recurrence:
$$dp[i][j] = \min\begin{cases} dp[i-1][j] + 1 & \text{delete } s_i \ dp[i][j-1] + 1 & \text{insert } t_j \ dp[i-1][j-1] + [s_i \ne t_j] & \text{match or substitute} \end{cases}$$
with base cases $dp[0][j] = j$ (insert $j$ characters into the empty prefix) and $dp[i][0] = i$ (delete $i$ characters).
Edit distance is the canonical two-sequence / sequence alignment DP. Variants share the same shape -- a 2D table with transitions from three neighboring cells -- but differ in cost structure:
- Longest common subsequence (LCS). Counts matches rather than minimizes operations.
- Weighted edit distance. Per-operation costs $c_{\text{ins}}, c_{\text{del}}, c_{\text{sub}}$; Skiena 8.2.4.
- Needleman-Wunsch global alignment. Biosequence alignment; adds match score and gap penalty.
- Smith-Waterman local alignment. Allows "reset to 0," finds the best locally-aligned region.
- Affine gap penalties. Opening a gap costs more than extending it; requires three DP layers
(match, gap_s, gap_t). - Damerau-Levenshtein. Adds transposition of adjacent characters at cost 1.
Why It Matters Here
Edit distance is the hinge problem for two-sequence DP. Once you understand it, LCS, diff, DNA alignment, spell-checker ranking, and many NLP similarity scores become the same problem with different costs. The 2D table and the three-direction transition are worth internalizing deeply; you will see them transformed many times in later modules.
Its complexity is $O(mn)$ time and $O(mn)$ space for the full table, $O(\min(m, n))$ space with rolling + Hirschberg for reconstruction. For $m = n = 10^5$ this is tight but feasible; above that you need more specialized structures (bit-parallel Myers, banded alignment, or approximate distances).
Concrete Examples
Example 1: s = "kitten", t = "sitting". Edit distance = $3$ (substitute k->s, substitute e->i, insert g).
Full $dp$ table (rows indexed by $s$, columns by $t$):
| ε | s | i | t | t | i | n | g | |
|---|---|---|---|---|---|---|---|---|
| ε | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| k | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| i | 2 | 2 | 1 | 2 | 3 | 4 | 5 | 6 |
| t | 3 | 3 | 2 | 1 | 2 | 3 | 4 | 5 |
| t | 4 | 4 | 3 | 2 | 1 | 2 | 3 | 4 |
| e | 5 | 5 | 4 | 3 | 2 | 2 | 3 | 4 |
| n | 6 | 6 | 5 | 4 | 3 | 3 | 2 | 3 |
Answer $dp[m][n] = 3$.
Reconstruction. From $dp[6][7]$, at each step choose the predecessor whose value matches the recurrence. Walking back yields: insert g, match n, match i, substitute e->i, match t, match t, match i, substitute k->s -- reported as an edit script in the original direction.
Example 2: LCS via the same table shape.
$$\mathrm{lcs}[i][j] = \begin{cases} \mathrm{lcs}[i-1][j-1] + 1 & s_i = t_j \ \max\big(\mathrm{lcs}[i-1][j],\ \mathrm{lcs}[i][j-1]\big) & s_i \ne t_j \end{cases}$$
For $s = \texttt{"ABCBDAB"}$, $t = \texttt{"BDCABA"}$, $\mathrm{lcs}[7][6] = 4$, realized by BCBA or BDAB.
Top-down memoized edit distance (same recurrence, top-down form):
@lru_cache(None)
def ed(i, j):
if i == 0: return j
if j == 0: return i
if s[i-1] == t[j-1]:
return ed(i-1, j-1)
return 1 + min(ed(i-1, j), ed(i, j-1), ed(i-1, j-1))
For ed(6,7) with kitten/sitting, only the reachable $(m+1)(n+1)=56$ states get computed -- identical to the tabulated version, but driven by input.
Example 3: Affine gap penalties. A standard model in biological alignment: gap opening cost $o$, gap extension cost $e$, with $o \gg e$. Three DP layers are needed:
$$M[i][j] = \max\big(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]\big) + \sigma(s_i, t_j),$$ $$X[i][j] = \max\big(M[i-1][j] - o,\ X[i-1][j] - e\big),$$ $$Y[i][j] = \max\big(M[i][j-1] - o,\ Y[i][j-1] - e\big),$$
where $M$ tracks "ending with a match," $X$ tracks "ending with gap in $t$," $Y$ tracks "ending with gap in $s$." Space still $O(mn)$ per layer but the recurrence shape is a direct generalization of edit distance.
Common Confusion / Misconceptions
"LCS and edit distance are the same." They are not. LCS counts matches, edit distance counts minimum operations. They are related when the cost scheme is specific: $\mathrm{edit}(s, t) = m + n - 2 \cdot \mathrm{lcs}(s, t)$ only when substitutions cost $2$ (effectively a delete + insert), not $1$. Under unit-cost substitution they are different DPs with different recurrences.
"Off-by-one on base cases." Index $0$ must represent the empty prefix, not the first character. Many buggy implementations drop the ε row/column and produce subtly wrong answers. Always size the table as $(m+1) \times (n+1)$.
"Symmetry means direction doesn't matter." Edit distance is symmetric: $\mathrm{edit}(s, t) = \mathrm{edit}(t, s)$. But the reconstruction is direction-sensitive -- transforming $s$ to $t$ emits different edit operations than the reverse. Decide which direction you need.
"Rolling arrays still let you reconstruct." Not directly. Rolling reduces space to $O(\min(m,n))$ but loses the choice-history needed to walk back. For reconstruction in linear space, use Hirschberg's divide-and-conquer: find a middle column match, split, recurse, stitch.
"Damerau-Levenshtein is a minor tweak." With transpositions, the recurrence gains a fourth case: $dp[i-2][j-2] + 1$ when $s_{i-1..i} = t_{j,j-1}$. This is more than cosmetic -- it adds a two-step lookback and changes the reconstruction grammar.
How To Use It
- Allocate a $(m+1) \times (n+1)$ table.
- Initialize $dp[0][j] = j$ and $dp[i][0] = i$.
- Fill row by row, left to right, using the three-way
min. - Read the answer from $dp[m][n]$.
- For reconstruction, at each cell store which of the three branches won, then walk back.
- Space reduction: only the previous row is read, so two rows (or one row plus one scalar for the diagonal term) suffice if you do not need reconstruction.
- For memory-tight alignment with reconstruction, switch to Hirschberg: $O(\min(m,n))$ space, $O(mn)$ time, same answer.
Transfer / Where This Shows Up Later
- S3 Software Design.
diff,patch, and structural merge tools are edit-distance engines with tuned costs (line moves weighted heavier than char edits). The Strategy pattern naturally parameterizes cost functions. - S5 Networking. Fuzzy matching on protocol fields for intrusion detection uses edit-distance variants with bounded bandwidth.
- S6 Databases. Record deduplication and fuzzy joins ("Levenshtein distance < 3") use this exact DP as a similarity metric.
- S7 Architecture. Schema migration planning -- minimum-cost sequence of migrations between two schema versions -- is an edit-distance-in-disguise when migrations have per-step costs.
- Phase 7 AI. Sequence-to-sequence models (Transformers) internalize a soft version of alignment; DTW (dynamic time warping) for time series is edit distance with warp costs. CTC loss in speech recognition is an edit-distance-style marginalization.
Check Yourself
- Write the recurrence from memory, including base cases.
- Why does the diagonal transition cost $0$ when $s_i = t_j$ and $1$ otherwise?
- How would you modify the DP to allow transpositions (swap adjacent characters) at cost 1? What extra state, if any, does it add?
- Prove edit distance is a metric (non-negativity, symmetry, identity, triangle inequality).
- Explain why LCS and edit distance under unit-cost substitution can disagree.
- How does Hirschberg recover the alignment in $O(\min(m,n))$ space?
Mini Drill or Application
- Implement edit distance with a full $(m+1) \times (n+1)$ table and reconstruction that prints the edit script.
- Implement the two-row space-reduced version (no reconstruction). Verify it matches on 20 random inputs.
- Implement LCS with reconstruction.
- Extend edit distance to allow per-operation costs $c_{\text{ins}}, c_{\text{del}}, c_{\text{sub}}$. Verify that setting $c_{\text{sub}} = 2$ and others $= 1$ recovers the LCS-based identity on random strings.
- Implement Needleman-Wunsch alignment with a substitution matrix (BLOSUM-like) and affine gap penalties; align two short amino-acid sequences.
- Implement a 1000-word spell checker that ranks candidates by edit distance to a misspelled input; time it on a 20k-word dictionary.
Read This Only If Stuck
- Skiena: 8.2 Approximate String Matching
- Skiena: 8.2.2 Edit Distance by Dynamic Programming
- Skiena: 8.2.3 Reconstructing the Path
- Skiena: 8.2.4 Varieties of Edit Distance
- CLRS: 14.4 Longest Common Subsequence (Part 1)
- CLRS: 14.4 Longest Common Subsequence (Part 2)
- cp-algorithms: Edit distance
- MIT 6.006 Lec 16: DP Part 2 (LCS, LIS, Coins)