Edit Distance and Sequence Alignment

What This Concept Is

Given strings $s$ of length $m$ and $t$ of length $n$, the edit distance (Levenshtein distance) is the minimum number of single-character operations -- insert, delete, substitute -- that transform $s$ into $t$. Let $dp[i][j]$ be the edit distance between the prefix $s[1..i]$ and $t[1..j]$. The recurrence:

$$dp[i][j] = \min\begin{cases} dp[i-1][j] + 1 & \text{delete } s_i \ dp[i][j-1] + 1 & \text{insert } t_j \ dp[i-1][j-1] + [s_i \ne t_j] & \text{match or substitute} \end{cases}$$

with base cases $dp[0][j] = j$ (insert $j$ characters into the empty prefix) and $dp[i][0] = i$ (delete $i$ characters).

Edit distance is the canonical two-sequence / sequence alignment DP. Variants share the same shape -- a 2D table with transitions from three neighboring cells -- but differ in cost structure:

Longest common subsequence (LCS). Counts matches rather than minimizes operations.
Weighted edit distance. Per-operation costs $c_{\text{ins}}, c_{\text{del}}, c_{\text{sub}}$; Skiena 8.2.4.
Needleman-Wunsch global alignment. Biosequence alignment; adds match score and gap penalty.
Smith-Waterman local alignment. Allows "reset to 0," finds the best locally-aligned region.
Affine gap penalties. Opening a gap costs more than extending it; requires three DP layers (match, gap_s, gap_t).
Damerau-Levenshtein. Adds transposition of adjacent characters at cost 1.

Why It Matters Here

Edit distance is the hinge problem for two-sequence DP. Once you understand it, LCS, diff, DNA alignment, spell-checker ranking, and many NLP similarity scores become the same problem with different costs. The 2D table and the three-direction transition are worth internalizing deeply; you will see them transformed many times in later modules.

Its complexity is $O(mn)$ time and $O(mn)$ space for the full table, $O(\min(m, n))$ space with rolling + Hirschberg for reconstruction. For $m = n = 10^5$ this is tight but feasible; above that you need more specialized structures (bit-parallel Myers, banded alignment, or approximate distances).

Concrete Examples

Example 1: s = "kitten", t = "sitting". Edit distance = $3$ (substitute k->s, substitute e->i, insert g).

Full $dp$ table (rows indexed by $s$, columns by $t$):

	ε	s	i	t	t	i	n	g
ε	0	1	2	3	4	5	6	7
k	1	1	2	3	4	5	6	7
i	2	2	1	2	3	4	5	6
t	3	3	2	1	2	3	4	5
t	4	4	3	2	1	2	3	4
e	5	5	4	3	2	2	3	4
n	6	6	5	4	3	3	2	3

Answer $dp[m][n] = 3$.

Reconstruction. From $dp[6][7]$, at each step choose the predecessor whose value matches the recurrence. Walking back yields: insert g, match n, match i, substitute e->i, match t, match t, match i, substitute k->s -- reported as an edit script in the original direction.

Example 2: LCS via the same table shape.

$$\mathrm{lcs}[i][j] = \begin{cases} \mathrm{lcs}[i-1][j-1] + 1 & s_i = t_j \ \max\big(\mathrm{lcs}[i-1][j],\ \mathrm{lcs}[i][j-1]\big) & s_i \ne t_j \end{cases}$$

For $s = \texttt{"ABCBDAB"}$, $t = \texttt{"BDCABA"}$, $\mathrm{lcs}[7][6] = 4$, realized by BCBA or BDAB.

Top-down memoized edit distance (same recurrence, top-down form):

@lru_cache(None)
def ed(i, j):
    if i == 0: return j
    if j == 0: return i
    if s[i-1] == t[j-1]:
        return ed(i-1, j-1)
    return 1 + min(ed(i-1, j), ed(i, j-1), ed(i-1, j-1))

For ed(6,7) with kitten/sitting, only the reachable $(m+1)(n+1)=56$ states get computed -- identical to the tabulated version, but driven by input.

Example 3: Affine gap penalties. A standard model in biological alignment: gap opening cost $o$, gap extension cost $e$, with $o \gg e$. Three DP layers are needed:

$$M[i][j] = \max\big(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]\big) + \sigma(s_i, t_j),$$ $$X[i][j] = \max\big(M[i-1][j] - o,\ X[i-1][j] - e\big),$$ $$Y[i][j] = \max\big(M[i][j-1] - o,\ Y[i][j-1] - e\big),$$

where $M$ tracks "ending with a match," $X$ tracks "ending with gap in $t$," $Y$ tracks "ending with gap in $s$." Space still $O(mn)$ per layer but the recurrence shape is a direct generalization of edit distance.

Common Confusion / Misconceptions

"LCS and edit distance are the same." They are not. LCS counts matches, edit distance counts minimum operations. They are related when the cost scheme is specific: $\mathrm{edit}(s, t) = m + n - 2 \cdot \mathrm{lcs}(s, t)$ only when substitutions cost $2$ (effectively a delete + insert), not $1$. Under unit-cost substitution they are different DPs with different recurrences.

"Off-by-one on base cases." Index $0$ must represent the empty prefix, not the first character. Many buggy implementations drop the ε row/column and produce subtly wrong answers. Always size the table as $(m+1) \times (n+1)$.

"Symmetry means direction doesn't matter." Edit distance is symmetric: $\mathrm{edit}(s, t) = \mathrm{edit}(t, s)$. But the reconstruction is direction-sensitive -- transforming $s$ to $t$ emits different edit operations than the reverse. Decide which direction you need.

"Rolling arrays still let you reconstruct." Not directly. Rolling reduces space to $O(\min(m,n))$ but loses the choice-history needed to walk back. For reconstruction in linear space, use Hirschberg's divide-and-conquer: find a middle column match, split, recurse, stitch.

"Damerau-Levenshtein is a minor tweak." With transpositions, the recurrence gains a fourth case: $dp[i-2][j-2] + 1$ when $s_{i-1..i} = t_{j,j-1}$. This is more than cosmetic -- it adds a two-step lookback and changes the reconstruction grammar.

How To Use It

Allocate a $(m+1) \times (n+1)$ table.
Initialize $dp[0][j] = j$ and $dp[i][0] = i$.
Fill row by row, left to right, using the three-way min.
Read the answer from $dp[m][n]$.
For reconstruction, at each cell store which of the three branches won, then walk back.
Space reduction: only the previous row is read, so two rows (or one row plus one scalar for the diagonal term) suffice if you do not need reconstruction.
For memory-tight alignment with reconstruction, switch to Hirschberg: $O(\min(m,n))$ space, $O(mn)$ time, same answer.

Transfer / Where This Shows Up Later

S3 Software Design. diff, patch, and structural merge tools are edit-distance engines with tuned costs (line moves weighted heavier than char edits). The Strategy pattern naturally parameterizes cost functions.
S5 Networking. Fuzzy matching on protocol fields for intrusion detection uses edit-distance variants with bounded bandwidth.
S6 Databases. Record deduplication and fuzzy joins ("Levenshtein distance < 3") use this exact DP as a similarity metric.
S7 Architecture. Schema migration planning -- minimum-cost sequence of migrations between two schema versions -- is an edit-distance-in-disguise when migrations have per-step costs.
Phase 7 AI. Sequence-to-sequence models (Transformers) internalize a soft version of alignment; DTW (dynamic time warping) for time series is edit distance with warp costs. CTC loss in speech recognition is an edit-distance-style marginalization.

Check Yourself

Write the recurrence from memory, including base cases.
Why does the diagonal transition cost $0$ when $s_i = t_j$ and $1$ otherwise?
How would you modify the DP to allow transpositions (swap adjacent characters) at cost 1? What extra state, if any, does it add?
Prove edit distance is a metric (non-negativity, symmetry, identity, triangle inequality).
Explain why LCS and edit distance under unit-cost substitution can disagree.
How does Hirschberg recover the alignment in $O(\min(m,n))$ space?

Mini Drill or Application

Implement edit distance with a full $(m+1) \times (n+1)$ table and reconstruction that prints the edit script.
Implement the two-row space-reduced version (no reconstruction). Verify it matches on 20 random inputs.
Implement LCS with reconstruction.
Extend edit distance to allow per-operation costs $c_{\text{ins}}, c_{\text{del}}, c_{\text{sub}}$. Verify that setting $c_{\text{sub}} = 2$ and others $= 1$ recovers the LCS-based identity on random strings.
Implement Needleman-Wunsch alignment with a substitution matrix (BLOSUM-like) and affine gap penalties; align two short amino-acid sequences.
Implement a 1000-word spell checker that ranks candidates by edit distance to a misspelled input; time it on a 20k-word dictionary.

	ε	s	i	t	t	i	n	g
ε	0	1	2	3	4	5	6	7
k	1	1	2	3	4	5	6	7
i	2	2	1	2	3	4	5	6
t	3	3	2	1	2	3	4	5
t	4	4	3	2	1	2	3	4
e	5	5	4	3	2	2	3	4
n	6	6	5	4	3	3	2	3

	ε	s	i	t	t	i	n	g
ε	0	1	2	3	4	5	6	7
k	1	1	2	3	4	5	6	7
i	2	2	1	2	3	4	5	6
t	3	3	2	1	2	3	4	5
t	4	4	3	2	1	2	3	4
e	5	5	4	3	2	2	3	4
n	6	6	5	4	3	3	2	3

What This Concept Is​

Why It Matters Here​

Concrete Examples​

Common Confusion / Misconceptions​

How To Use It​

Transfer / Where This Shows Up Later​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​