Skip to main content

Edit Distance and Sequence Alignment

What This Concept Is

Given strings $s$ of length $m$ and $t$ of length $n$, the edit distance (Levenshtein distance) is the minimum number of single-character operations -- insert, delete, substitute -- that transform $s$ into $t$. Let $dp[i][j]$ be the edit distance between the prefix $s[1..i]$ and $t[1..j]$. The recurrence:

$$dp[i][j] = \min\begin{cases} dp[i-1][j] + 1 & \text{delete } s_i \ dp[i][j-1] + 1 & \text{insert } t_j \ dp[i-1][j-1] + [s_i \ne t_j] & \text{match or substitute} \end{cases}$$

with base cases $dp[0][j] = j$ (insert $j$ characters into the empty prefix) and $dp[i][0] = i$ (delete $i$ characters).

Edit distance is the canonical two-sequence / sequence alignment DP. Variants share the same shape -- a 2D table with transitions from three neighboring cells -- but differ in cost structure:

  • Longest common subsequence (LCS). Counts matches rather than minimizes operations.
  • Weighted edit distance. Per-operation costs $c_{\text{ins}}, c_{\text{del}}, c_{\text{sub}}$; Skiena 8.2.4.
  • Needleman-Wunsch global alignment. Biosequence alignment; adds match score and gap penalty.
  • Smith-Waterman local alignment. Allows "reset to 0," finds the best locally-aligned region.
  • Affine gap penalties. Opening a gap costs more than extending it; requires three DP layers (match, gap_s, gap_t).
  • Damerau-Levenshtein. Adds transposition of adjacent characters at cost 1.

Why It Matters Here

Edit distance is the hinge problem for two-sequence DP. Once you understand it, LCS, diff, DNA alignment, spell-checker ranking, and many NLP similarity scores become the same problem with different costs. The 2D table and the three-direction transition are worth internalizing deeply; you will see them transformed many times in later modules.

Its complexity is $O(mn)$ time and $O(mn)$ space for the full table, $O(\min(m, n))$ space with rolling + Hirschberg for reconstruction. For $m = n = 10^5$ this is tight but feasible; above that you need more specialized structures (bit-parallel Myers, banded alignment, or approximate distances).

Concrete Examples

Example 1: s = "kitten", t = "sitting". Edit distance = $3$ (substitute k->s, substitute e->i, insert g).

Full $dp$ table (rows indexed by $s$, columns by $t$):

εsitting
ε01234567
k11234567
i22123456
t33212345
t44321234
e55432234
n66543323

Answer $dp[m][n] = 3$.

Reconstruction. From $dp[6][7]$, at each step choose the predecessor whose value matches the recurrence. Walking back yields: insert g, match n, match i, substitute e->i, match t, match t, match i, substitute k->s -- reported as an edit script in the original direction.

Example 2: LCS via the same table shape.

$$\mathrm{lcs}[i][j] = \begin{cases} \mathrm{lcs}[i-1][j-1] + 1 & s_i = t_j \ \max\big(\mathrm{lcs}[i-1][j],\ \mathrm{lcs}[i][j-1]\big) & s_i \ne t_j \end{cases}$$

For $s = \texttt{"ABCBDAB"}$, $t = \texttt{"BDCABA"}$, $\mathrm{lcs}[7][6] = 4$, realized by BCBA or BDAB.

Top-down memoized edit distance (same recurrence, top-down form):

@lru_cache(None)
def ed(i, j):
if i == 0: return j
if j == 0: return i
if s[i-1] == t[j-1]:
return ed(i-1, j-1)
return 1 + min(ed(i-1, j), ed(i, j-1), ed(i-1, j-1))

For ed(6,7) with kitten/sitting, only the reachable $(m+1)(n+1)=56$ states get computed -- identical to the tabulated version, but driven by input.

Example 3: Affine gap penalties. A standard model in biological alignment: gap opening cost $o$, gap extension cost $e$, with $o \gg e$. Three DP layers are needed:

$$M[i][j] = \max\big(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]\big) + \sigma(s_i, t_j),$$ $$X[i][j] = \max\big(M[i-1][j] - o,\ X[i-1][j] - e\big),$$ $$Y[i][j] = \max\big(M[i][j-1] - o,\ Y[i][j-1] - e\big),$$

where $M$ tracks "ending with a match," $X$ tracks "ending with gap in $t$," $Y$ tracks "ending with gap in $s$." Space still $O(mn)$ per layer but the recurrence shape is a direct generalization of edit distance.

Common Confusion / Misconceptions

"LCS and edit distance are the same." They are not. LCS counts matches, edit distance counts minimum operations. They are related when the cost scheme is specific: $\mathrm{edit}(s, t) = m + n - 2 \cdot \mathrm{lcs}(s, t)$ only when substitutions cost $2$ (effectively a delete + insert), not $1$. Under unit-cost substitution they are different DPs with different recurrences.

"Off-by-one on base cases." Index $0$ must represent the empty prefix, not the first character. Many buggy implementations drop the ε row/column and produce subtly wrong answers. Always size the table as $(m+1) \times (n+1)$.

"Symmetry means direction doesn't matter." Edit distance is symmetric: $\mathrm{edit}(s, t) = \mathrm{edit}(t, s)$. But the reconstruction is direction-sensitive -- transforming $s$ to $t$ emits different edit operations than the reverse. Decide which direction you need.

"Rolling arrays still let you reconstruct." Not directly. Rolling reduces space to $O(\min(m,n))$ but loses the choice-history needed to walk back. For reconstruction in linear space, use Hirschberg's divide-and-conquer: find a middle column match, split, recurse, stitch.

"Damerau-Levenshtein is a minor tweak." With transpositions, the recurrence gains a fourth case: $dp[i-2][j-2] + 1$ when $s_{i-1..i} = t_{j,j-1}$. This is more than cosmetic -- it adds a two-step lookback and changes the reconstruction grammar.

How To Use It

  1. Allocate a $(m+1) \times (n+1)$ table.
  2. Initialize $dp[0][j] = j$ and $dp[i][0] = i$.
  3. Fill row by row, left to right, using the three-way min.
  4. Read the answer from $dp[m][n]$.
  5. For reconstruction, at each cell store which of the three branches won, then walk back.
  6. Space reduction: only the previous row is read, so two rows (or one row plus one scalar for the diagonal term) suffice if you do not need reconstruction.
  7. For memory-tight alignment with reconstruction, switch to Hirschberg: $O(\min(m,n))$ space, $O(mn)$ time, same answer.

Transfer / Where This Shows Up Later

  • S3 Software Design. diff, patch, and structural merge tools are edit-distance engines with tuned costs (line moves weighted heavier than char edits). The Strategy pattern naturally parameterizes cost functions.
  • S5 Networking. Fuzzy matching on protocol fields for intrusion detection uses edit-distance variants with bounded bandwidth.
  • S6 Databases. Record deduplication and fuzzy joins ("Levenshtein distance < 3") use this exact DP as a similarity metric.
  • S7 Architecture. Schema migration planning -- minimum-cost sequence of migrations between two schema versions -- is an edit-distance-in-disguise when migrations have per-step costs.
  • Phase 7 AI. Sequence-to-sequence models (Transformers) internalize a soft version of alignment; DTW (dynamic time warping) for time series is edit distance with warp costs. CTC loss in speech recognition is an edit-distance-style marginalization.

Check Yourself

  1. Write the recurrence from memory, including base cases.
  2. Why does the diagonal transition cost $0$ when $s_i = t_j$ and $1$ otherwise?
  3. How would you modify the DP to allow transpositions (swap adjacent characters) at cost 1? What extra state, if any, does it add?
  4. Prove edit distance is a metric (non-negativity, symmetry, identity, triangle inequality).
  5. Explain why LCS and edit distance under unit-cost substitution can disagree.
  6. How does Hirschberg recover the alignment in $O(\min(m,n))$ space?

Mini Drill or Application

  1. Implement edit distance with a full $(m+1) \times (n+1)$ table and reconstruction that prints the edit script.
  2. Implement the two-row space-reduced version (no reconstruction). Verify it matches on 20 random inputs.
  3. Implement LCS with reconstruction.
  4. Extend edit distance to allow per-operation costs $c_{\text{ins}}, c_{\text{del}}, c_{\text{sub}}$. Verify that setting $c_{\text{sub}} = 2$ and others $= 1$ recovers the LCS-based identity on random strings.
  5. Implement Needleman-Wunsch alignment with a substitution matrix (BLOSUM-like) and affine gap penalties; align two short amino-acid sequences.
  6. Implement a 1000-word spell checker that ranks candidates by edit distance to a misspelled input; time it on a 20k-word dictionary.

Read This Only If Stuck