 CS 440/ECE 448
Margaret Fleck

## Edit distance

A good search algorithm must be coupled with a good representation of the problem. So transforming to a different state graph model may dramatically improve performance. We'll illustrate this with finding edit distance, and then see another example when we get to classical planning.

Problem: find the best alignment between two strings, e.g. "ALGORITHM" and "ALTRUISTIC". By "best," we mean the alignment that requires the fewest editing operations (delete char, add char, substitute char) to transform the first word into the second. We can get from ALGORITHM to ALTRUISTIC with six edits, producing the following optimal alignment:

     A L G O R   I   T H M
A L   T R U I S T I C
D S   A   A   S S

D = delete character
S = substitute


Versions of this task appear in

• theory classes (example of dynamic programming)
• computational biology (matching DNA sequences)
• evaluating speech recognizers (comparing machine transcription to correct one)
• machine translation (aligning text in two different languages to match related words)

### Naive method

Each state is a string. So the start state would be "algorithm" and the end state is "altruistic." Each action applies one edit operation, anywhere in the word.

In this model, each state has a very large number of successors. So there are about $$28^9$$ single-character changes you could make to ALGORITHM, resulting in words like:

    LGORITHM, AGORITHM, ..., ALGORIHM, ...
BLGORITHM, CLGORITHM, ....
AAGORITHM, ....
...
ALBORJTHM, ...
....


This massive fan-out will cause search to be very slow.

### Better method

A better representation for this problem explores the space of partial matches. Each state is an alignment of the initial sections (prefixes) of the two strings. So the start state is an alignment of two empty strings, and the end state is an alignment of the full strings.

In this representation, each action extends the alignment by one character. From each state, there are only three legal moves:

• pair the next characters in the two strings
• skip a character in the first string
• skip a character in the second string

Here's part of the search tree, showing how we extend the alignment with each step. (A dot represents a skipped character.) In this specific case, we can pack our work into a nifty compact 2D table.

This algorithm can be adapted for situations where the search space is too large to maintain the full alignment table. for example, we may need to match DNA sequences that are $$10^5$$ characters long.

• We can an approximate search in which we keep only the best k alignments of each length (e.g. speech recognition).
• We can compute the edit distance but not the optimal alignment, which requires storing only two columns from the table.

For more detail, see Wikipedia on edit distance and Jeff Erickson's notes.