CS440/ECE448 Fall 2025
MP 7: HMM Spelling Correction
Due date: Friday October 31st, 11:59pm
For this MP, you will implement spelling correction using an HMM model. Make sure you understand the algorithm before you start writing code, e.g. look at the lectures on Hidden Markov Models and Sections 17.1, 17.2, and 17.4 of Jurafsky and Martin.
Basic instructions are the same as in previous MPs.
In this MP, you will implement a simplified spelling correction system. For instance, if you type "intellifence", it is likely that you originally meant "intelligence".
Spelling correction, when not considering character insertions or deletions, can be viewed as a sequence labeling problem. The observed sequence is the sequence of characters in the (possibly) misspelled word ("intellifence"), and the hidden sequence is the sequence of characters in the correctly spelled word. Your goal is to find the most likely sequence of hidden characters given the observed characters, which can be done using the Viterbi algorithm covered in class.
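For instance, since insertions and deletions are not modeled, the observed and hidden sequences have the same length and align position by position. A quick illustration in Python:

# Character-by-character alignment of the observed (typed) word and the
# hidden (intended) word -- same length, one hidden state per character.
observed = "intellifence"
hidden = "intelligence"
for o, h in zip(observed, hidden):
    print(o, "->", h)   # 'f' -> 'g' is the one substitution error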
The provided script for MP7 (mp7.py) reads data from two files (--train and --test). Your viterbi_1() function (in viterbi_1.py) will be given these data; it should call the training() function to compute the parameters of your HMM from the training data, then run the Viterbi algorithm on the test data and return the predicted correct spellings. Note: You are REQUIRED to implement the `viterbi_stepforward()` function, which computes a single column of the Viterbi trellis. This function will be called by your `viterbi_1()` function, and we will test it separately.
The provided script will compute the character error rate (CER) between the ground-truth answer and your HMM's predictions. CER is the number of inserted, deleted, and substituted characters in a word divided by the number of characters in the word; lower is better. The autograder will show your CER and the passing threshold for each test case.
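To make the metric concrete, here is a sketch of a Levenshtein-based CER; the autograder's exact implementation may differ.

# Illustrative CER computation: standard Levenshtein edit distance divided
# by the length of the reference word. Not necessarily the grader's exact code.
def cer(reference: str, prediction: str) -> float:
    m, n = len(reference), len(prediction)
    # dp[i][j] = edit distance between reference[:i] and prediction[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # deletions
    for j in range(n + 1):
        dp[0][j] = j          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == prediction[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n] / max(m, 1)

print(cer("intelligence", "intellifence"))  # 1 error / 12 chars, about 0.083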
You will need to fill in the three functions in viterbi_1.py: training(), viterbi_stepforward(), and viterbi_1().
The code package (and data) for the MP (template.zip) contains the code files, including mp7.py (the top-level script), viterbi_1.py (which you will complete and submit), and test_viterbi.py (a debugging tool described below).
The code package also contains training and development data, including data/tiny_typos_train.txt and data/tiny_typos_test.txt (used in the example command below) and the simple_typos files referenced in the CER targets below.
We provide a small example in test_viterbi.py and test_data/test.txt that will be useful for debugging viterbi_1.py. More details about this example are provided below, and we strongly recommend you use it.
You should use the provided training data to train the parameters of your model and the development sets to test its accuracy.
To run the code on the data, you need to tell it where the data is:
python3 mp7.py --train data/tiny_typos_train.txt --test data/tiny_typos_test.txt
The program will run the algorithm and report one metric: the character error rate (CER) described above.
You will be submitting one file to Gradescope: viterbi_1.py. The submission times out after 10 minutes. Bear in mind that your laptop may be faster than the processor on Gradescope.
The Gradescope autograder will run your code on the development dataset from the supplied template and also a separate (unseen) set of data from the same corpus. In addition, your code will be tested on one or more hidden datasets that are not available to you, which may have a different set of characters from the ones provided to you.
Do NOT hardcode any of your important computations, such as the transition probabilities, the emission probabilities, the character set, etc.
The set of possible hidden states in your HMM corresponds to the intended (correct) characters in a word, for example the lowercase letters a-z together with the ^ and $ markers described below. Remember: you should not hardcode anything about this character set, because we will test your code on other datasets with different alphabets or symbols.
The provided code converts all words to lowercase as it reads in the data. It also adds two dummy characters at the ends of each word: ^ (START) and $ (END). These markers are just for standardization; they will not be considered in accuracy computations.
Since every word begins with ^, the initial probabilities for your HMM can have a very simple form, but you should not hardcode it into your program. Your code also needs to learn the transition probabilities (between correct characters) and the emission probabilities (mapping each correct character to a possibly mistyped observed character). The transitions from ^ encode the likelihood of starting a word with each possible first letter.
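In other words, if you estimate the initial probabilities from first-character counts,

\( P_I(h) = \frac{\text{count}(h \text{ as the first character of a word})}{\text{number of training words}} \)

then (up to smoothing) all of the mass lands on ^, since every word starts with it. (The notation \(P_I\) is ours; in the code this corresponds to init_prob.)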
Your spelling correction system should implement the HMM trellis (Viterbi) decoding algorithm as seen in lecture or in Jurafsky and Martin. That is, the probability of each intended character depends only on the previous intended character, and the probability of each observed character depends only on the corresponding intended character. This model will need to estimate three sets of probabilities: initial probabilities, transition probabilities, and emission probabilities.
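Concretely, writing the intended characters as \(h_1 \ldots h_T\) and the observed characters as \(o_1 \ldots o_T\) (notation ours, with \(P_I\), \(P_T\), and \(P_e\) for the initial, transition, and emission probabilities), the Viterbi algorithm finds the hidden sequence maximizing

\( P(h_{1:T}, o_{1:T}) = P_I(h_1)\, P_e(o_1 \mid h_1) \prod_{t=2}^{T} P_T(h_t \mid h_{t-1})\, P_e(o_t \mid h_t) \)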
It's helpful to think of your processing in five steps (step 1 is sketched in code after this list):
1. Count occurrences in the training data: how often each intended character starts a word, each transition between adjacent intended characters, and each (intended, observed) character pair.
2. Estimate the initial, transition, and emission probabilities from these counts, with smoothing (see below).
3. Take the log of each probability, so that products of many small numbers become sums.
4. Construct the trellis, filling it in one column at a time.
5. Return the best path through the trellis by backtracing.
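For instance, step 1 might look like this on a toy example (the data structures here are assumptions for illustration, not the template's required format):

from collections import Counter, defaultdict

# Hypothetical illustration of step 1 on a toy "corpus" of one
# (observed, intended) pair, already wrapped in ^ and $.
pairs = [("^intellifence$", "^intelligence$")]

trans_counts = defaultdict(Counter)  # trans_counts[prev][cur]: transitions
emit_counts = defaultdict(Counter)   # emit_counts[intended][observed]: emissions
for observed, intended in pairs:
    for prev, cur in zip(intended, intended[1:]):
        trans_counts[prev][cur] += 1
    for o, h in zip(observed, intended):
        emit_counts[h][o] += 1

print(emit_counts["g"])  # Counter({'f': 1}): 'g' was typed as 'f' once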
You'll need to use smoothing to get good performance: make sure that your code for computing transition and emission probabilities never returns zero. Laplace smoothing is a good choice.
For example, to smooth the emission probabilities, consider each intended character individually. For some intended character C, you must ensure that \( P_e(O \mid C) \) is non-zero for every observed character O, including ones never seen with C in training. You can use Laplace smoothing (as in MP1) to compute a probability for UNSEEN, used as the shared probability value for all observed characters O that did not appear with C in training. For fixed C, the emission probabilities \( P_e(O \mid C) \) should sum to 1 over all observed symbols O (including UNSEEN). Mathematically:
\(V_C\) = number of unique observed characters seen with intended character C in the training data
\(n_C\) = total number of observed characters seen with intended character C in the training data
\( P_e(\text{UNSEEN} \mid C) = \frac{\alpha}{n_C + \alpha(V_C+1)} \)
\( P_e(O \mid C) = \frac{\text{count}(O) + \alpha}{n_C + \alpha(V_C+1)} \), where count(O) is the number of times O was observed with intended character C
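A minimal sketch of these formulas in Python, assuming the emission counts are kept as a dict mapping each intended character to a collections.Counter of observed characters (the data structure and the value of \(\alpha\) are assumptions, not requirements):

from collections import Counter

# Sketch of the Laplace-smoothing formulas above. emit_counts[c] is assumed
# to be a Counter over observed characters seen with intended character c.
def smoothed_emissions(emit_counts, alpha=0.001):
    emit_prob = {}
    for c, counts in emit_counts.items():
        V_c = len(counts)            # unique observed characters seen with c
        n_c = sum(counts.values())   # total observed characters seen with c
        denom = n_c + alpha * (V_c + 1)
        probs = {o: (k + alpha) / denom for o, k in counts.items()}
        probs["UNSEEN"] = alpha / denom  # shared value for unseen observations
        emit_prob[c] = probs             # values sum to 1, as required
    return emit_prob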
This simple version of Viterbi will get around 8.96% CER on simple_typos_val.txt and 0% CER on tiny_typos_test.txt. You should write this simple version of Viterbi as the viterbi_1 function in viterbi_1.py.
You will continue to improve your Viterbi implementation in MP8.
DO NOT ATTEMPT TO IMPROVE VITERBI 1.
To get through all the parts of the MP, your Viterbi implementation must compute the trellis values correctly and also produce the output character sequence by backtracing. This is very much like the maze search, where you searched forward to get from the start to the goal and then traced backward to generate the best path to return. If you are producing the output sequence as you compute trellis values, you're using an incorrect ("greedy") method.
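To make the forward-then-backtrace structure concrete, here is a minimal log-space sketch. The function and variable names are illustrative, not the template's required signatures; a real solution would do the column update inside viterbi_stepforward(). emit_p is assumed to be the smoothed dict-of-dicts from the sketch above (with an "UNSEEN" fallback key).

import math

# Illustrative forward pass + backtrace in log space.
def decode(observed, states, init_p, trans_p, emit_p):
    def log_emit(s, o):
        return math.log(emit_p[s].get(o, emit_p[s]["UNSEEN"]))

    # Column 0 of the trellis: initial probability times first emission.
    v = [{s: math.log(init_p[s]) + log_emit(s, observed[0]) for s in states}]
    back = [{s: None for s in states}]
    for t in range(1, len(observed)):        # forward pass, column by column
        col, ptr = {}, {}
        for s in states:
            best_prev, best = None, float("-inf")
            for r in states:
                score = v[t - 1][r] + math.log(trans_p[r][s])
                if score > best:
                    best_prev, best = r, score
            col[s] = best + log_emit(s, observed[t])
            ptr[s] = best_prev
        v.append(col)
        back.append(ptr)
    # Backtrace: start at the best final state and follow backpointers.
    state = max(v[-1], key=v[-1].get)
    path = [state]
    for t in range(len(observed) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return "".join(reversed(path))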
To help you isolate and debug processing steps 4 and 5 of the Viterbi implementation, we have provided a copy of the small example from Jurafsky and Martin (Section 8.4.6). As a reminder, the relevant files are test_viterbi.py, which calls your viterbi_1(train, test) function, and test_data/test.txt, which contains the example data. This example lets you check the part of your code that comes after computing init_prob, emit_prob, and trans_prob. Using it is optional, but we highly recommend it. DO NOT SUBMIT TEST_VITERBI TO GRADESCOPE. THIS IS A DEBUGGING TOOL.
How to use test_viterbi:
python3 test_viterbi.py
If it prints
Your Output is: ['^hello$']
Expected Output is: ['^hello$']
then you're in good shape. Continue to work on the assignment and submit your code to Gradescope when you're ready.
When testing on the real dataset, you may find it helpful to use the provided training/validation data. If you look at the emission probabilities or the v array values for a word, the correct character should be among the top few values, and obviously unsuitable characters should have much lower values.
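For instance, assuming the dict-of-dicts emit_prob from the smoothing sketch above, a hypothetical helper like this prints the most likely observations for one intended character:

# Hypothetical debugging helper: print the top-k observed characters for a
# given intended character. The correct character should rank near the top.
def top_emissions(emit_prob, intended, k=5):
    ranked = sorted(emit_prob[intended].items(), key=lambda kv: kv[1], reverse=True)
    for observed, p in ranked[:k]:
        print(f"P_e({observed!r} | {intended!r}) = {p:.4f}")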
You can also print the matrix of transition probabilities. It's not easy to be sure which transitions should be highest, but you can identify some transitions that should have very low values, e.g. transitions from any character to ^ (START).
Some additional hints for making your code produce correct answers and run fast enough to complete on the autograder: