CS440/ECE448 MP 7

Loading [MathJax]/jax/output/HTML-CSS/jax.js

CS440/ECE448 Fall 2023

MP 7: HMM POS tagging

Due date: Monday October 23rd, 11:59pm

For this MP, you will implement part of speech (POS) tagging using an HMM model. Make sure you understand the algorithm before you start writing code, e.g. look at the lectures on Hidden Markov Models and Chapter 8 of Jurafsky and Martin.

General guidelines

Basic instructions are the same as in previous MPs:

For general instructions, see the main MP page and the course policies.
You may use numpy (though it's not needed). You may not use other non-standard modules (including nltk).

Problem Statement

The MP7 code reads data from two files. Your tagging function will be given the training data with tags and the test data without tags. Your tagger should use the training data to estimate the probabilities it requires, and then use this model to infer tags for the test input. The main mp7 function will compare these against the correct tags and report your accuracy.

The data is divided into sentences. Your tagger should process each sentence independently.

You will need to write two tagging functions:

Baseline
Viterbi: HMM tagger (version 1)

Mp8 will involve creating 2 more versions of the HMM tagger.

The materials

The code package (and data) for the MP (template.zip) contains these code files:

mp7.py (do not change)
utils.py (do not change)
baseline.py
viterbi/viterbi_1.py
test_viterbi
- test_viterbi.py
- utils.py (do not change)

The code package contains the following training and development data:

Brown corpus:

data/brown-training.txt
data/brown-dev.txt

Small synthetic testing dataset (for test_viterbi.py):

data/mttest-training.txt
data/mttest-dev.txt

We provide a small example that will be useful for debugging viterbi_1.py in test_viterbi and data/mttext-*.txt. More details about this example will be provided below (see this section of the instructions) and we strongly recommend you use it.

You should use the provided training data to train the parameters of your model and the development sets to test its accuracy.

To run the code on the Brown corpus data you need to tell it where the data is and which algorithm to run, either baseline or viterbi_1:

python3 mp7.py --train data/brown-training.txt --test data/brown-dev.txt --algorithm [baseline, viterbi_1]

The program will run the algorithm and report three accuracy numbers:

overall accuracy
accuracy on words that have been seen with multiple different tags
accuracy on unseen words

Many words in our datasets have only one possible tag, so it's very hard to get the tag wrong! This means that even very simple algorithms have high overall accuracy. The other two accuracy numbers will help you see where there is room for improvement.

Code Submission

You will be submitting two files to Gradescope: baseline.py and viterbi_1.py. Each of these submit points times out after 10 minutes. Bear in mind that your laptop may be faster than the processor on Gradescope.

The Gradescope autograder will run your code on the development dataset from the supplied template and also a separate (unseen) set of data from the same corpus. In addition, your code will be tested on one or more hidden datasets that are not available to you, which may have different number of tags and words from the ones provided to you.

Do NOT hardcode any of your important computations, such as transition probabilities and emission probabilities, number or name of tags, and etc.

Tagset

The following is an example set of 16 part of speech tags. This is the tagset used in the provided Brown corpus. Remember, you should not hardcode anything regarding this tagset because we will test your code on another dataset with a different tagset.

ADJ adjective
ADV adverb
IN preposition
PART particle (e.g. after verb, looks like a preposition)
PRON pronoun
NUM number
CONJ conjunction
UH filler, exclamation
TO infinitive
VERB verb
MODAL modal verb
DET determiner
NOUN noun
PERIOD end of sentence punctuation
PUNCT other punctuation
X miscellaneous hard-to-classify items

The provided code converts all words to lowercase when it reads in each sentence. It also adds two dummy words at the ends of the sentnce: START (tag START) and END (tag END). These tags are just for standardization; they will not be considered in accuracy computations.

Since the first word in each input sentence is always START, the initial probabilities for your HMM can have a very simple form, which you can hardcode into your program. Your code only needs to learn the transition probabilities and the emission probabilities. The transition probabilities from START to the next tag will encode the probability of starting a sentence with the various different tags.

Baseline tagger

The Baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it consistently gives w the tag that was seen most often. For unseen words, it should guess the tag that's seen the most often in training dataset.

A correctly working baseline tagger should get about 93.9% accuracy on the Brown corpus development set, with over 90% accuracy on multitag words and over 69% on unseen words.

DO NOT IMPROVE THE BASELINE CODE.

Viterbi_1

The Viterbi tagger should implement the HMM trellis (Viterbi) decoding algorithm as seen in lecture or Jurafsky and Martin. That is, the probability of each tag depends only on the previous tag, and the probability of each word depends only on the corresponding tag. This model will need to estimate three sets of probabilities:

Initial probabilities (see note about hardcoding initial probabilties in section Tagset)
Transition probabilities (How often does tag $t_b$ follow tag $t_a$ ?)
Emission probabilities (How often does tag t yield word w?)

It's helpful to think of your processing in five steps:

Count occurrences of tags, tag pairs, tag/word pairs.
Compute smoothed probabilities.
Take the log of each probability.
Construct the trellis. Notice that for each tag/time pair, you must store not only the probability of the best path but also a pointer to the previous tag/time pair in that path.
Return the best path through the trellis by backtracing.

You'll need to use smoothing to get good performance. Make sure that your code for computing transition and emission probabilities never returns zero. Laplace smoothing is a good choice for a smoothing method.

For example, to smooth the emission probabilities, consider each tag individually. For some tag T, you need to ensure that $P_e(W|T)$ produces a non-zero number no matter what word W you give it. You can use Laplace smoothing (as in MP 1) to fill in a probability for "UNKNOWN," to use as the shared probability value for all words W that were not seen in the training data for said tag T. The emission probabilities $P_e(W|T)$ should add up to 1 when we keep T fixed but sum over all words W (including UNKNOWN). Mathematically,

$V_T$ = number of unique words seen in training data for tag T
$n_T$ = total number of words in training data for tag T
$P_e($ UNKNOWN $|T)$ = $\frac {\alpha}{ n_T + \alpha(V_T+1)}$
$P_e(W|T)$ = $\frac {\text{count}(W) + \alpha}{ n_T + \alpha(V_T+1)}$

Now repeat the Laplace smoothing process to smooth the emission probabilities for all the other tags, one tag at a time. For this initial implementation of Viterbi, you will use the same Laplace smoothing constant $\alpha$ for all tags.

Similarly, to smooth the transition probabilities, consider one specific tag T. Use Laplace smoothing to fill any zeroes in the probabilities of which tags can follow T. The transition probabilities $P_e(T_b|T)$ should add up to 1 when we keep T constant and sum across all following tags $T_b$ . Now repeat this process, replacing T with all the other possible first tags. The Laplace smoothing constant $\alpha$ should be the same for all first tags, but it might be different from the constant that you used for the emission probabilities.

This simple version of Viterbi will perform worse than the baseline code for the Brown development dataset (somewhat over 93% accuracy). However, you should notice that it's doing better on the multiple-tag words (e.g. over 93.5%). You should write this simple version of Viterbi as the viterbi_1 function in viterbi_1.py.

You will continue to improve your Viterbi implementation in MP8.

DO NOT ATTEMPT TO IMPROVE VITERBI 1.

Writing Viterbi_1 Correctly

To get through all the parts of the MP, your Viterbi implementation must be computing the trellis values correctly and also producing the output tag sequence by backtracing. This is very much like the maze search in MP1, where you searched forwards to get from the start to the goal, and then traced backwards to generate the best path to return. If you are producing the output POS sequence as you compute trellis values, you're using an incorrect ("greedy") method.

To help you isolate and debug processing steps 4. and 5. of the Viterbi implementation, we have provided a copy of the small example from Jurafsky and Martin (section 8.4.6). As a reminder, the relevant files are:

test_viterbi/*
data/mttest-*.txt

For this example, the initial, transition, and emission probabilities are all provided for you (and you can view them in mttest-training.txt). This will allow you to test your trellis computation and your backtracking code in isolation (in other words, the part of your viterbi_1(train,test) function that comes after computing the init_prob, emit_prob, trans_prob).

DO NOT SUBMIT TEST_VITERBI TO GRADESCOPE. THIS IS A DEBUGGING TOOL...

... but we highly recommend you use it.

How to use test_viterbi:

Copy your file viterbi_1.py into the test_viterbi directory.
Run cd test_viterbi
Run python3 test_viterbi.py

Visually check the text output. If your terminal has an output like this:


          Your Output is: [[('Janet', 'NNP'), ('will', 'MD'), ('back', 'VB'), ('the', 'DT'), ('bill', 'NN')]] 
Expected Output is: [("Janet","NNP"), ("will","MD"), ("back","VB"), ("the","DT"), ("bill","NN")]

then you're in good shape. Continue to work on the assignment and submit your code to Gradescope when you're ready.

When testing on the real dataset, you may find it helpful to make a very small file of test sentences. You can then print out the v and b values for each word. If you look at the emission probabilities or the v array values for a word, the correct tag should be one of the top couple values. Obviously unsuitable tags should have much lower values.

You can also print the matrix of tag transitions. It's not easy to be sure transitions should be highest. However, you can identify some transitions that should have very low values, e.g. transitions from anything to the START tag or to the X tag.

Some additional hints for making your code produce correct answers and run fast enough to complete on the autograder:

Do probability computations using logs. Remember that math.log is faster than numpy.log when dealing with individual calculations.
The trellis is described as an array in the lecture notes, but you can use any data structure to store it. Dictionaries work very well in Python.
Avoid doing random access into python lists. Either use a "for" construction to walk down the list in order, or use some other data structure (e.g. a dictionary).
In principle, the test data could contain a tag that you have not seen in the training data. Do not worry about this possibility because there is no way to guess the right answer. In a real linguistic situation, this would probably be due to a tagging error in the test data.
If you suspect a bug in your backtracing code, it may be helpful to temporarily replace backtracing with the greedy method.
Once you've computed a column of the Viterbi array, you do not need to keep the probability values from the previous columns. It's sufficient to keep the backpointer values.