
CS440/ECE448 Fall 2018

Assignment 5: HMM POS tagging

Due date: Monday October 28th, 11:55pm

Margaret Fleck, Renxuan Wang, Tiantian Fang, Edward Huang
adapted from a U. Penn assignment


For this MP, you will implement part of speech (POS) tagging using an HMM model. Make sure you understand the algorithm before you start writing code; for example, review lectures 20 and 21 and Chapter 8 of Jurafsky and Martin.

General guidelines

Basic instructions are the same as in previous MPs:

  • For general instructions, see the main MP page and the course policies.
  • Extra credit is available, capped at 10%.
  • You may use numpy (though it's not needed). You may not use other non-standard modules (including nltk).

Problem Statement

The mp4 code reads data from two files. Your tagging function will be given the training data with tags and the test data without tags. Your tagger should use the training data to estimate the probabilities it requires, and then use this model to infer tags for the test input. The main mp4 function will compare these against the correct tags and report your accuracy.

The data is divided into sentences. Your tagger should process each sentence independently.

You will need to write two tagging functions:

  • Baseline
  • Viterbi: HMM tagger

The materials

The code package for the MP (tar, zip) contains three files:

  • mp4.py
  • utils.py
  • viterbi.py

You will be submitting a modified version of viterbi.py, which must work with our provided versions of the other two files.

The code package also contains three sets of training and development data:

  • Brown corpus
  • MASC corpus (from the Open American National Corpus)
  • Both, i.e. the union of the two previous datasets

The provided code converts all words to lowercase. You should test your code on all three datasets.

You should use the training data to train the parameters of your model and the development sets to test its accuracy. We will use a separate (unseen) set of data to test your code after you submit it.

Here is an example of how to run the code on the Brown corpus data:

python3 mp4.py --train data/brown-training.txt --test data/brown-dev.txt 

The program will run both baseline and Viterbi algorithms, reporting three accuracy numbers:

  • overall
  • on words that have been seen with multiple different tags
  • on unseen words

Many words in our datasets have only one possible tag, so it's very hard to get the tag wrong! This means that even very simple algorithms have high overall accuracy. The other two accuracy numbers will help you see where there is room for improvement.

Tagset

For this MP, we will use the following set of 16 part of speech tags:

  • ADJ adjective
  • ADV adverb
  • IN preposition
  • PART particle (e.g. after verb, looks like a preposition)
  • PRON pronoun
  • NUM number
  • CONJ conjunction
  • UH filler, exclamation
  • TO infinitive
  • VERB verb
  • MODAL modal verb
  • DET determiner
  • NOUN noun
  • PERIOD end of sentence punctuation
  • PUNCT other punctuation
  • X miscellaneous hard-to-classify items

Baseline tagger

The Baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it consistently gives w the tag that was seen most often. For unseen words, it should guess the tag NOUN.
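
A minimal sketch of this scheme follows. The function signature and data layout are illustrative (assuming each training sentence is a list of (word, tag) pairs); the real interface is whatever the provided code expects.

    from collections import Counter, defaultdict

    def baseline(train, test):
        # train: list of sentences, each a list of (word, tag) pairs
        # test: list of sentences, each a list of words
        tag_counts = defaultdict(Counter)     # word -> Counter of its tags
        for sentence in train:
            for word, tag in sentence:
                tag_counts[word][tag] += 1
        result = []
        for sentence in test:
            tagged = []
            for w in sentence:
                if w in tag_counts:
                    tagged.append((w, tag_counts[w].most_common(1)[0][0]))
                else:
                    tagged.append((w, "NOUN"))   # unseen word: guess NOUN
            result.append(tagged)
        return result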

A correctly working baseline tagger should get about 94% accuracy on the Brown corpus development set.

DO NOT ATTEMPT TO IMPROVE THE BASELINE CODE.

Part 1 Viterbi

The Viterbi tagger should implement the HMM trellis (Viterbi) decoding algorithm as seen in lecture or in Jurafsky and Martin. That is, the probability of each tag depends only on the previous tag, and the probability of each word depends only on the corresponding tag. This model will need to estimate three sets of probabilities (a counting sketch follows the list):

  • Initial probabilities (How often does each tag occur at the start of a sentence?)
  • Transition probabilities (How often does tag t_b follow tag t_a?)
  • Emission probabilities (How often does tag t yield word w?)
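
A single counting pass over the training data can gather everything these estimates need. This is a sketch, again assuming each training sentence is a list of (word, tag) pairs; the names are illustrative.

    from collections import Counter

    def count_events(train):
        init_counts = Counter()    # tag at the start of a sentence
        trans_counts = Counter()   # (prev_tag, tag) pairs
        emit_counts = Counter()    # (tag, word) pairs
        tag_counts = Counter()     # total occurrences of each tag
        for sentence in train:
            prev = None
            for word, tag in sentence:
                if prev is None:
                    init_counts[tag] += 1
                else:
                    trans_counts[(prev, tag)] += 1
                emit_counts[(tag, word)] += 1
                tag_counts[tag] += 1
                prev = tag
        return init_counts, trans_counts, emit_counts, tag_counts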

It's helpful to think of your processing in five steps (a decoding sketch for steps 4 and 5 follows this list):

  • Count occurrences of tags, tag pairs, and tag/word pairs.
  • Compute smoothed probabilities.
  • Take the log of each probability.
  • Construct the trellis. Notice that for each tag/time pair, you must store not only the probability of the best path but also a pointer to the previous tag/time pair in that path.
  • Return the best path through the trellis.
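
For steps 4 and 5, one possible shape of the decoder is sketched below. The function name, argument layout, and data structures are illustrative, not the required interface: log_init and log_trans are assumed to be dictionaries of log probabilities, and log_emit a function returning a smoothed log emission probability.

    def viterbi_decode(words, tags, log_init, log_trans, log_emit):
        # First column of the trellis: (score, backpointer) per tag.
        trellis = [{t: (log_init[t] + log_emit(t, words[0]), None) for t in tags}]
        for w in words[1:]:
            prev_col = trellis[-1]
            column = {}
            for t in tags:
                # Best way to reach tag t at this position, plus a backpointer.
                score, back = max((prev_col[p][0] + log_trans[(p, t)], p)
                                  for p in tags)
                column[t] = (score + log_emit(t, w), back)
            trellis.append(column)
        # Recover the best path by following backpointers from the best final tag.
        best = max(tags, key=lambda t: trellis[-1][t][0])
        path = [best]
        for column in reversed(trellis[1:]):
            best = column[best][1]
            path.append(best)
        path.reverse()
        return list(zip(words, path))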

You'll need to use smoothing to get good performance. Make sure that your code for computing the three types of probabilities never returns zero. Simple versions of Laplace smoothing should work fine for the initial and transition probabilities. The difficult part is handling the emission probabilities.

To smooth the emission probabilities, consider each tag individually. For a fixed tag T, you need to ensure that P_e(W|T) produces a non-zero number no matter what word W you give it. You can use Laplace smoothing (as in MP 3) to fill in a probability for "UNKNOWN", which will be the return value for all words W that were not seen in the training data. For this initial implementation of Viterbi, use the same Laplace smoothing constant α for all tags.
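
Concretely, one way to write such a smoothed estimate (a sketch; the helper name and parameters are illustrative, with n_words_for_tag being the number of distinct words seen with this tag in training):

    import math

    def log_emission(emit_counts, tag_counts, alpha, tag, word, n_words_for_tag):
        # Laplace smoothing: every word, including unseen ones, gets a
        # non-zero probability.  The +1 in the denominator reserves one
        # slot of probability mass for "UNKNOWN" words, whose numerator
        # is just alpha.
        numerator = emit_counts.get((tag, word), 0) + alpha
        denominator = tag_counts[tag] + alpha * (n_words_for_tag + 1)
        return math.log(numerator / denominator)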

This simple version of Viterbi will perform worse than the baseline code. However, you should notice that it does better on words that have been seen with multiple different tags.

Part 2 Viterbi

The Part 1 Viterbi tagger fails to beat the baseline because it does very poorly on unseen words. It's assuming that all tags have similar probability for these words, but we know that a new word is much more likely to have the tag NOUN than (say) CONJ. For this part, you'll improve your emission smoothing to match the real probabilities for unseen words.

Words that occur only once in the training data ("hapax" words) have a distribution similar to the words that appear only in the test/development data. Extract these words from the training data and calculate the probability of each tag on them. When you do your Laplace smoothing of the emission probabilities for tag T, scale α by the probability of tag T in the set of hapax words. This version of the Viterbi code should have an unseen word accuracy close to that of the baseline algorithm, and a significantly better overall accuracy.

The hapax word tag probabilities may be different from one dataset to another. So your Viterbi code should compute them dynamically from its training data each time it runs.
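
A minimal sketch of computing this distribution from the training data (names illustrative, same (word, tag) assumption as above):

    from collections import Counter

    def hapax_tag_probs(train):
        # Tag distribution over words that occur exactly once in training.
        word_counts = Counter(w for sentence in train for w, _ in sentence)
        hapax_tags = Counter(t for sentence in train for w, t in sentence
                             if word_counts[w] == 1)
        total = sum(hapax_tags.values())
        return {t: c / total for t, c in hapax_tags.items()}

When scaling α for tag T, note that tags which never occur among the hapax words still need a small non-zero scale factor (e.g. a tiny floor value); otherwise their smoothed emission probability for unseen words would collapse to zero.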

Extra Credit

The task for extra credit is to maximize the accuracy of the Viterbi code. You must train on only the provided training set (no external resources) and you should keep the basic Viterbi algorithm. However, you can make any algorithmic improvements you like.

We recommend trying to improve the algorithm's ability to guess the right tag for unseen words. If you examine the set of hapax words in the training data, you should notice that words with certain prefixes and certain suffixes typically have certain limited types of tags. For example, words with suffix "-ly" have several possible tags but the tag distribution is very different from that of the full set of hapax words. You can do a better job of handling these words by changing the emissions probabilities generated for them.
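
For instance, once you have the hapax (word, tag) pairs, you could inspect the tag distribution for a particular suffix with a small helper like this sketch (names illustrative):

    from collections import Counter

    def suffix_tag_probs(hapax_pairs, suffix):
        # Tag distribution over hapax words ending with the given suffix.
        tags = Counter(t for w, t in hapax_pairs if w.endswith(suffix))
        total = sum(tags.values())
        return {t: c / total for t, c in tags.items()}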

It is extremely hard to predict useful prefixes and suffixes from first principles. We strongly recommend building yourself a separate Python tool to dump the hapax words, with their tags, into a separate file that you can inspect (a sketch follows).
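
Such a tool can be as small as this sketch (a standalone inspection helper, not part of the graded interface):

    from collections import Counter

    def dump_hapax(train, path):
        # Write each hapax word and its tag, one per line, to a file.
        word_counts = Counter(w for sentence in train for w, _ in sentence)
        with open(path, "w") as f:
            for sentence in train:
                for w, t in sentence:
                    if word_counts[w] == 1:
                        f.write(w + "\t" + t + "\n")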

We've been able to push the accuracy for unseen words above 76% on the Brown corpus data.

It may also be possible to improve performance by using two previous tags (rather than just one) to predict each tag. A full version of this idea would use 256 separate tag pairs and may be too slow to run on the autograder. However, you may be able to gain accuracy by using selected information from the first of the two tags, e.g. whether it's a content or function word. We haven't explored these options but you are welcome to do so.

Code Submission

You should submit your viterbi.py file on Gradescope. If you have made local modifications to utils.py and/or mp4.py, make sure that your viterbi.py code works with the original provided versions.

There will be three submit points:

  • Basic tests (quick but incomplete)
  • Additional tests
  • Extra credit (with optional leaderboard)

You should submit the same code for the Basic and Additional tests. Your submission for the extra credit part may also be the same, or it could be different code.