adapted from a U. Penn assignment For this MP, you will implement part of speech (POS) tagging using an HMM model Make sure you understand the algorithm before you start writing code, e.g. look at lectures 20 and 21 and Chapter 8 of Jurafsky and Martin. General guidelinesBasic instructions are the same as in previous MPs:
Problem StatementThe mp4 code reads data from two files. Your tagging function will be given the training data with tags and the test data without tags. Your tagger should use the training data to to estimate the probabilities it requires, and then use this model to infer tags for the test input. The main mp4 function will compare these against the correct tags and report your accuracy. The data is divided into sentences. Your tagger should process each sentence independently. You will need to write two tagging functions:
The materialsThe code package for the MP (tar, zip) contains three files:
You will be submitting a modified version of viterbi.py, which must work with our provided versions of the other two files. The code package also contains three sets of training and development data
The provided code converts all words to lowercase. You should test your code on the all three datasets. You should use the training data to train the parameters of your model and the development sets to test its accuracy. We will use a separate (unseen) set of data to test your code after you submit it. Here is an example of how to run the code on the Brown corpus data: python3 mp4.py --train data/brown-training.txt --test data/brown-dev.txt The program will run both baseline and Viterbi algorithms, reporting three accuracy numbers:
Many words in our datasets have only one possible tag, so it's very hard to get the tag wrong! This means that even very simple algorithms have high overall accuracy. The other two accuracy numbers will help you see where there is room for improvement. TagsetFor this MP, we will use the following set of 16 part of speech tags:
Baseline taggerThe Baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it consistently gives w the tag that was seen most often. For unseen words, it should guess the tag NOUN. A correctly working baseline tagger should get about 94% accuracy on the Brown corpus development set.
Part 1 ViterbiThe Viterbi tagger should implement the HMM trellis (Viterbi) decoding algoirthm as seen in lecture or Jurafsky and Martin. That is, the probability of each tag depends only on the previous tag, and the probability of each word depends only on the corresponding tag. This model will need to estimate three sets of probabilities:
It's helpful to think of your processing in five steps:
You'll need to use smoothing to get good performance. Make sure that your code for computing the three types of probabilities never returns zero. Simple versions of Laplace smoothing should work find for the initial and transition probabilities. The difficult part is handling the emission probabilities To smooth the emission probabilities, consider each tag individually. For a fixed tag T, you need to ensure that Pe(W|T) produces a non-zero number no matter what word W you give it. You can use Laplace smoothing (as in MP 3) to fill in a probability for "UNKNOWN" which will be the return value for all words W that were not seen in the training data. For this initial implementation of Viterbi, use the same Laplace smoothing constant α for all tags. This simple version of Viterbi will perform worse than the baseline code. However you should notice that it's doing better on the multiple-tag words. Part 2 ViterbiThe Part 1 Vitebi tagger fails to beat the baseline because it does very poorly on unseen words. It's assuming that all tags have similar probability for these words, but we know that a new word is much more likely to have the tag NOUN than (say) CONJ. For this part, you'll improve your emission smoothing to match the real probabilities for unseen words. Words that occur only once in the training data ("hapax" words) have a distribution similar to the words that appear only in the test/development data. Extract these words from the training data and calculate the probability of each tag on them. When you do your Laplace smoothing of the emission probabilities for tag T, scale α by the probability of tag T in the set of hapax words. This version of the Viterbi code should have an unseen word accuracy close to that of the baseline algorithm, and a significantly better overall accuracy. The hapax word tag probabilities may be different from one dataset to another. So your Viterbi code should compute them dynamically from its training data each time it runs. Extra CreditThe task for extra credit is to maximize the accuracy of the Viterbi code. You must train on only the provided training set (no external resources) and you should keep the basic Viterbi algorithm. However, you can make any algorithmic improvements you like. We recommend trying to improve the algorithm's ability to guess the right tag for unseen words. If you examine the set of hapax words in the training data, you should notice that words with certain prefixes and certain suffixes typically have certain limited types of tags. For example, words with suffix "-ly" have several possible tags but the tag distribution is very different from that of the full set of hapax words. You can do a better job of handling these words by changing the emissions probabilities generated for them. It is extremely hard to predict useful prefixes and suffixes from first principles. We strongly recommend building yourself a separate python tool to dump the hapax words, with their tags, into a separate file that you can inspect. We've been able to push the accuracy for unseen words above 76% on the Brown corpus data. It may also be possible to improve performance by using two previous tags (rather than just one) to predict each tag. A full version of this idea would use 256 separate tag pairs and may be too slow to run on the autograder. However, you may be able to gain accuracy by using selected information from the first of the two tags, e.g. whether it's a content or function word. We haven't explored these options but you are welcome to do so. Code SubmissionYou should submit your viterbi.py file on gradescope. If you have made local modifications to utils.py and/or mp4.py, make sure that your viterbi.py code works with the original provided versions. There will be three submit points:
You should submit the same code for the Basic and Additional tests. Your submission for the extra credit part may also be the same, or it could be different code. |