Spring 2021: Kiran Ramnath, Mahir Morshed
Fall 2020: Amnon Attali, Jatin Arora
Spring 2020: Jialu Li, Guannan Guo, and Kiran Ramnath
Fall 2018: Margaret Fleck, Renxuan Wang, Tiantian Fang, Edward Huang

Adapted from a U. Penn assignment.

For this MP, you will implement part of speech (POS) tagging using an HMM model. Make sure you understand the algorithm before you start writing code, e.g. look at the lectures on Hidden Markov Models and Chapter 8 of Jurafsky and Martin.

General guidelines

Basic instructions are the same as in previous MPs:
Problem Statement

The mp4 code reads data from two files. Your tagging function will be given the training data with tags and the test data without tags. Your tagger should use the training data to estimate the probabilities it requires, and then use this model to infer tags for the test input. The main mp4 function will compare these against the correct tags and report your accuracy.

The data is divided into sentences. Your tagger should process each sentence independently.

You will need to write two main types of tagging functions:
The materials

The code package (and data) for the MP (starter_code.zip) contains code files:
The code package also contains training and development data:
The provided code converts all words to lowercase. It also adds a START and END tag for each sentence when it loads the sentence. These tags are just for standardization; they will not be considered in the accuracy computation.

You should use the provided training data to train the parameters of your model and the development sets to test its accuracy. We will use a separate (unseen) set of data from the same corpus to test your code after you submit it. In addition, your code will be tested on two hidden datasets that are not available to you and that have different numbers of tags and words from the ones provided to you. So do NOT hardcode any of your important computations, such as initial probabilities, transition probabilities, emission probabilities, or the number or names of tags. We will inspect code for hardcoded computations/values and will penalize such implementations.

To run the code on the Brown corpus data, you need to tell it where the data is and which algorithm to run: baseline, viterbi_1, viterbi_2, or viterbi_ec.

python3 mp4.py --train data/brown-training.txt --test data/brown-dev.txt --algorithm [baseline, viterbi_1, viterbi_2, viterbi_ec]

The program will run the algorithm and report three accuracy numbers:
Many words in our datasets have only one possible tag, so it's very hard to get the tag wrong! This means that even very simple algorithms have high overall accuracy. The other two accuracy numbers will help you see where there is room for improvement.

Tagset

The following is an example set of 16 part of speech tags. This is the tagset used in the provided Brown corpus. But remember that you should not hardcode anything regarding this tagset, because we will test your code on two other datasets with a different tagset.
Part 1: Baseline tagger

The baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it consistently gives w the tag that was seen most often. For unseen words, it should guess the tag that is seen most often in the training dataset. A correctly working baseline tagger should get about 93.9% accuracy on the Brown corpus development set, with over 90% accuracy on multitag words and over 69% on unseen words.
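As a rough sketch (not the staff solution), a baseline tagger along these lines might look like the following. It assumes train is a list of sentences of (word, tag) pairs and test is a list of sentences of words, as described in the problem statement; the exact function signature is determined by the starter code.

from collections import Counter, defaultdict

def baseline(train, test):
    """Most-frequent-tag baseline (illustrative sketch)."""
    word_tag_counts = defaultdict(Counter)  # word -> Counter of tags seen with it
    tag_counts = Counter()                  # overall tag frequencies
    for sentence in train:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1

    # Fallback tag for words never seen in training.
    most_common_tag = tag_counts.most_common(1)[0][0]

    tagged_sentences = []
    for sentence in test:
        tagged = []
        for word in sentence:
            if word in word_tag_counts:
                # Seen word: use the tag it occurred with most often in training.
                tagged.append((word, word_tag_counts[word].most_common(1)[0][0]))
            else:
                # Unseen word: fall back to the overall most frequent training tag.
                tagged.append((word, most_common_tag))
        tagged_sentences.append(tagged)
    return tagged_sentences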
Part 2: Viterbi_1

IF YOU FIND THIS PART DIFFICULT TO BEGIN: you may find it easier to go to the "Starting or Debugging Part 2 Viterbi" section below and first implement only the Viterbi algorithm.

The Viterbi tagger should implement the HMM trellis (Viterbi) decoding algorithm as seen in lecture or in Jurafsky and Martin. That is, the probability of each tag depends only on the previous tag, and the probability of each word depends only on the corresponding tag. This model will need to estimate three sets of probabilities:
You can assume that all sentences will begin with a START token, whose tag is START. So your initial probabilities will have a very restricted form, whether you choose to handcode appropriate numbers or learn them from the data. The initial probabilities shown in the textbook/lecture examples will be handled by transition probabilities from the START token to the first real word. It's helpful to think of your processing in five steps:
You'll need to use smoothing to get good performance. Make sure that your code for computing transition and emission probabilities never returns zero. Laplace smoothing is the method we use to smooth zero-probability cases when calculating initial probabilities, transition probabilities, and emission probabilities.

For example, to smooth the emission probabilities, consider each tag individually. For a fixed tag T, you need to ensure that \(P_e(W|T)\) produces a non-zero number no matter what word W you give it. You can use Laplace smoothing (as in MP 2) to fill in a probability for "UNKNOWN", which will be the return value for all words W that were not seen in the training data. For this initial implementation of Viterbi, use the same Laplace smoothing constant \(\alpha\) for all tags.

This simple version of Viterbi will perform worse than the baseline code on the Brown development dataset (somewhat over 93% accuracy). However, you should notice that it does better on the multiple-tag words (e.g. over 93.5%). You should write this simple version of Viterbi in the viterbi_1 function in viterbi_1.py.

Starting or Debugging Part 2 Viterbi

THIS PART IS OPTIONAL AND SHOULD NOT BE SUBMITTED. However, in case you find it difficult to directly attempt Part 2, you may find it easier to start here. You can first implement or debug your Viterbi implementation alone using the dummy example provided. Jurafsky and Martin (section 8.4.6) provide a great worked-out example for visualizing the behavior of the Viterbi algorithm. We provide that example as a sample test case so you can ensure that your entire Viterbi algorithm (trellis computation and backtracking) is working correctly. A rough decoding sketch also appears after the test command below.
python3 test_viterbi.py
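To help you get oriented, here is a minimal sketch of the trellis computation and backtracking for one sentence. The helper names log_trans and log_emit are illustrative (not part of the starter code); they stand for smoothed log transition and emission probabilities. Working in log space turns products of probabilities into sums, which avoids numerical underflow on long sentences.

import math

def viterbi_decode(words, tags, log_trans, log_emit):
    """Decode one sentence with the Viterbi algorithm (illustrative sketch).

    words:     the sentence, beginning with 'START' and ending with 'END'
    tags:      all tags seen in training (assumed to include 'START')
    log_trans: function (prev_tag, tag) -> smoothed log P(tag | prev_tag)
    log_emit:  function (tag, word)     -> smoothed log P(word | tag)
    """
    n = len(words)
    # trellis[i][t] = best log probability of any tag sequence for words[0..i] ending in tag t
    trellis = [{} for _ in range(n)]
    backptr = [{} for _ in range(n)]

    # Every sentence starts with the START token tagged START.
    trellis[0] = {t: (0.0 if t == 'START' else -math.inf) for t in tags}

    for i in range(1, n):
        for t in tags:
            best_prev, best_score = None, -math.inf
            for p in tags:
                score = trellis[i - 1][p] + log_trans(p, t)
                if score > best_score:
                    best_prev, best_score = p, score
            trellis[i][t] = best_score + log_emit(t, words[i])
            backptr[i][t] = best_prev

    # Backtrack from the highest-scoring tag of the final word.
    best_last = max(trellis[n - 1], key=trellis[n - 1].get)
    path = [best_last]
    for i in range(n - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    path.reverse()
    return list(zip(words, path))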
Part 3: Viterbi_2

The previous Viterbi tagger fails to beat the baseline because it does very poorly on unseen words. It assumes that all tags have similar probability for these words, but we know that a new word is much more likely to have the tag NOUN than (say) CONJ. For this part, you'll improve your emission smoothing to match the real probabilities for unseen words.

Words that appear zero times in the training data (out-of-vocabulary or OOV words) and words that appear once in the training data (hapax words) tend to have similar parts of speech (POS). For this reason, instead of assuming that OOV words are uniformly distributed across all POS, we can get a much better estimate of their distribution by measuring the distribution of hapax words. Extract these words from the training data and calculate the probability of each tag on them. When you do your Laplace smoothing of the emission probabilities for tag T, scale the Laplace smoothing constant by P(T|hapax), i.e., the probability that tag T occurs given that the word was hapax.

Remember that Laplace smoothing acts by reducing the probability mass of high-frequency words and re-assigning some of that mass to low-frequency words. A large smoothing constant can end up skewing probability masses a lot, so experiment with small orders of magnitude for this hyperparameter.

This optimized version of the Viterbi code should have significantly better unseen-word accuracy on the Brown development dataset, e.g. over 66.5%. It should also beat the baseline on overall accuracy (e.g. 95.5%). You should write your optimized version of Viterbi in the viterbi_2 function in viterbi_2.py. The hapax word tag probabilities may differ from one dataset to another, so your viterbi_2 method should compute them dynamically from its training data each time it runs.

Hints
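For example, the hapax tag distribution P(T|hapax) might be computed along these lines (an illustrative sketch; the function name and assumed data format are not part of the starter code):

from collections import Counter

def hapax_tag_distribution(train):
    """Estimate P(T | hapax): the tag distribution over words seen exactly once in training."""
    word_counts = Counter(word for sentence in train for word, _ in sentence)
    hapax_tags = Counter(tag for sentence in train
                         for word, tag in sentence if word_counts[word] == 1)
    total = sum(hapax_tags.values()) or 1  # guard against an (unlikely) empty hapax set
    return {tag: count / total for tag, count in hapax_tags.items()}

When smoothing the emission probabilities for tag T, you would then scale the Laplace constant, e.g. use alpha * P(T|hapax) in place of alpha, applying a small floor value for tags that never occur on hapax words so that no smoothed probability becomes exactly zero.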
Extra Credit: Viterbi_ec

The task for the extra credit is to maximize the accuracy of the Viterbi code. You must train on only the provided training set (no external resources), and you should keep the basic Viterbi algorithm. However, you can make any algorithmic improvements you like. This optimized algorithm should be named viterbi_ec.

We recommend trying to improve the algorithm's ability to guess the right tag for unseen words. If you examine the set of hapax words in the training data, you should notice that words with certain prefixes and certain suffixes typically have certain limited types of tags. For example, words with the suffix "-ly" have several possible tags, but the tag distribution is very different from that of the full set of hapax words. You can do a better job of handling these words by changing the emission probabilities generated for them.

Recall what we did in Parts 2 and 3: we mapped hapax words (in the training data) and unseen words (in the development or test data) into a single pseudoword "UNKNOWN". To exploit the form of the word, you can map hapax/unseen words into several different pseudowords. E.g. perhaps all the words ending in "-ing" could be mapped to "X-ING". Then you can use the hapax words to calculate suitable probability values for X-ING, as in Part 3. (An illustrative sketch of such a mapping appears at the end of this page.)

It is extremely hard to predict useful prefixes and suffixes from first principles. We strongly recommend building yourself a separate Python tool to dump the hapax words, with their tags, into a separate file that you can inspect. You may assume that our completely hidden datasets are in English, so word patterns from the Brown corpus should continue to be useful for the hidden corpus. Using this method, our model solution gets over 76% accuracy on unseen words, and over 96% accuracy overall. (Both numbers are on the Brown development dataset.)

It may also be possible to improve performance by using the two previous tags (rather than just one) to predict each tag. A full version of this idea would use 256 separate tag pairs and may be too slow to run on the autograder. However, you may be able to gain accuracy by using only selected information from the first of the two tags. Also, beam search can be helpful to speed up decoding time. The autograder for parts 3 and ec has a timeout of 10 minutes.

Code Submission

As in previous MPs, the regular MP and the EC should be submitted separately. The regular MP includes the submission files baseline.py, viterbi_1.py, and viterbi_2.py. The EC requires the submission file viterbi_ec.py.
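For reference, here is the small illustrative sketch of the pseudoword mapping idea from the extra-credit section. The particular suffixes and class names are examples only, not a tuned or recommended list.

def classify_unseen(word):
    """Map a hapax/unseen word to a pseudoword class based on its form (illustrative)."""
    if any(ch.isdigit() for ch in word):
        return 'X-NUM'
    if word.endswith('ing'):
        return 'X-ING'
    if word.endswith('ly'):
        return 'X-LY'
    if word.endswith('ed'):
        return 'X-ED'
    return 'UNKNOWN'

You would then compute a separate hapax tag distribution for each pseudoword class (as in Part 3) and use the matching class when smoothing emission probabilities for an unseen word.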