CS440/ECE448 MP 8

Processing math: 100%

CS440/ECE448 Fall 2024

MP 8: HMM POS tagging

Due date: Monday October 28th, 11:59pm

For this MP, improve your (POS) tagging model from MP7. Make sure you understand the algorithm before you start writing code, e.g. look at the lectures on Hidden Markov Models and Sections 17.1, 17.2, and 17.4 of Jurafsky and Martin.

General guidelines

Basic instructions are the same as in previous MPs:

For general instructions, see the main MP page and the course policies.
You may use numpy (though it's not needed). You may not use other non-standard modules (including nltk).

Problem Statement

The MP8 code reads data from two files. Your tagging function will be given the training data with tags and the test data without tags. Your tagger should use the training data to estimate the probabilities it requires, and then use this model to infer tags for the test input. The main MP8 function will compare these against the correct tags and report your accuracy.

The data is divided into sentences. Your tagger should process each sentence independently.

You will need to write two additional modifications to the Viterbi tagger you wrote in MP7.

The materials

The code package (and data) for the MP (template.zip) contains these code files:

mp8.py (do not change)
utils.py (do not change)
viterbi_2.py
viterbi_3.py

It also contains the same brown corpus and mttest datasets that were in the MP 7 template package.

To run the code on the Brown corpus data you need to tell it where the data is and which algorithm to run, either baseline/viterbi_1 from MP7, or viterbi_2/viterbi_3:

python3 mp8.py --train data/brown-training.txt --test data/brown-dev.txt --algorithm [viterbi_2, viterbi_3]

The program will run the algorithm and report three accuracy numbers:

overall accuracy
accuracy on words that have been seen with multiple different tags
accuracy on unseen words

Many words in our datasets have only one possible tag, so it's very hard to get the tag wrong! This means that even very simple algorithms have high overall accuracy. The other two accuracy numbers will help you see where there is room for improvement.

Code Submission

You will be submitting two files to Gradescope: viterbi_2.py and viterbi_3.py.

The Gradescope autograder will run your code on data from the Brown corpus, possibly including test data that wasn't in the template package. In addition, your code will be tested on one or more hidden datasets that are not available to you. This data is guaranteed to be in English and generally similar to the Brown corpus, but it may have a different number of tags and different words from the ones provided to you.

Do NOT hardcode any of your important computations, such as transition probabilities and emission probabilities, number or name of tags, and etc.

Tagset

The following is an example set of 16 part of speech tags. This is the tagset used in the provided Brown corpus. But remember you should not hardcode anything regarding this tagset because we will test your code on another dataset with a different tagset.

ADJ adjective
ADV adverb
IN preposition
PART particle (e.g. after verb, looks like a preposition)
PRON pronoun
NUM number
CONJ conjunction
UH filler, exclamation
TO infinitive
VERB verb
MODAL modal verb
DET determiner
NOUN noun
PERIOD end of sentence punctuation
PUNCT other punctuation
X miscellaneous hard-to-classify items

The provided code converts all words to lowercase when it reads in each sentence. It also adds two dummy words at the ends of the sentnce: START (tag START) and END (tag END). These tags are just for standardization; they will not be considered in accuracy computations.

Since the first word in each input sentence is always START, the initial probabilities for your HMM can have a very simple form, which you can hardcode into your program. Your code only needs to learn the transition probabilities and the emission probabilities. The transition probabilities from START to the next tag will encode the probability of starting a sentence with the various different tags.

Viterbi_2

The previous Vitebi tagger fails to beat the baseline because it does very poorly on unseen words. It's assuming that all tags have similar probability for these words, but we know that a new word is much more likely to have the tag NOUN than (say) CONJ. For this part, you'll improve your emission smoothing to match the real probabilities for unseen words.

Words that occur only once in the training data ("hapax" words) have a distribution similar to the words that appear only in the test/development data. Extract these words from the training data and calculate the probability of each tag on them. When you do your Laplace smoothing of the emission probabilities for tag T, scale Laplace smoothing constant $\alpha$ by the corresponding probability that tag T occurs among the set of hapax words. That is, you are dividing your old Laplace smoothing constant $\alpha$ among the tags, weighted by the frequency of that tag in the hapax words.

If a tag is never seen in the set of hapax words, treat it as if you saw it once.

This improved version of the Viterbi code should be named viterbi_2 and be in the file viterbi_2.py. It should have a significantly better unseen word accuracy on the Brown development dataset, e.g. over 66.5%. It also beat the baseline on overall accuracy (e.g. 95.5%).

The hapax word tag probabilities may be different from one dataset to another. So your Viterbi code should compute them dynamically from its training data each time it runs.

Hints:

Most of the code in viterbi_2.py is the same as that of viterbi_1.py
Tag X rarely occurs in the dataset.
You may wish to tweak your Laplace smoothing constant after you have this part basically working ok.
If mysterious things are happening in your output, esp. lots of X tags, consider adding conditional print statements to make sure no probabilities are ending up equal to zero.
Setting a high value for the Laplace smoothing constant may overly smooth the emission probabilities and break your statistical computations. A small value for the Laplace smoothing constant, e.g. 1e-5, may help.
We do not recommend using global variables in your implementation since Gradescope runs a number of different tests within the same python environment. Global values set during one test will carry over to subsequent tests.

Viterbi_3

For this part, we're going to improve Viterbi_2's accuracy on unseen words by using the form of each word to help guess its part of speech tag. This will make a big improvement to the unseen word accuracy and a modest improvement to the overall accuracy. This improved algorithm should be named viterbi_3.

First, you should write a helper function to classify a word into one of six types:

starts and ends with a numerical digit (e.g. "201-3456")
very short, i.e. 1-3 characters long (e.g. "cnn")
short, i.e. 4-9 characters long
- ends in -s (e.g. slugs)
- ends in any other letter (e.g. walked)
long, i.e. at least 10 characters long
- ends in -s (e.g. refrigerators)
- ends in any other letter (e.g. polymorphism)

The input should be a word. The output should be a string label for the type. You can pick your favorite label for each of the six word types and hardcode the list into your program. However, notice that the labels should be in ALL CAPS so that the Viterbi code cannot confuse them with real words (which are all lowercase in our datasets).

For Viterbi_1, you had one Laplace smoothing $\alpha$ for all unseen words. For Viterbi_2, we divided $\alpha$ into a separate value for each tag, based on how often we saw the tag in the hapax words. For Viterbi_3, you should divide $\alpha$ into separate values for each tag and each word type. So if there are k tags and j word types, there should be kj separate $\alpha$ values that add up to the original $\alpha$ . Do this by figuring out how often each tag-type pair occurs in the set of hapax words, and scaling alpha by this count divided by the total number of hapax words.

If you never see a tag-type pair in the hapax words, treat it as if you saw it once. Depending on exactly how you build your code, this method filling in values for missing type-tag pairs might make all the little $\alpha$ values add up to something that isn't quite the original value of $\alpha$ . However, your choice of how to handle this issue should have only tiny effects on the performance of your algorithm.

OK, so think about a single tag T. How do you do Laplace smoothing for the word emission probabilities for tag T? Each of the six word types has a smoothing constant $\alpha$ that should be used to compute the emission probability for an unseen word of that type. For handling the words that were seen in the training data, first add up the six $\alpha$ values for the six types of unseen words. This will give you one larger constant $\alpha$ that you can use in the Laplace smoothing formula for all the words that you saw in the training data.

Using this method, our model solution increases its unseen word accuracy for the Brown development dataset from 66.92% to 73.72%. The overall accuracy on this dataset increases from 95.67% to 95.96%.