
CS440/ECE448 Fall 2018

Assignment 5: HMM POS tagging

Due date: Monday November 5, 11:55pm

Margaret Fleck and Renxuan Wang
adapted from a U. Penn assignment


For this MP, you will implement part of speech (POS) tagging using an HMM model. Make sure you understand the algorithm before you start writing code, e.g. look at lectures 20 and 21 and Chapter 8 of Jurafsky and Martin.

The design of the algorithms (Baseline and Viterbi) is fixed. However, the Viterbi algorithm requires smoothing and our description is vague about the details. A major part of your task is to fill these in.

For this MP, you should not need to use any modules besides the ones built into Python. It's ok to use numpy if you must. Any other package requires prior permission.

General guidelines

Basic instructions are the same as in MP 1. To summarize:

Your moodle submission should include:

  • Your report (formatted pdf).
  • A copy of viterbi.py containing your tagging code.
  • A copy of mp5.py, if you modified it to add extra functionality for extra credit. The autograder will use our mp5.py functions, so your basic functionality must work with this interface.
  • (Groups only) Statement of individual contribution.

Extra credit may be awarded for going beyond expectations or completing the suggestions below. Notice, however, that the score for each MP is capped at 110%.

Problem Statement

The mp5 code reads data from two files. Your tagging function will be given the training data with tags and the test data without tags. Your tagger should use the training data to estimate the probabilities it requires, and then use this model to infer tags for the test input. The main mp5 function will compare these against the correct tags and report your accuracy.

The data is divided into sentences. Your tagger should process each sentence independently.

You will need to write two tagging functions:

  • Baseline
  • Viterbi: HMM tagger

The Baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it consistently gives w the tag that was seen most often.
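
Here is a minimal sketch of the counting involved, assuming the training data arrives as a list of sentences, each a list of (word, tag) pairs, and the test data as a list of word lists (check the actual interface in mp5.py). Falling back to the overall most common tag for words never seen in training is one reasonable choice, not a requirement:

from collections import defaultdict, Counter

def baseline_tag(train, test):
    # Count how often each word occurs with each tag in the training data.
    word_tags = defaultdict(Counter)   # word -> Counter of tags
    all_tags = Counter()               # overall tag frequencies
    for sentence in train:
        for word, tag in sentence:
            word_tags[word][tag] += 1
            all_tags[tag] += 1

    # One possible fallback for unseen words: the overall most common tag.
    default_tag = all_tags.most_common(1)[0][0]

    tagged = []
    for sentence in test:
        tagged.append([(w, word_tags[w].most_common(1)[0][0]) if w in word_tags
                       else (w, default_tag) for w in sentence])
    return tagged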

The Viterbi tagger should implement the HMM trellis (Viterbi) decoding algorithm as seen in lecture or Jurafsky and Martin. That is, the probability of each tag depends only on the previous tag, and the probability of each word depends only on the corresponding tag. This model will need to estimate three sets of probabilities (a counting sketch follows the list):

  • Initial probabilities (How often does each tag occur at the start of a sentence?)
  • Transition probabilities (How often does tag tb follow tag ta?)
  • Emission probabilities (How often does tag t yield word w?)
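
As a rough sketch of how these counts might be collected (again assuming sentences are lists of (word, tag) pairs; the actual mp5.py interface may differ):

from collections import defaultdict, Counter

def count_events(train):
    init_counts = Counter()               # tag -> times it starts a sentence
    trans_counts = defaultdict(Counter)   # previous tag -> Counter of following tags
    emit_counts = defaultdict(Counter)    # tag -> Counter of emitted words
    tag_totals = Counter()                # total occurrences of each tag
    for sentence in train:
        prev = None
        for word, tag in sentence:
            if prev is None:
                init_counts[tag] += 1
            else:
                trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            tag_totals[tag] += 1
            prev = tag
    return init_counts, trans_counts, emit_counts, tag_totals

Dividing these counts by the appropriate totals gives unsmoothed estimates of the three probability tables; the smoothing discussed below replaces the straight division.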

It's helpful to think of your processing in five steps:

  • Count occurrences of tags, tag pairs, tag/word pairs.
  • Compute smoothed probabilities
  • Take the log of each probability
  • Construct the trellis. Notice that for each tag/time pair, you must store not only the probability of the best path but also a pointer to the previous tag/time pair in that path.
  • Return the best path through the trellis (see the sketch after this list for the last two steps).
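
Here is a minimal sketch of the last two steps (trellis construction and backtracking) for a single sentence. It assumes you have already built dictionaries of smoothed log probabilities, with log_trans defined for every tag pair and a None entry in each emission table covering unseen words; these are organizational choices, not requirements:

import math

def viterbi_decode(words, tags, log_init, log_trans, log_emit):
    # words: one test sentence; tags: all tags seen in training.
    # Each trellis cell stores (best log probability, backpointer to previous tag).
    trellis = [{t: (log_init[t] + log_emit[t].get(words[0], log_emit[t][None]), None)
                for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = log_emit[t].get(words[i], log_emit[t][None])
            best_prev = max(tags, key=lambda p: trellis[i-1][p][0] + log_trans[p][t])
            column[t] = (trellis[i-1][best_prev][0] + log_trans[best_prev][t] + emit,
                         best_prev)
        trellis.append(column)
    # Follow backpointers from the best final tag to recover the best path.
    best = max(tags, key=lambda t: trellis[-1][t][0])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(trellis[i][path[-1]][1])
    return list(zip(words, reversed(path)))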

You'll need to use smoothing to get good performance. The details are up to you. Start with a simple version of Laplace smoothing.
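
For example, Laplace smoothing of the emission probabilities might look roughly like the sketch below; alpha is a constant you will want to tune on the development data, and the None entry is one way to reserve probability mass for words not seen with a tag:

import math

def smoothed_log_emissions(emit_counts, tag_totals, alpha=0.001):
    # emit_counts: tag -> Counter of words; tag_totals: total count of each tag.
    # alpha = 0.001 is just a starting point, not a recommended value.
    log_emit = {}
    for tag, words in emit_counts.items():
        n = tag_totals[tag]
        v = len(words) + 1                  # +1 for the unseen-word bucket
        denom = n + alpha * v
        log_emit[tag] = {w: math.log((c + alpha) / denom) for w, c in words.items()}
        log_emit[tag][None] = math.log(alpha / denom)
    return log_emit

The initial and transition probabilities can be smoothed the same way; different constants for different tables often work better than a single shared value.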

The materials

The code package for the MP (tar, zip) contains three sets of training and development data:

  • Brown corpus
  • MASC corpus (from the Open American National Corpus)
  • Both, i.e. the union of the two previous datasets

You should use the training data to train the parameters of your model and the development sets to test its accuracy. We will use a separate (unseen) set of data to test your code after you submit it.

The code skeleton in the package will read data from the corpus files, run your tagging algorithm, and report the accuracy of its output. Here is an example of how to run the code on the Brown corpus data, specifying the baseline algorithm.

python3 mp5.py --train brown-training.txt --test brown-dev.txt --baseline

There is a switch that will toggle the corpus reader between case sensitive (i.e. keep the case as shown in the input) and case insensitive (i.e. make the words entirely lowercase).

Testing

You should test your code on all three datasets. For each dataset, you should report performance of both Baseline and Viterbi on

  • the training data (i.e. submit the training data for both input files), and
  • the development data.

Performance on the training data should be high, because testing on the data you trained on is essentially cheating. However, it is a good way to check for code bugs.

You should run all these tests for both the case sensitive and case insensitive code settings.

Report Checklist

Your report should briefly describe your implemented solution but primarily concentrate on

  • How you handled non-obvious details (esp. smoothing and associated parameters), and
  • Test results

Be sure to explain why you made the choices that you did. This may include selected test results from configurations that worked less well, to help motivate your final choices.

Present your test results in an organized manner using tables. Discuss which choices worked best and which versions of the problem (e.g. choice of dataset) were easier or harder.

Extra Credit

For this MP, the algorithms are fixed. Do not change them. Your main opportunity for extra credit involves optimizing the smoothing methods to improve performance. Notice that you may need to use different smoothing methods for different groups of probabilities. (See e.g. the lecture 21 summary for some ideas.)

You may also wish to do more extensive experimentation. For example,

  • Run the tagger on (slightly) mismatched training and test data. E.g. how well can you get it to perform if trained on Brown but tested on MASC?
  • Separately report performance on test sentences that did/didn't contain words not seen in the training data.

You could also try automatically tuning key smoothing parameters, using a loop that systematically tries a variety of settings.
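
A simple grid search might look like the sketch below, where run_tagger and accuracy are hypothetical stand-ins for your own Viterbi driver and scoring code, and the alpha grid is arbitrary:

def tune_alpha(train, dev, alphas=(1.0, 0.1, 0.01, 0.001, 0.0001)):
    # run_tagger and accuracy are placeholders for your own code, not part of mp5.py.
    best_alpha, best_acc = None, -1.0
    for alpha in alphas:
        predictions = run_tagger(train, dev, alpha=alpha)
        acc = accuracy(predictions, dev)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc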

You could try using the form of an unseen word (e.g. its length, its last few characters) to make a better guess at its part of speech.
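
One way to start on this is to collect tag statistics for word endings during training and use them to bias the emission probabilities of unseen words; the three-character suffix here is an arbitrary choice:

from collections import defaultdict, Counter

def suffix_tag_counts(train, suffix_len=3):
    # Count which tags occur with each word ending, e.g. "-ing" or "-ly".
    counts = defaultdict(Counter)
    for sentence in train:
        for word, tag in sentence:
            counts[word[-suffix_len:].lower()][tag] += 1
    return counts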

If you experiment with several configurations, you may wish to add extra flag options to the main mp5 function. If you do this:

  • Submit your revised mp5.py
  • Make sure that the provided version of mp5.py will run your best configuration, i.e. the one you'd like the autograder to use.