CS440/ECE448 Fall 2018
Assignment 5: HMM POS tagging
Due date: Monday November 5, 11:55pm
Margaret Fleck and Renxuan Wang
adapted from a U. Penn assignment
For this MP, you will implement part of speech (POS) tagging
using an HMM model. Make sure you understand the algorithm
before you start writing code, e.g. review
lectures 20 and 21 and
Chapter 8 of Jurafsky and Martin.
The design of the algorithms (Baseline and Viterbi) is fixed.
However, the Viterbi algorithm requires smoothing and our description
is vague about the details. A major part of your task is to fill
these in.
For this MP, you should not need to use any modules besides the ones
built into Python. It's ok to use numpy if you must. Any other
package requires prior permission.
General guidelines
Basic instructions are the same as in MP 1. To summarize:
Your moodle submission should include:
- Your report (formatted pdf).
- A copy of viterbi.py containing your tagging code.
- A copy of mp5.py if you modified it to add extra functionality
for extra credit.
The autograder will use our mp5.py functions.
So your basic functionality must work with this interface.
- (Groups only) Statement of individual contribution.
Extra credit may be awarded for going beyond expectations or
completing the suggestions below. Notice, however, that the score for
each MP is capped at 110%.
The mp5 code reads data from two files. Your tagging function will
be given the training data with tags and the test data without tags.
Your tagger should use the training data to
estimate the probabilities it requires, and then use this
model to infer tags for the test input.
The main mp5 function will compare these against the correct
tags and report your accuracy.
The data is divided into sentences. Your tagger should process
each sentence independently.
You will need to write two tagging functions:
- Baseline
- Viterbi: HMM tagger
The Baseline tagger considers each word independently, ignoring previous
words and tags. For each word w, it counts how many times w occurs
with each tag in the training data. When processing the test data,
it consistently gives w the tag that was seen most often.
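A minimal sketch of the Baseline logic is below, assuming training
sentences are lists of (word, tag) pairs and test sentences are lists
of words; check the provided skeleton for the actual interface, and
note that the fallback tag for unseen words is one reasonable choice,
not something this writeup prescribes:

from collections import defaultdict, Counter

def baseline(train, test):
    # Count how often each word occurs with each tag.
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sentence in train:
        for word, tag in sentence:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    # Fall back to the overall most common tag for unseen words.
    default_tag = all_tags.most_common(1)[0][0]
    output = []
    for sentence in test:
        tagged = []
        for word in sentence:
            if word in word_tags:
                tag = word_tags[word].most_common(1)[0][0]
            else:
                tag = default_tag
            tagged.append((word, tag))
        output.append(tagged)
    return output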
The Viterbi tagger should implement the HMM trellis (Viterbi) decoding algorithm
as seen in lecture or Jurafsky and Martin. That is, the probability of each
tag depends only on the previous tag, and the probability of each word depends
only on the corresponding tag. This model will need to estimate
three sets of probabilities:
- Initial probabilities (How often does each tag occur at the start of
a sentence?)
- Transition probabilities (How often does tag tb follow tag
ta?)
- Emission probabilities (How often does tag t yield word w?)
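As a concrete starting point, the raw (unsmoothed) counts behind
these three tables could be gathered as follows, assuming the same
(word, tag) sentence format as above (names illustrative):

from collections import defaultdict, Counter

def gather_counts(train):
    init_counts = Counter()              # tag at sentence start
    trans_counts = defaultdict(Counter)  # trans_counts[ta][tb]
    emit_counts = defaultdict(Counter)   # emit_counts[t][w]
    for sentence in train:
        init_counts[sentence[0][1]] += 1
        for word, tag in sentence:
            emit_counts[tag][word] += 1
        for (_, ta), (_, tb) in zip(sentence, sentence[1:]):
            trans_counts[ta][tb] += 1
    return init_counts, trans_counts, emit_counts

Smoothing and normalization then turn these counts into the three
probability tables, as outlined in the steps below.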
It's helpful to think of your processing in five steps:
- Count occurrences of tags, tag pairs, and tag/word pairs.
- Compute smoothed probabilities.
- Take the log of each probability.
- Construct the trellis. Notice that
for each tag/time pair, you must store not only
the probability of the best path but also a pointer to the
previous tag/time pair in that path.
- Return the best path through the trellis.
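The last three steps might look like the following sketch for a
single sentence. It assumes the smoothed log probabilities have
already been computed and stored in dictionaries (including some
value for words unseen in training); all names are illustrative:

import math

def viterbi_decode(words, tags, log_init, log_trans, log_emit):
    # trellis[i][t] = (best log prob of a path ending in tag t at
    #                  position i, backpointer to the previous tag)
    trellis = [{} for _ in words]
    for t in tags:
        trellis[0][t] = (log_init[t] + log_emit[t][words[0]], None)
    for i in range(1, len(words)):
        for t in tags:
            best_prev, best_score = None, -math.inf
            for tp in tags:
                score = trellis[i - 1][tp][0] + log_trans[tp][t]
                if score > best_score:
                    best_prev, best_score = tp, score
            trellis[i][t] = (best_score + log_emit[t][words[i]], best_prev)
    # Follow the backpointers from the best final tag.
    last = max(tags, key=lambda t: trellis[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(trellis[i][path[-1]][1])
    path.reverse()
    return list(zip(words, path))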
You'll need to use smoothing to get good performance. The details are
up to you. Start with a simple version of Laplace smoothing.
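For example, Laplace smoothing of the emission probabilities might
look like the sketch below. Here alpha is the smoothing constant, and
reserving one extra slot of probability mass for words unseen with a
given tag is an assumption of this sketch, not a requirement:

import math

def laplace_log_emit(emit_counts, alpha=1.0):
    # P(w | t) = (count(t, w) + alpha) / (count(t) + alpha * (V + 1))
    # where V is the number of distinct words seen with tag t.
    log_emit = {}
    for tag, counts in emit_counts.items():
        total = sum(counts.values())
        denom = total + alpha * (len(counts) + 1)
        log_emit[tag] = {w: math.log((c + alpha) / denom)
                         for w, c in counts.items()}
        # Log probability for any word unseen with this tag.
        log_emit[tag][None] = math.log(alpha / denom)
    return log_emit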
The code package for the MP
(tar,
zip)
contains three sets of training and development data:
- Brown corpus
- MASC corpus (from the Open American National Corpus)
- Both, i.e. the union of the two previous datasets
You should use the training data to train the parameters of your model
and the development sets to test its accuracy.
We will use a separate (unseen) set of data to test your code after you submit it.
The code skeleton in the package will read data from the corpus
files, run your tagging algorithm, and report the accuracy of
its output.
Here is an example of how to run the code on the Brown corpus
data, specifying the baseline algorithm.
python3 mp5.py --train brown-training.txt --test brown-dev.txt --baseline
There is a switch that will toggle the corpus reader between case
sensitive (i.e. keep the case as shown in the input) and
case insensitive (i.e. make the words entirely lowercase).
You should test your code on all three datasets. For each
dataset, you should report performance of both Baseline and Viterbi on
- the training data (i.e. supply the training file as both the
training and the test input), and
- the development data.
Performance on the training data should be high, because evaluating
on the data you trained on is cheating. But this is a good way to
check for code bugs.
You should run all these tests for both the case sensitive and case
insensitive code settings.
Your report should briefly describe your implemented solution
but primarily concentrate on
- How you handled non-obvious details (esp. smoothing and associated parameters), and
- Test results
Be sure to explain why you made the choices that you did. This may include
selected test results from configurations that worked less well,
to help motivate your final choices.
Present your test results in an organized manner using tables. Discuss which
choices worked best and which versions of the problem (e.g. choice of dataset)
were easier or harder.
For this MP, the algorithms are fixed. Do not change them.
Your main opportunity for
extra credit involves optimizing the smoothing methods to improve
performance. Notice that you may need to use different smoothing
methods for different groups of probabilities. (See e.g. the lecture
21 summary for some ideas.)
You may also wish to do more extensive experimentation.
For example,
- Run the tagger on
(slightly) mismatched training and test data. E.g. how well can
you get it to perform if trained on Brown but tested on
MASC?
- Separately report performance on test sentences that did and didn't
contain words unseen in the training data (see the sketch after this list).
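One way to do the second experiment, assuming the development
sentences are lists of (word, tag) pairs and train_vocab is the set
of words seen in training (both names illustrative):

def split_by_unseen(dev_sentences, train_vocab):
    seen_only, has_unseen = [], []
    for sentence in dev_sentences:
        if all(word in train_vocab for word, _ in sentence):
            seen_only.append(sentence)
        else:
            has_unseen.append(sentence)
    return seen_only, has_unseen

You can then score your tagger separately on the two groups.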
You could also try automatically tuning key smoothing parameters,
using a loop that systematically tries a variety of settings.
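Such a loop might look like this sketch, where evaluate is any
callable you supply that trains a tagger with the given smoothing
parameter and returns its accuracy on the development data (a
hypothetical helper, not part of the provided code):

def tune_alpha(evaluate, alphas=(0.001, 0.01, 0.1, 1.0, 10.0)):
    best_alpha, best_acc = None, float("-inf")
    for alpha in alphas:
        acc = evaluate(alpha)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc

Remember to tune on the development data, since the test data is unseen.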
You could try using the form of an unseen word (e.g. its length, its
last few characters) to make a better guess at its part of speech.
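For instance, a simple suffix-based guess might look like this; the
suffix rules and tag names below are purely illustrative, and you
would want to derive a real mapping (and the actual tag set) from
the training data:

def guess_tag(word, default_tag="NOUN"):
    # Guess a tag for an unseen word from its surface form.
    if word and word[0].isdigit():
        return "NUM"
    for suffix, tag in (("ing", "VERB"), ("ed", "VERB"),
                        ("ly", "ADV"), ("tion", "NOUN"),
                        ("ness", "NOUN"), ("able", "ADJ")):
        if word.endswith(suffix):
            return tag
    return default_tag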
If you experiment with several configurations,
you may wish to add extra flag options to the main mp5 function.
If you do this:
- Submit your revised mp5.py
- Make sure that the provided version of mp5.py will run your
best configuration, i.e. the one you'd like the autograder to use.