Getting Started¶
What you need to do:
Download and extract the assignment files (mp03.zip)
Implement the required functions in submitted.py:
- baseline(test, train)
- viterbi(test, train)
- analyze_ambiguous_words(train_sentences, predicted_sentences, tag_sentences, k=5)
Test locally with python grade.py
Submit submitted.py to Gradescope
Files you'll work with:
- submitted.py - Your main work goes here
- utils.py - Helper functions for data loading and evaluation
- data/ - Training and test datasets (Brown corpus)
- tests/ - Test files (don't modify these)
- mp03_notebook.ipynb - This debugging notebook
The first thing you need to do is to download this file: mp03.zip. It has the following content:
- submitted.py: Your homework. Edit, and then submit to Gradescope.
- mp03_notebook.ipynb: This is a Jupyter notebook to help you debug. You can completely ignore it if you want, although you might find that it gives you useful instructions.
- grade.py: Once your homework seems to be working, you can test it by typing python grade.py, which will run the visible tests.
- tests/test_visible.py: This file contains visible unit tests for the main assignment parts (the baseline, viterbi, and analyze_ambiguous_words functions).
- data: This directory contains the training and test datasets.
- utils.py: This is an auxiliary program that you can use to read the data, evaluate the accuracy, etc.
This file will walk you through the whole MP, giving you instructions and debugging tips as you go.
For this MP, you will implement part of speech (POS) tagging using an HMM model. Make sure you understand the algorithm before you start writing code.
Reading the data¶
The dataset consists of thousands of sentences with ground-truth POS tags.
The provided load_dataset function will read in the data as a nested list, with the outer dimension representing each sentence and the inner dimension representing each tagged word. The following cells will help you go through the representation of the data.
The provided code converts all words to lowercase. It also adds a START and END tag for each sentence when it loads the sentence. These tags are just for standardization. They will not be considered in accuracy computation.
import utils
train_set = utils.load_dataset('data/brown-training.txt')
dev_set = utils.load_dataset('data/brown-test.txt')
print('training set has {} sentences'.format(len(train_set)))
print('dev set has {} sentences'.format(len(dev_set)))
print('The first sentence of training set has {} words'.format(len(train_set[0])))
print('The 10th word of the first sentence in the training set is "{}" with ground-truth tag "{}"'.format(train_set[0][9][0], train_set[0][9][1]))
training set has 35655 sentences
dev set has 9912 sentences
The first sentence of training set has 27 words
The 10th word of the first sentence in the training set is "investigation" with ground-truth tag "NOUN"
print('Here is a sample sentence from the training set:\n', train_set[0])
Here is a sample sentence from the training set:
[('START', 'START'), ('the', 'DET'), ('fulton', 'NOUN'), ('county', 'NOUN'), ('grand', 'ADJ'), ('jury', 'NOUN'), ('said', 'VERB'), ('friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'IN'), ("atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', 'PUNCT'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", 'PUNCT'), ('that', 'CONJ'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', 'PERIOD'), ('END', 'END')]
Tagset
The following is the set of 16 part-of-speech tags used in the provided Brown corpus. Remember: you should not hardcode anything about this tagset, because we will test your code on two other datasets with different tagsets.
- ADJ adjective
- ADV adverb
- IN preposition
- PART particle (e.g. after verb, looks like a preposition)
- PRON pronoun
- NUM number
- CONJ conjunction
- UH filler, exclamation
- TO infinitive
- VERB verb
- MODAL modal verb
- DET determiner
- NOUN noun
- PERIOD end of sentence punctuation
- PUNCT other punctuation
- X miscellaneous hard-to-classify items
Taggers and Required Functions
You will need to implement three main functions in submitted.py:
- baseline(test, train) - Baseline tagger that assigns the most frequent tag for each word
- viterbi(test, train) - HMM Viterbi tagger using dynamic programming
- analyze_ambiguous_words(train_sentences, predicted_sentences, tag_sentences, k=5) - Analysis function for studying tagging errors on ambiguous words
For this MP, you may use numpy (though it's not needed). You may not use other non-standard modules (including nltk).
You should use the provided training data to train the parameters of your model and the test sets to test its accuracy.
In addition, your code will be tested on two hidden datasets that are not available to you, which have different numbers of tags and words from the ones provided here. So do NOT hardcode any of your important computations, such as initial probabilities, transition probabilities, emission probabilities, or the number or names of tags. We will inspect code for hardcoded computations/values and will penalize such implementations.
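For example, the tagset and vocabulary can always be derived from whatever training data you are given rather than written out by hand. A minimal illustration (using the train_set loaded earlier; the variable names are just for this notebook):
# Illustrative: derive the tagset and vocabulary from the training data
# instead of hardcoding them, so the same code works on any dataset.
tagset = sorted({tag for sentence in train_set for _, tag in sentence})
vocab = sorted({word for sentence in train_set for word, _ in sentence})
print(len(tagset), 'tags;', len(vocab), 'word types')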
Testing: Run python grade.py to test your implementations. The autograder will test:
- Main functions: baseline(), viterbi(), and analyze_ambiguous_words()
- Performance benchmarks and correctness on visible test cases
Baseline Tagger
The baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it always gives w the tag that was seen most often. For unseen words, it should guess the tag that is seen most often in the training dataset.
For each seen word \(w\):
\(\mathrm{Tag}(w) = \arg\max_{t \in T} \mathrm{count}(w, t)\), the number of times tag \(t\) is matched to word \(w\) in training.
For each unseen word \(w'\):
\(\mathrm{Tag}(w') = \arg\max_{t \in T} \mathrm{count}(t)\), the number of times tag \(t\) appears in the training set.
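A minimal sketch of this counting logic using plain Python counters; the function and variable names here are illustrative, not the required helpers in submitted.py:
# Illustrative sketch of the baseline tagger (not the required interface).
from collections import Counter, defaultdict

def baseline_sketch(test, train):
    # Count (word, tag) occurrences and overall tag occurrences in training.
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in train:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    # Unseen words fall back to the overall most frequent tag.
    fallback_tag = tag_counts.most_common(1)[0][0]
    predictions = []
    for sentence in test:
        tagged = []
        for word in sentence:
            if word in word_tag_counts:
                tagged.append((word, word_tag_counts[word].most_common(1)[0][0]))
            else:
                tagged.append((word, fallback_tag))
        predictions.append(tagged)
    return predictions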
Important Note: In addition to implementing the baseline() function, you must also implement an analyze_ambiguous_words() function that analyzes tagging errors on ambiguous words.
import submitted
import importlib
importlib.reload(submitted)
print(submitted.__doc__)
Debugging helpers¶
This MP provides several helper functions inside submitted.py. The autograder may test these helpers directly.
Before you implement the helpers, the cells below will raise NotImplementedError — that’s expected. After you implement each helper, re-run these cells to sanity-check your intermediate outputs.
import submitted
help(submitted.compute_word_tag_counts)
help(submitted.get_most_frequent_tag)
help(submitted.predict_baseline_tag)
help(submitted.baseline)
help(submitted.collect_hmm_counts)
help(submitted.compute_transition_logprobs)
help(submitted.compute_emission_logprobs)
help(submitted.viterbi_decode)
# Tiny toy example to understand input/output formats
toy_train = [[('START','START'), ('dog','NN'), ('barks','VBZ'), ('END','END')]]
toy_test = [['START','dog','barks','END']]
try:
wtc, tc = submitted.compute_word_tag_counts(toy_train)
print('Top tags:', tc.most_common(5))
print("Tags for 'dog':", wtc.get('dog', {}))
tag_counts, trans_counts, emit_counts, tagset, vocab = submitted.collect_hmm_counts(toy_train)
print('tagset:', tagset)
print('vocab:', vocab)
trans_lp = submitted.compute_transition_logprobs(trans_counts, tagset)
emit_lp, unk_lp = submitted.compute_emission_logprobs(emit_counts, tag_counts, vocab)
print('Example transition logP(START->NN):', trans_lp.get(('START','NN')))
print("Example emission logP('dog'|NN):", emit_lp.get(('NN','dog')))
tags = submitted.viterbi_decode(toy_test[0], tagset, trans_lp, emit_lp, unk_lp)
print('Decoded tags:', tags)
except NotImplementedError:
print('Not implemented yet — implement helper functions in submitted.py, then re-run this cell.')
Top tags: [('START', 1), ('NN', 1), ('VBZ', 1), ('END', 1)]
Tags for 'dog': Counter({'NN': 1})
tagset: ['END', 'NN', 'START', 'VBZ']
vocab: ['END', 'START', 'barks', 'dog']
Example transition logP(START->NN): -2.9999250020973856e-05
Example emission logP('dog'|NN): -3.999880004137177e-05
Decoded tags: ['START', 'NN', 'VBZ', 'END']
helper: analyze_ambiguous_words (spec + example)¶
analyze_ambiguous_words(train_sentences, predicted_sentences, tag_sentences, k=5)
Goal¶
Return the top-k ambiguous words that your model mis-tags most often on the test set.
A word is ambiguous if it appears with more than one tag in the training set.
Inputs¶
- train_sentences: training data, a list of sentences, each a list of (word, tag) tuples.
- predicted_sentences: your model's predictions on the test set, a list of sentences, each a list of (word, predicted_tag) tuples.
- tag_sentences: the gold-labeled test set, a list of sentences, each a list of (word, gold_tag) tuples.
- k: the number of words to return.
Assume predicted_sentences[i][j][0] == tag_sentences[i][j][0] (same word alignment).
Output format¶
Return a list of length k (or shorter in one edge case; see below).
Each element is a dict with keys:
"word": the word string"train_gold_tag_counts": dict mappingtag -> countfor how many times this word appeared with each tag in TRAIN"test_pred_tag_counts": dict mappingtag -> countfor how many times your model predicted each tag on TEST"test_gold_tag_counts": dict mappingtag -> countfor how many times the gold label was each tag on TEST
Ranking / tie-breaking (deterministic)¶
Let mistakes[w] be the number of test positions where:
- the word is ambiguous in TRAIN, and
- pred_tag != gold_tag.
Sort candidate words by:
- descending mistakes[w]
- if tie: descending number of times the word appears on TEST (sum(test_gold_tag_counts[w].values()))
- if still tie: lexicographic order of the word (ascending)

Return the first k.
If fewer than k ambiguous words are mis-tagged¶
- First include all ambiguous words with mistakes[w] > 0, sorted as above.
- If you still have fewer than k, fill the remaining slots with ambiguous words that appear on TEST but have mistakes[w] == 0, ranked by test frequency (descending), then lexicographic.
- If there are still fewer than k (rare), return as many as exist.
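Taken together, these rules amount to one deterministic sort over the ambiguous words that appear on TEST. A minimal sketch of just the ordering, assuming the per-word mistake and test-frequency counts have already been computed (all names here are illustrative):
# Illustrative sketch of the ordering only; the counting is omitted.
# candidates  = ambiguous words (more than one training tag) that appear on TEST
# mistakes[w] = number of mis-tagged test positions for w
# test_counts[w] = number of times w appears on TEST
def rank_candidates(candidates, mistakes, test_counts, k):
    # Negating the counts gives descending order on mistakes and test frequency;
    # remaining ties break on the word itself in ascending (lexicographic) order.
    # Words with mistakes == 0 naturally sort after all mis-tagged words.
    ordered = sorted(candidates,
                     key=lambda w: (-mistakes.get(w, 0), -test_counts.get(w, 0), w))
    return ordered[:k]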
# Example usage (after you implement analyze_ambiguous_words)
import utils, submitted
train_set = utils.load_dataset('data/brown-training.txt')
test_set = utils.load_dataset('data/brown-test.txt')
# Run a tagger (baseline or viterbi) to get predictions
pred = submitted.baseline(utils.strip_tags(test_set), train_set)
# Get top-5 ambiguous mis-tagged words
try:
report = submitted.analyze_ambiguous_words(train_set, pred, test_set, k=5)
for item in report:
print(item["word"])
print(" train:", item["train_gold_tag_counts"])
print(" pred :", item["test_pred_tag_counts"])
print(" gold :", item["test_gold_tag_counts"])
except NotImplementedError:
print("Implement analyze_ambiguous_words in submitted.py first, then re-run.")
to
train: {'TO': 9236, 'IN': 6923, 'NOUN': 1, 'ADV': 1, 'X': 1}
pred : {'TO': 4773}
gold : {'IN': 1977, 'TO': 2796}
that
train: {'CONJ': 3960, 'PRON': 1136, 'DET': 1422, 'ADV': 31, 'X': 1}
pred : {'CONJ': 2044}
gold : {'DET': 421, 'PRON': 358, 'CONJ': 1255, 'ADV': 10}
her
train: {'DET': 1220, 'PRON': 632}
pred : {'DET': 826}
gold : {'DET': 491, 'PRON': 335}
as
train: {'CONJ': 3766, 'IN': 77, 'ADV': 704, 'X': 1}
pred : {'CONJ': 1334}
gold : {'CONJ': 1121, 'ADV': 197, 'IN': 16}
more
train: {'ADJ': 633, 'ADV': 786, 'X': 1}
pred : {'ADV': 402}
gold : {'ADJ': 179, 'ADV': 223}
help(submitted.baseline)
Help on function baseline in module submitted:

baseline(test, train)
import time
importlib.reload(submitted)
train_set = utils.load_dataset('data/brown-training.txt')
dev_set = utils.load_dataset('data/brown-test.txt')
start_time = time.time()
predicted = submitted.baseline(utils.strip_tags(dev_set), train_set)
time_spend = time.time() - start_time
accuracy, _, _ = utils.evaluate_accuracies(predicted, dev_set)
multi_tag_accuracy, unseen_words_accuracy = utils.specialword_accuracies(train_set, predicted, dev_set)
print("time spent: {0:.4f} sec".format(time_spend))
print("accuracy: {0:.4f}".format(accuracy))
print("multi-tag accuracy: {0:.4f}".format(multi_tag_accuracy))
print("unseen word accuracy: {0:.4f}".format(unseen_words_accuracy))
time spent: 0.4107 sec
accuracy: 0.9387
multi-tag accuracy: 0.9019
unseen word accuracy: 0.6782
Viterbi: HMM Tagger
The Viterbi tagger should implement the HMM trellis (Viterbi) decoding algorithm as seen in lecture or in Jurafsky and Martin. That is, the probability of each tag depends only on the previous tag, and the probability of each word depends only on the corresponding tag. This model will need to estimate three sets of probabilities:
- Initial probabilities (How often does each tag occur at the start of a sentence?)
- Transition probabilities (How often does tag \(t_b\) follow tag \(t_a\)?)
- Emission probabilities (How often does tag t yield word w?)
You can assume that all sentences will begin with a START token, whose tag is START. So your initial probabilities will have a very restricted form, whether you choose to handcode appropriate numbers or learn them from the data. The initial probabilities shown in the textbook/lecture examples will be handled by the transition probabilities out of the START tag.
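Concretely, with \(t_0 = \text{START}\), the HMM described above scores a tag sequence \(t_{1:n}\) for a word sequence \(w_{1:n}\) as
\[
P(t_{1:n}, w_{1:n}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i),
\]
and the Viterbi algorithm returns the tag sequence that maximizes this quantity (equivalently, the sum of the corresponding log-probabilities).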
It's helpful to think of your processing in five steps:
- Count occurrences of tags, tag pairs, tag/word pairs.
- Compute smoothed probabilities
- Take the log of each probability
- Construct the trellis. Notice that for each tag/time pair, you must store not only the probability of the best path but also a pointer to the previous tag/time pair in that path.
- Return the best path through the trellis.
You'll need to use smoothing to get good performance. Make sure that your code for computing transition and emission probabilities never returns zero. Use Laplace smoothing to handle zero-probability cases when calculating initial probabilities, transition probabilities, and emission probabilities.
For example, to smooth the emission probabilities, consider each tag individually. For a fixed tag T, you need to ensure that \(P_e(W|T)\) produces a non-zero number no matter what word W you give it. You can use Laplace smoothing to fill in a probability for "UNKNOWN" which will be the return value for all words W that were not seen in the training data. For this initial implementation of Viterbi, use the same Laplace smoothing constant \(\alpha\) for all tags.
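One common form of the Laplace-smoothed emission estimate, where \(V_T\) is the number of distinct words seen with tag \(T\) in training, is
\[
P_e(W \mid T) = \frac{\mathrm{count}(T, W) + \alpha}{\mathrm{count}(T) + \alpha\,(V_T + 1)},
\qquad
P_e(\mathrm{UNKNOWN} \mid T) = \frac{\alpha}{\mathrm{count}(T) + \alpha\,(V_T + 1)}.
\]
Treat this as a sketch of the idea; whatever variant you use, check that it never returns zero and that the probabilities for a fixed tag sum to one (counting UNKNOWN as one extra word type).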
Please follow this description when implementing your algorithm, and do not attempt improvements in this part, as they might cause your code to fail some of our test cases.
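Steps 4 and 5 are where most bugs hide. Below is a minimal sketch of the trellis construction and backtrace, assuming the smoothed log-probability tables have already been built, with trans_lp keyed by (prev_tag, tag), emit_lp keyed by (tag, word), and unk_lp mapping each tag to the log-probability of an unseen word. The names mirror the toy example above but are illustrative, not the required interface:
# Illustrative sketch of Viterbi decoding for one sentence (list of words
# starting with 'START' and ending with 'END').
import math

def viterbi_sketch(sentence, tagset, trans_lp, emit_lp, unk_lp):
    # Emission log-prob, falling back to the per-tag unseen-word probability.
    def emission(tag, word):
        return emit_lp.get((tag, word), unk_lp[tag])
    # Each trellis column maps tag -> (best log-prob so far, backpointer to previous tag).
    trellis = [{'START': (0.0, None)}]
    for word in sentence[1:]:
        column = {}
        for tag in tagset:
            best_prev, best_score = None, -math.inf
            for prev_tag, (prev_score, _) in trellis[-1].items():
                # Smoothing should make every transition available; -inf is a defensive fallback.
                score = prev_score + trans_lp.get((prev_tag, tag), -math.inf) + emission(tag, word)
                if score > best_score:
                    best_prev, best_score = prev_tag, score
            column[tag] = (best_score, best_prev)
        trellis.append(column)
    # Backtrace from the best tag in the final column.
    best_tag = max(trellis[-1], key=lambda t: trellis[-1][t][0])
    tags = [best_tag]
    for column in reversed(trellis[1:]):
        best_tag = column[best_tag][1]
        tags.append(best_tag)
    return list(reversed(tags))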
help(submitted.viterbi)
Help on function viterbi in module submitted:

viterbi(test, train)
import time
importlib.reload(submitted)
train_set = utils.load_dataset('data/brown-training.txt')
dev_set = utils.load_dataset('data/brown-test.txt')
start_time = time.time()
predicted = submitted.viterbi(utils.strip_tags(dev_set), train_set)
time_spend = time.time() - start_time
accuracy, _, _ = utils.evaluate_accuracies(predicted, dev_set)
multi_tag_accuracy, unseen_words_accuracy = utils.specialword_accuracies(train_set, predicted, dev_set)
print("time spent: {0:.4f} sec".format(time_spend))
print("accuracy: {0:.4f}".format(accuracy))
print("multi-tag accuracy: {0:.4f}".format(multi_tag_accuracy))
print("unseen word accuracy: {0:.4f}".format(unseen_words_accuracy))
time spent: 9.2806 sec
accuracy: 0.9380
multi-tag accuracy: 0.9381
unseen word accuracy: 0.2490
Grade your homework¶
If you've reached this point, and all of the above sections work, then you're ready to try grading your homework! Before you submit it to Gradescope, try grading it on your own machine. This will run some visible test cases.
!python3 grade.py
{
"tests": [
{
"name": "Visible test for the assignment helper:",
"score": 10,
"max_score": 10,
"status": "passed",
"visibility": "visible"
},
{
"name": "Sanity check: baseline count tables have expected types and non-empty content.",
"score": 5,
"max_score": 5,
"status": "passed",
"visibility": "visible"
},
{
"name": "Sanity check HMM helpers:",
"score": 5,
"max_score": 5,
"status": "passed",
"visibility": "visible"
}
],
"leaderboard": [],
"visibility": "visible",
"execution_time": "1.11",
"score": 20
}
Now you should try uploading submitted.py to Gradescope.
Gradescope will run the same visible tests that you just ran on your own machine, plus some additional hidden tests. It's possible that your code passes all the visible tests, but fails the hidden tests. Debug by running your function with a variety of different input parameters, and see if you can get it to respond correctly in all cases.
Once your code works perfectly on Gradescope, with no errors, then you are done with the MP. Congratulations!