CS440/ECE448 Fall 2019
Assignment 3: Naive Bayes
Due date: Monday October 14th, 11:55pm
Updated 2018 By: Justin Lizama and Medhini Narasimhan
Updated 2019 By: Kedan Li and Weilin Zhang
Suppose that we're building an app that recommends movies. We've scraped a
large set of reviews off the web, but (for obvious reasons) we would like to
recommend only movies with positive reviews.
In this assignment, you will
use the Naive Bayes algorithm to train a binary sentiment classifier with a dataset of
movie reviews. The task is to learn a bag of words (unigram)
model that will classify a review as positive or negative based on the words it contains.
General guidelines
Basic instructions are similar to MP 2:
- For general instructions, see the main MP page and the
course policies.
- In addition to the standard python libraries, you should import nltk and numpy.
- Code will be submitted on gradescope.
- The extra credit portion will be submitted separately on gradescope.
In your MP average, the extra credit is worth 10% of the value of the regular part of the assignment.
Extra credit may be awarded for going beyond expectations or completing the suggestions below. Notice, however, that the score for each MP
is capped at 110%.
You are given a dataset consisting of positive and negative reviews.
Using the training set, you will learn a Naive Bayes classifier that will predict the right class
label given an unseen review. Use the development set to test the accuracy of your learned model.
We will have a separate (unseen) test set that we will use to run your code after you turn it in.
The dataset consists of 10000 positive and 3000 negative reviews, a subset of the Stanford Movie Review Dataset, which was originally introduced by this paper.
We have split this data set for you into 5000 development examples and 8000 training examples.
The data set can be downloaded here: zip, tar.
Unigram Model
The bag of words model in NLP is a simple unigram model which considers a text to
be represented as a bag of independent words. That is, we ignore the position
the words appear in, and only pay attention to their frequency in the text.
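For example (a quick illustration, not part of the provided code):

    from collections import Counter

    # Word order is ignored; only the word frequencies matter.
    a = Counter("a great great movie".split())
    b = Counter("great movie a great".split())
    print(a == b)   # True: identical bags of words despite different order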
Here each review consists of a group of words. Using Bayes' theorem, you need to
compute the probability of a review being positive given the words in the review.
Thus you need to estimate the posterior probabilities:
P(\text{Type}=\text{Positive} \mid \text{Words}) = \frac{P(\text{Type}=\text{Positive})}{P(\text{Words})} \prod_{\text{All words}} P(\text{Word} \mid \text{Type}=\text{Positive})

P(\text{Type}=\text{Negative} \mid \text{Words}) = \frac{P(\text{Type}=\text{Negative})}{P(\text{Words})} \prod_{\text{All words}} P(\text{Word} \mid \text{Type}=\text{Negative})
It is standard practice to use the log probabilities so as to avoid underflow.
Notice that P(Words) is the same in both formulas, so you can omit it.
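Equivalently, since P(Words) cancels, in log space a review is classified as Positive exactly when:

\log P(\text{Type}=\text{Positive}) + \sum_{\text{All words}} \log P(\text{Word} \mid \text{Type}=\text{Positive}) > \log P(\text{Type}=\text{Negative}) + \sum_{\text{All words}} \log P(\text{Word} \mid \text{Type}=\text{Negative})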
Part 1: Training and Development
Before starting: Make sure you install the nltk package with 'pip install nltk'.
Otherwise, the provided code will not run. More information about the package
can be found here.
- Training: To train the algorithm you are going to need to build
a bag of words model using the reviews. After you build the model, you will
need to estimate the log-likelihoods log P(Word|Type=Positive) and log P(Word|Type=Negative).
The variable Type can only take on two values, Positive or Negative.
Additionally, you will need to smooth the likelihoods to prevent
zero probabilities; use Laplace smoothing to accomplish this (see the sketch after this list).
- Development: After you have computed the log-likelihoods, you will
have your model predict whether each review in the development set is positive or negative.
To do this, perform maximum a posteriori (MAP) classification using
the sum of log probabilities shown above.
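Here is a minimal sketch of both steps; train and predict are hypothetical helper names, the labels are assumed to be 1 (positive) and 0 (negative), and the default prior is illustrative:

    import math
    from collections import Counter

    def train(train_set, train_labels, k=1.0):
        """Build per-class word counts and return Laplace-smoothed
        log-likelihood tables. train_set is a list of word lists."""
        counts = {1: Counter(), 0: Counter()}            # assume 1 = positive, 0 = negative
        for words, label in zip(train_set, train_labels):
            counts[label].update(words)
        vocab = set(counts[1]) | set(counts[0])
        loglike = {}
        for label in (0, 1):
            # +1 in the denominator leaves probability mass for unseen words
            denom = sum(counts[label].values()) + k * (len(vocab) + 1)
            loglike[label] = {w: math.log((counts[label][w] + k) / denom) for w in vocab}
            loglike[label][None] = math.log(k / denom)   # fallback for words never seen in training
        return loglike

    def predict(review, loglike, pos_prior=0.8):
        """MAP classification: log prior plus the sum of log-likelihoods."""
        scores = {}
        for label, prior in ((1, pos_prior), (0, 1.0 - pos_prior)):
            table = loglike[label]
            scores[label] = math.log(prior) + sum(table.get(w, table[None]) for w in review)
        return 1 if scores[1] > scores[0] else 0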
Your code should use only the training set to learn the individual probabilities.
Do not use the development data or any external sources of information.
The whole process (training and testing on the development set) should take no more than
1 minute and 1.1GB of memory. If your code gets killed on the autograder, we recommend cutting your vocabulary to 5000 words (i.e., keep only the 5000 most frequent words), as sketched below.
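A minimal sketch of one way to do that pruning, assuming each review is already a list of words (top_k_vocab is a hypothetical helper name):

    from collections import Counter

    def top_k_vocab(train_set, k=5000):
        """Return the k most frequent words across all training reviews."""
        freq = Counter(w for review in train_set for w in review)
        return {w for w, _ in freq.most_common(k)}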
We provide some flags for you to play around with.
For example, you can use the --lower_case flag to cast all words to lower case. You can also tune the Laplace smoothing parameter with the --laplace flag, or overwrite the default Laplace parameter inside your naiveBayes().
Tuning these options should boost the model's performance a bit.
How to change your flags
To overwrite the default flags, add global variables to your naive_bayes.py file outside the naiveBayes() function.
For example, if you want to change the --stemming flag, add a global variable called STEMMING.
Similarly, if you want to change the --lower_case flag, add a global variable called LOWER_CASE.
Note that to change the --laplace value you should not add a global flag; instead, overwrite the Laplace parameter inside your naiveBayes() function (see the example below).
You don't need to worry about the --pos_prior flag; we will pass in the correct prior for you. The autograder will run with default values if you don't add global flags or overwrite the Laplace parameter.
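For example, the top of naive_bayes.py might look like this (the flag values shown are illustrative, and the exact naiveBayes() signature is whatever the starter code defines):

    # Globals the autograder reads to override the default command-line flags.
    STEMMING = False     # overrides --stemming
    LOWER_CASE = True    # overrides --lower_case

    def naiveBayes(train_set, train_labels, dev_set, smoothing_parameter=1.0, pos_prior=0.8):
        # Overwrite the Laplace parameter here rather than through a global.
        smoothing_parameter = 0.1
        ...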
Part 2: Bigram Model (Extra Credit)
For the extra credit, you will attempt to get the best accuracy possible out of
your classifier, while still training only on the provided training dataset.
We recommend that you
implement the naive Bayes algorithm over a bigram model as opposed to the unigram model. Then combine the bigram model and the unigram model into a mixture model defined with parameter λ:
(1 - \lambda)\, P(Y) \prod_{i=1}^{n} P(w_i \mid Y) + \lambda\, P(Y) \prod_{i=1}^{m} P(b_i \mid Y)
You will need to find the parameter λ that gives the highest classification accuracy. To change your --stemming, --lower_case, and --laplace settings, see the "How to change your flags" instructions in Part 1.
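The product form above underflows for long reviews, so one common shortcut (an approximation, not the exact mixture) is to take the λ-weighted combination of the two log scores instead. A sketch, reusing the hypothetical log-likelihood tables from the Part 1 sketch (the bigram table is keyed by word pairs):

    def mixture_score(review, uni_loglike, bi_loglike, log_prior, lam=0.2):
        """Lambda-weighted mix of the unigram and bigram log scores for one class."""
        uni = log_prior + sum(uni_loglike.get(w, uni_loglike[None]) for w in review)
        bi = log_prior + sum(bi_loglike.get(b, bi_loglike[None])
                             for b in zip(review, review[1:]))
        return (1.0 - lam) * uni + lam * bi

Classify by computing this score for both classes and picking the larger; one reasonable way to tune λ is to sweep it from 0 to 1 on the development set and keep the value with the best accuracy.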
Gradescope will be running a leaderboard for extra credit submissions. You can optionally set up
an alias to compete for a place on the leaderboard (submit a separate naive_bayes.py). Placing high on the leaderboard will not affect
your grade, but it may make you feel good.
We have provided all of the code to get you started on your MP (tar, zip), which
means you will only have to implement the logic behind naive Bayes.
- reader.py - This file is responsible for reading in the data set.
It takes in all of the reviews, splits each review into a list of words,
stems them if you used the --stemming flag, and then stores all of the word
lists in a larger list (so that each individual list of words corresponds to a single review).
- mp3.py - This is the main file that starts the program, and computes the
accuracy, precision, recall, and F1-score using your implementation of naive Bayes.
- naive_bayes.py - This is the file where you will be doing all of your work.
The function naiveBayes() takes as input the training data, training labels, development set,
and a smoothing parameter. The training data provided is the output from reader.py.
The training labels are the list of labels corresponding to each review in the training data.
The development set is the list of reviews that you are going to test your implementation on.
The smoothing parameter is the Laplace smoothing parameter you specified with --laplace (it is 1 by default).
You will have naiveBayes() output the predicted labels for the development set from your naive Bayes model (see the skeleton after this list).
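Putting the pieces together, the body of naiveBayes() can be as small as the following (reusing the hypothetical train and predict helpers sketched in Part 1; the exact signature is defined in the starter code):

    def naiveBayes(train_set, train_labels, dev_set, smoothing_parameter=1.0, pos_prior=0.8):
        loglike = train(train_set, train_labels, k=smoothing_parameter)
        return [predict(review, loglike, pos_prior) for review in dev_set]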
Do not modify the provided code. You will only have to modify naive_bayes.py.
To understand more about how to run the MP, run python3 mp3.py -h in your terminal.
Submit naive_bayes.py on gradescope.
If you do the extra credit, submit your extra credit version of
naive_bayes.py to the separate extra credit assignment on gradescope.