CS440/ECE448 Fall 2019
Assignment 3: Naive Bayes
Due date: Monday October 14th, 11:55pm
Updated 2018 By: Justin Lizama and Medhini Narasimhan
Updated 2019 By: Kedan Li and Weilin Zhang
Suppose that we're building an app that recommends movies. We've scraped a
large set of reviews off the web, but (for obvious reasons) we would like to
recommend only movies with positive reviews.
In this assignment, you will
use the Naive Bayes algorithm to train a binary sentiment classifier with a dataset of
movie reviews. The task is to learn a bag of words (unigram)
model that will classify a review as positive or negative based on the words it contains.
General guidelines
Basic instructions are similar to MP 2:
- For general instructions, see the main MP page and the
course policies.
- In addition to the standard python libraries, you should import nltk and numpy.
- Code will be submitted on gradescope.
- The extra credit portion will be submitted separately on gradescope.
In your MP average, the extra credit is worth 10% of the value of the regular part of the assignment.
Extra credit may be awarded for going beyond expectations or completing the suggestions below. Notice, however, that the score for each MP
is capped at 110%.
You are given a dataset consisting of positive and negative reviews.
Using the training set, you will learn a Naive Bayes classifier that will predict the right class
label given an unseen review. Use the development set to test the accuracy of your learned model.
We will have a separate (unseen) test set that we will use to run your code after you turn it in.
The dataset consists of 10000 positive and 3000 negative reviews, a subset of the Stanford Movie Review Dataset, which was originally introduced by this paper.
We have split this data set for you into 5000 development examples and 8000 training examples.
The data set can be downloaded here: zip, tar.
Unigram Model
The bag of words model in NLP is a simple unigram model which considers a text to
be represented as a bag of independent words. That is, we ignore the position
the words appear in, and only pay attention to their frequency in the text.
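For example (a quick illustration, not part of the provided code):

    from collections import Counter

    # Word order is ignored; only the word frequencies matter.
    a = Counter("a great great movie".split())
    b = Counter("great movie a great".split())
    print(a == b)   # True: identical bags of words despite different order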
Here each review consists of a group of words. Using Bayes' theorem, you need to
compute the probability of a review being positive given the words in the review.
Thus you need to estimate the posterior probabilities:
P(\text{Type}=\text{Positive} \mid \text{Words}) = \frac{P(\text{Type}=\text{Positive})}{P(\text{Words})} \prod_{\text{All words}} P(\text{Word} \mid \text{Type}=\text{Positive})

P(\text{Type}=\text{Negative} \mid \text{Words}) = \frac{P(\text{Type}=\text{Negative})}{P(\text{Words})} \prod_{\text{All words}} P(\text{Word} \mid \text{Type}=\text{Negative})
It is standard practice to use the log probabilities so as to avoid underflow.
Notice that P(Words) is the same in both formulas, so you can omit it.
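Equivalently, since P(Words) cancels, in log space a review is classified as Positive exactly when:

\log P(\text{Type}=\text{Positive}) + \sum_{\text{All words}} \log P(\text{Word} \mid \text{Type}=\text{Positive}) > \log P(\text{Type}=\text{Negative}) + \sum_{\text{All words}} \log P(\text{Word} \mid \text{Type}=\text{Negative})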
Part 1: Training and Development
Before starting: Make sure you install the nltk package with 'pip install nltk'.
Otherwise, the provided code will not run. More information about the package
can be found here.
- Training: To train the algorithm you are going to need to build
a bag of words model using the reviews. After you build the model, you will
need to estimate the log-likelihoods log P(Word|Type=Positive) and log P(Word|Type=Negative).
The variable Type can only take on two values, Positive or Negative.
Additionally, you will need to smooth the likelihoods to prevent
zero probabilities; use Laplace smoothing to accomplish this (see the sketch after this list).
- Development: After you have computed the log-likelihoods, you will
have your model predict whether each review in the development set is positive or negative.
To do this, perform maximum a posteriori (MAP) classification using
the sum of log probabilities shown above.
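Here is a minimal sketch of both steps; train and predict are hypothetical helper names, the labels are assumed to be 1 (positive) and 0 (negative), and the default prior is illustrative:

    import math
    from collections import Counter

    def train(train_set, train_labels, k=1.0):
        """Build per-class word counts and return Laplace-smoothed
        log-likelihood tables. train_set is a list of word lists."""
        counts = {1: Counter(), 0: Counter()}            # assume 1 = positive, 0 = negative
        for words, label in zip(train_set, train_labels):
            counts[label].update(words)
        vocab = set(counts[1]) | set(counts[0])
        loglike = {}
        for label in (0, 1):
            # +1 in the denominator leaves probability mass for unseen words
            denom = sum(counts[label].values()) + k * (len(vocab) + 1)
            loglike[label] = {w: math.log((counts[label][w] + k) / denom) for w in vocab}
            loglike[label][None] = math.log(k / denom)   # fallback for words never seen in training
        return loglike

    def predict(review, loglike, pos_prior=0.8):
        """MAP classification: log prior plus the sum of log-likelihoods."""
        scores = {}
        for label, prior in ((1, pos_prior), (0, 1.0 - pos_prior)):
            table = loglike[label]
            scores[label] = math.log(prior) + sum(table.get(w, table[None]) for w in review)
        return 1 if scores[1] > scores[0] else 0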
Your code should use only the training set to learn the individual probabilities.
Do not use the development data or any external sources of information.
The whole process (training and testing on the development set) should take no more than
1 minute and 1.1GB of memory. If your code gets killed on the autograder, we recommend cutting your vocabulary to 5000 words (i.e., keep only the 5000 most frequent words), as sketched below.
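A minimal sketch of one way to do that pruning, assuming each review is already a list of words (top_k_vocab is a hypothetical helper name):

    from collections import Counter

    def top_k_vocab(train_set, k=5000):
        """Return the k most frequent words across all training reviews."""
        freq = Counter(w for review in train_set for w in review)
        return {w for w, _ in freq.most_common(k)}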
We provide some flags for you to play around with.
For example, you can use the --lower_case flag to cast all words to lower case. You can also tune the Laplace smoothing parameter with the --laplace flag, or overwrite the default Laplace parameter inside your naiveBayes().
Tuning these options should boost the model's performance a bit.
How to change your flags
To overwrite the default flags, add global variables to your naive_bayes.py file outside the naiveBayes() function.
For example, if you want to change the --stemming flag, add a global variable called STEMMING.
Similarly, if you want to change the --lower_case flag, add a global variable called LOWER_CASE.
Note that to change the --laplace value you should not add a global flag; instead, overwrite the Laplace parameter inside your naiveBayes() function (see the example below).
You don't need to worry about the --pos_prior flag; we will pass in the correct prior for you. The autograder will run with default values if you don't add global flags or overwrite the Laplace parameter.
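For example, the top of naive_bayes.py might look like this (the flag values shown are illustrative, and the exact naiveBayes() signature is whatever the starter code defines):

    # Globals the autograder reads to override the default command-line flags.
    STEMMING = False     # overrides --stemming
    LOWER_CASE = True    # overrides --lower_case

    def naiveBayes(train_set, train_labels, dev_set, smoothing_parameter=1.0, pos_prior=0.8):
        # Overwrite the Laplace parameter here rather than through a global.
        smoothing_parameter = 0.1
        ...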
Part 2: Bigram Model (Extra Credit)
For the extra credit, you will attempt to get the best accuracy possible out of
your classifier, while still training only on the provided training dataset.
We recommend that you
implement the naive Bayes algorithm over a bigram model as opposed to the unigram model. Then combine the bigram model and the unigram model into a mixture model defined with parameter λ:
(1 - \lambda)\, P(Y) \prod_{i=1}^{n} P(w_i \mid Y) + \lambda\, P(Y) \prod_{i=1}^{m} P(b_i \mid Y)
You will need to find the parameter λ that gives the highest classification accuracy. To change your --stemming, --lower_case, and --laplace settings, see the "How to change your flags" instructions in Part 1.
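The product form above underflows for long reviews, so one common shortcut (an approximation, not the exact mixture) is to take the λ-weighted combination of the two log scores instead. A sketch, reusing the hypothetical log-likelihood tables from the Part 1 sketch (the bigram table is keyed by word pairs):

    def mixture_score(review, uni_loglike, bi_loglike, log_prior, lam=0.2):
        """Lambda-weighted mix of the unigram and bigram log scores for one class."""
        uni = log_prior + sum(uni_loglike.get(w, uni_loglike[None]) for w in review)
        bi = log_prior + sum(bi_loglike.get(b, bi_loglike[None])
                             for b in zip(review, review[1:]))
        return (1.0 - lam) * uni + lam * bi

Classify by computing this score for both classes and picking the larger; one reasonable way to tune λ is to sweep it from 0 to 1 on the development set and keep the value with the best accuracy.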
Gradescope will be running a leaderboard for extra credit submissions. You can optionally set up
an alias to compete for a place on the leaderboard (submit a separate naive_bayes.py). Placing high on the leaderboard will not affect
your grade, but it may make you feel good.
We have provided all of the code to get you started on your MP (tar, zip), which
means you will only have to implement the logic behind naive Bayes.
- reader.py - This file is responsible for reading in the data set.
It takes in all of the reviews, splits each review into a list of words,
stems them if you used the --stemming flag, and then stores all of the word
lists in a larger list (so that each individual list of words corresponds to a single review).
- mp3.py - This is the main file that starts the program, and computes the
accuracy, precision, recall, and F1-score using your implementation of naive Bayes.
- naive_bayes.py - This is the file where you will be doing all of your work.
The function naiveBayes() takes as input the training data, training labels, development set,
and a smoothing parameter. The training data provided is the output from reader.py.
The training labels are the list of labels corresponding to each review in the training data.
The development set is the list of reviews that you are going to test your implementation on.
The smoothing parameter is the Laplace smoothing parameter you specified with --laplace (it is 1 by default).
You will have naiveBayes() output the predicted labels for the development set from your naive Bayes model (see the skeleton after this list).
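Putting the pieces together, the body of naiveBayes() can be as small as the following (reusing the hypothetical train and predict helpers sketched in Part 1; the exact signature is defined in the starter code):

    def naiveBayes(train_set, train_labels, dev_set, smoothing_parameter=1.0, pos_prior=0.8):
        loglike = train(train_set, train_labels, k=smoothing_parameter)
        return [predict(review, loglike, pos_prior) for review in dev_set]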
Do not modify the provided code. You will only have to modify naive_bayes.py.
To understand more about how to run the MP, run python3 mp3.py -h in your terminal.
Submit naive_bayes.py on gradescope.
If you do the extra credit, submit your extra credit version of
naive_bayes.py to the separate extra credit assignment on gradescope.