CS440/ECE448 Fall 2020

Assignment 3: Naive Bayes

Due date: Wednesday October 7th, End of the Day

Updated 2020 By: Michal Shlapentokh-Rothman and Xiaoming Zhao

Updated 2019 By: Kedan Li and Weilin Zhang

Updated 2018 By: Justin Lizama and Medhini Narasimhan

Suppose that we're building an app that recommends movies. We've scraped a large set of reviews off the web, but (for obvious reasons) we would like to recommend only movies with positive reviews. In this assignment, you will use the Naive Bayes algorithm to train a binary sentiment classifier with a dataset of movie reviews. The task is to learn a bag of words (unigram, bigram) model that will classify a review as positive or negative based on the words it contains.

General guidelines

Basic instructions are similar to MP 2.

Problem Statement

You are given a dataset consisting of positive and negative reviews. Using the training set, you will learn a Naive Bayes classifier that will predict the right class label given an unseen review. Use the development set to test the accuracy of your learned model. We will have a separate (unseen) test set that we will use to run your code after you turn it in.

Dataset

The dataset consists of 10000 positive and 3000 negative reviews, a subset of the Stanford Movie Review Dataset, which was originally introduced by this paper. We have split this data set for you into 5000 development examples and 8000 training examples. The data set can be downloaded here: zip tar.

Unigram Model (80% credits)

The bag of words model in NLP is a simple unigram model which represents a text as a bag of independent words. That is, we ignore the positions in which the words appear and pay attention only to their frequency in the text. Here each review consists of a group of words. Using Bayes' theorem, you need to compute the probability of a review being positive given the words in the review. Thus you need to estimate the posterior probabilities and return the type label (Positive or Negative) with the higher posterior probability:

P(Type=Positive|Words) = \frac{P(Type=Positive)}{P(Words)} \prod_{\text{All words}} P(Word|Type=Positive)

P(Type=Negative|Words) = \frac{P(Type=Negative)}{P(Words)} \prod_{\text{All words}} P(Word|Type=Negative)

It is standard practice to use log probabilities so as to avoid underflow. Notice that P(Words) is the same in both formulas, so you can omit it.
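As a rough sketch, the log-posterior comparison can be written in a few lines of Python. This is a toy example only: the function and variable names are illustrative and are not part of the provided skeleton, and it assumes Laplace smoothing (described in Part 1) with constant `alpha`.

```python
import math
from collections import Counter

def log_posterior(words, prior, class_counts, vocab_size, alpha=1.0):
    """Log P(Type | words) up to the shared constant log P(words).

    class_counts maps word -> count of that word in this class's
    training text; alpha is the Laplace smoothing constant.
    """
    total = sum(class_counts.values())
    score = math.log(prior)
    for w in words:
        # Laplace-smoothed estimate of P(w | Type); unseen words get
        # a small but nonzero probability instead of zero
        p = (class_counts.get(w, 0) + alpha) / (total + alpha * (vocab_size + 1))
        score += math.log(p)
    return score

# Toy "training data": one review per class
pos_counts = Counter("great fun great".split())
neg_counts = Counter("boring dull dull".split())
vocab = set(pos_counts) | set(neg_counts)

review = "great fun".split()
pos = log_posterior(review, 0.5, pos_counts, len(vocab))
neg = log_posterior(review, 0.5, neg_counts, len(vocab))
print("Positive" if pos > neg else "Negative")  # prints "Positive"
```

Because only the comparison between the two scores matters, dropping the common log P(Words) term does not change the predicted label.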

Bigram Model (20% credits)

The unigram model assumes that every word is independent, which is too strong an assumption to hold in the real world. Instead, we can use a bigram model, which processes every pair of adjacent words together. Similarly, the posterior probabilities are:

P(Type=Positive|Words) = \frac{P(Type=Positive)}{P(Words)} \prod_{\text{All word pairs}} P(Word Pair|Type=Positive)

P(Type=Negative|Words) = \frac{P(Type=Negative)}{P(Words)} \prod_{\text{All word pairs}} P(Word Pair|Type=Negative)

Then combine the bigram model and the unigram model into a mixture model defined with parameter λ:

(1-\lambda)\log\left[P(Y)\prod_{i=1}^{n} P(w_i|Y)\right] + \lambda\log\left[P(Y)\prod_{i=1}^{m} P(b_i|Y)\right]

You will need to find the parameter λ that gives the highest classification accuracy.
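The mixture is just a λ-weighted sum of the two per-class log scores. A minimal sketch, with made-up log scores standing in for the unigram and bigram terms (none of these names come from the provided skeleton):

```python
def mixture_score(unigram_log, bigram_log, lam):
    # (1 - lambda) * unigram term + lambda * bigram term,
    # both already in log space
    return (1 - lam) * unigram_log + lam * bigram_log

# Hypothetical per-class log scores for one review
uni = {"Positive": -12.3, "Negative": -14.1}
bi = {"Positive": -20.5, "Negative": -19.8}

def predict(lam):
    scores = {c: mixture_score(uni[c], bi[c], lam) for c in uni}
    return max(scores, key=scores.get)

print(predict(0.2))  # unigram-dominated -> "Positive"
print(predict(0.9))  # bigram-dominated  -> "Negative"
```

As the toy numbers show, different λ values can flip the predicted label, which is why you tune λ against development-set accuracy.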

Part 0: Using Virtual Environment

What is a virtual environment? A Python virtual environment is an isolated environment for Python projects. Imagine you have two projects, ProjectA and ProjectB. For legacy reasons, ProjectA requires Python 2 while ProjectB only works with Python 3. In such a case, you ideally need separate environments to run the two projects, and virtual environments make switching between them smooth. Even without conflicts between projects, you may not want to contaminate the system-level Python environment with your own packages, since a package version mismatch can break some of your system processes. It is therefore a good habit to create a new virtual environment whenever you start a new research or course project, since you never know what you will need in the future.

Is it difficult to use a virtual environment? No, not at all!
conda, an OS-agnostic tool, takes care of all the cumbersome details for you.

How to use a virtual environment? Download and install the miniconda build for your OS. Do not worry if the Python version shown on the webpage does not match your needs; that is only the version miniconda installs by default. You are free to install any Python version later with conda, which is one reason we use virtual environments in the first place.

I know something called Anaconda; is it the same as miniconda? Anaconda is essentially the same as miniconda. However, Anaconda is much larger (~3 GB) than miniconda (~50 MB) because Anaconda pre-installs many packages, which you may not need. All of the packages bundled with Anaconda can be installed later based on your needs, so we recommend using miniconda. More comparison can be found here.

OK, I have installed miniconda; what's next? In this project, we use Python 3.7. Whether or not you already have Python 3.7, we recommend getting yourself familiar with conda. Create a virtual environment for this MP with the following command: conda create --name cs440-mp3 python=3.7 Then, every time you want to use the virtual environment, activate it with conda activate cs440-mp3 You can install packages with conda install XXX or pip install XXX Either way installs packages only in this cs440-mp3 environment. When you no longer want to use the environment, deactivate it with conda deactivate More information can be found here.

Part 1: Training and Development

Before starting: Make sure you install the nltk package with 'pip install nltk'. Otherwise, the provided code will not run. More information about the package can be found here.
  • Training: To train the algorithm, you are going to need to build a bag of words model using the reviews in the training set. After you build the model, you will need to estimate the log-likelihoods log P(Word|Type=Positive). The variable Type can only take on two values, Positive or Negative. Additionally, you will need to smooth the likelihoods to prevent zero probabilities. In order to accomplish this, use Laplace smoothing.

  • Development: After you have computed the log-likelihoods, you will have your model predict whether or not the reviews from the development set are positive. In order to do this, you will do maximum a posteriori (MAP) classification using the sum of log probabilities shown above.
Train/val/test split: Your code should use only the training set to learn the individual probabilities, and the development data only for validation. Do not train on the development data or use any external sources of information. See the discussion about train/val/test splits here.

The whole process (training and testing on the development set) should take no more than 1 minute and 1.1 GB of memory. Within these limits, you may be able to use the entire vocabulary in the training data. However, if your implementation is less efficient, you may need to cut your vocabulary size (e.g., keep only the top 5000 words).
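Putting the training and development steps together, the counting, Laplace smoothing, and MAP classification can be sketched as below. This is a self-contained toy example under assumed conventions (labels 1 for Positive, 0 for Negative); the function names and signatures are illustrative, not those required by naive_bayes.py.

```python
import math
from collections import Counter

def train(train_set, train_labels):
    """Count word frequencies per class and estimate class priors."""
    counts = {0: Counter(), 1: Counter()}
    for words, y in zip(train_set, train_labels):
        counts[y].update(words)
    vocab_size = len(set(counts[0]) | set(counts[1]))
    priors = {y: train_labels.count(y) / len(train_labels) for y in (0, 1)}
    return counts, priors, vocab_size

def classify(words, counts, priors, vocab_size, alpha=1.0):
    """MAP prediction: pick the class with the larger sum of logs."""
    best, best_score = None, -math.inf
    for y in (0, 1):
        total = sum(counts[y].values())
        score = math.log(priors[y])
        for w in words:
            # Laplace smoothing keeps unseen words from zeroing the product
            score += math.log((counts[y][w] + alpha)
                              / (total + alpha * (vocab_size + 1)))
        if score > best_score:
            best, best_score = y, score
    return best

# Toy data: lists of word lists, as reader.py would produce
train_set = [["great", "movie"], ["fun", "great"], ["dull", "boring"]]
train_labels = [1, 1, 0]
model = train(train_set, train_labels)
print(classify(["great", "fun"], *model))  # prints 1 (Positive)
```

Precomputing the per-class totals and log-likelihoods once at training time, rather than inside the classification loop, is what keeps a full-vocabulary implementation within the time limit.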

Part 2: Improve your Accuracy

In naive_bayes.py, we have two functions, naiveBayes and bigramBayes, which are used for the unigram and bigram models respectively. Each function has some keyword arguments that we expect to affect the final accuracies. Try different values for those arguments and see how the accuracy changes.

Provided Code Skeleton

We have provided (tar, zip) all the code to get you started on your MP, which means you will only have to implement the logic behind naive bayes.

  • reader.py - This file is responsible for reading in the data set. It takes in all of the reviews, splits each review into a list of words, stems them if you used the --stemming flag, and then stores all of the lists of words in a larger list (so that each individual list of words corresponds to a single review).

  • mp3.py - This is the main file that starts the program, and computes the accuracy, precision, recall, and F1-score using your implementation of naive Bayes.

  • naive_bayes.py - This is the file where you will be doing all of your work. Each function takes as input the training data, training labels, development set, and other parameters. The training data provided is the output from reader.py. The training labels are the list of labels corresponding to each review in the training data. The development set is the list of reviews on which you are going to test your implementation. You will have naiveBayes() and bigramBayes() output the predicted labels for the development set from your naive Bayes model.
Do not modify the provided code. You will only have to modify naive_bayes.py.

To understand more about how to run the MP, run python3 mp3.py -h in your terminal.

Code Submission

Submit naive_bayes.py on gradescope. Gradescope will be running a leaderboard for extra credit submissions. You can optionally set up an alias to compete for a place on the leaderboard (submit a separate naive_bayes.py). Placing high on the leaderboard will not affect your grade, but it may make you feel good.