Updated 2019 By: Kedan Li and Weilin Zhang
Updated 2018 By: Justin Lizama and Medhini Narasimhan

Suppose that we're building an app that recommends movies. We've scraped a large set of reviews off the web, but (for obvious reasons) we would like to recommend only movies with positive reviews. In this assignment, you will use the Naive Bayes algorithm to train a binary sentiment classifier on a dataset of movie reviews. The task is to learn a bag-of-words (unigram, bigram) model that classifies a review as positive or negative based on the words it contains.
General guidelines

Basic instructions are similar to MP 2.
Problem Statement

You are given a dataset consisting of positive and negative reviews. Using the training set, you will learn a Naive Bayes classifier that predicts the right class label given an unseen review. Use the development set to test the accuracy of your learned model. We will have a separate (unseen) test set that we will use to run your code after you turn it in.

Dataset

The dataset consists of 10000 positive and 3000 negative reviews, a subset of the Stanford Movie Review Dataset, which was originally introduced by this paper. We have split this data set for you into 5000 development examples and 8000 training examples. The data set can be downloaded here: zip tar.

Unigram Model (80% credits)

The bag-of-words model in NLP is a simple unigram model which represents a text as a bag of independent words. That is, we ignore the positions the words appear in, and only pay attention to their frequency in the text. Here each review consists of a group of words. Using Bayes' theorem, you need to compute the probability of a review being positive given the words in the review. Thus you need to estimate the posterior probabilities and return the type label (Positive or Negative) with the higher probability:

\[ P(\text{Type}=\text{Positive} \mid \text{Words}) = \frac{P(\text{Type}=\text{Positive})}{P(\text{Words})} \prod_{\text{All words}} P(\text{Word} \mid \text{Type}=\text{Positive}) \]

\[ P(\text{Type}=\text{Negative} \mid \text{Words}) = \frac{P(\text{Type}=\text{Negative})}{P(\text{Words})} \prod_{\text{All words}} P(\text{Word} \mid \text{Type}=\text{Negative}) \]

It is standard practice to use log probabilities so as to avoid underflow. Notice that P(Words) is the same in both formulas, so you can omit it. (A minimal sketch of these computations appears under Part 1 below.)

Bigram Model (20% credits)

The unigram model assumes that every word is independent, which is too strong an assumption to hold in the real world. Instead, we can use a bigram model that treats every pair of adjacent words as a single feature. Analogously, the posterior probabilities are

\[ P(\text{Type}=\text{Positive} \mid \text{Words}) = \frac{P(\text{Type}=\text{Positive})}{P(\text{Words})} \prod_{\text{All word pairs}} P(\text{Word Pair} \mid \text{Type}=\text{Positive}) \]

\[ P(\text{Type}=\text{Negative} \mid \text{Words}) = \frac{P(\text{Type}=\text{Negative})}{P(\text{Words})} \prod_{\text{All word pairs}} P(\text{Word Pair} \mid \text{Type}=\text{Negative}) \]

Then combine the bigram model and the unigram model into a mixture model defined with parameter λ:

\[ (1-\lambda)\log\Big[ P(Y) \prod_{i=1}^{n} P(w_i \mid Y) \Big] + \lambda \log\Big[ P(Y) \prod_{i=1}^{m} P(b_i \mid Y) \Big] \]

where w_1, ..., w_n are the words of the review and b_1, ..., b_m are its bigrams (adjacent word pairs). You will need to find the parameter λ that gives the highest classification accuracy.

Part 0: Using Virtual Environment

What is a virtual environment? A Python virtual environment is an isolated environment for Python projects. Imagine you have two projects, ProjectA and ProjectB. For legacy reasons, ProjectA requires Python 2 while ProjectB only works in Python 3. In such a case, you ideally need separate environments to run these two projects, and a Python virtual environment makes switching between them smooth. Even without conflicts between projects, you may not want to contaminate the system-level Python environment with your own packages, since a package version mismatch may crash some of your system processes. Therefore, it is a good habit to create a new virtual environment when you start a new research or course project, since you never know what you will need in the future.

Is it difficult to use a virtual environment? No, not at all! conda, an OS-agnostic tool, takes care of all the cumbersome details for you.

How to use a virtual environment? Download and install the corresponding miniconda for your OS. Do not worry about the Python version shown on the webpage if it does not match your needs; that version is only the Python version miniconda installs by default.
You are free to install any Python version later with conda, which is exactly why we use a virtual environment.

I know something called Anaconda; is it the same as miniconda? Anaconda is essentially the same as miniconda, but Anaconda is much larger (~3GB) than miniconda (~50MB) because it pre-installs many packages that you may not need. All packages bundled with Anaconda can be installed later based on your needs, so we recommend using miniconda. A more detailed comparison can be found here.

OK, I have installed miniconda, what's next? In this project, we use Python 3.7. Whether or not you already have Python 3.7, we recommend getting yourself familiar with conda. Create a virtual environment for this MP with a command such as 'conda create -n <env_name> python=3.7', then activate it with 'conda activate <env_name>'.

Part 1: Training and Development

Before starting: make sure you install the nltk package with 'pip install nltk'. Otherwise, the provided code will not run. More information about the package can be found here.
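To make the unigram formulas above concrete, here is a minimal, self-contained sketch of the training and classification steps. It is not the provided skeleton's interface (the function and variable names are made up for illustration, and reviews are assumed to be pre-tokenized lists of words); it only shows Laplace-smoothed log-probability estimation and the argmax over the two class posteriors.

```python
import math
from collections import Counter

def train_unigram(reviews, labels, laplace=1.0):
    """Count word frequencies per class and compute log priors.
    reviews: list of tokenized reviews (lists of words); labels: 1 = positive, 0 = negative.
    """
    counts = {0: Counter(), 1: Counter()}
    for words, y in zip(reviews, labels):
        counts[y].update(words)
    vocab = set(counts[0]) | set(counts[1])
    total = {y: sum(counts[y].values()) for y in (0, 1)}
    log_prior = {y: math.log(labels.count(y) / len(labels)) for y in (0, 1)}

    def log_likelihood(word, y):
        # Laplace smoothing so unseen words do not zero out the product;
        # the +1 in the denominator reserves mass for out-of-vocabulary words.
        return math.log((counts[y][word] + laplace) /
                        (total[y] + laplace * (len(vocab) + 1)))

    return log_prior, log_likelihood

def classify(words, log_prior, log_likelihood):
    """Return 1 (positive) or 0 (negative) for a tokenized review."""
    scores = {}
    for y in (0, 1):
        # log P(Y) + sum over words of log P(word | Y); P(Words) is omitted
        scores[y] = log_prior[y] + sum(log_likelihood(w, y) for w in words)
    return max(scores, key=scores.get)
```

If memory or runtime becomes an issue, you could restrict the vocabulary to the most frequent words before computing the likelihoods, as discussed next.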
The whole process (training and testing on the development set) should take no more than 1 minute and 1.1GB of memory. Within these restrictions, you may be able to use the entire vocabulary in the training data. However, if your implementation is less efficient, you may need to cut your vocabulary size (e.g. only keep the top 5000 most frequent words).

Part 2: Improve your Accuracy

In naive_bayes.py, we have two functions, naiveBayes and bigramBayes, which are used for the unigram and bigram models respectively. Each function has keyword arguments that we expect to affect the final accuracies. Try different values for those arguments and see how accurate the model becomes (a tuning sketch is given after the skeleton description below).

Provided Code Skeleton

We have provided (tar, zip) all the code to get you started on your MP, which means you will only have to implement the logic behind Naive Bayes.
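The Part 2 tuning loop can be as simple as a grid search over λ on the development set. The sketch below does not use the skeleton's API: score_unigram and score_bigram are hypothetical helpers that return the bracketed log terms from the mixture formula above, and dev_set is assumed to be a list of (tokenized_review, label) pairs.

```python
def mixture_score(words, y, lam, score_unigram, score_bigram):
    """Mixture of unigram and bigram log scores for class y,
    following (1 - lambda) * unigram_term + lambda * bigram_term."""
    bigrams = list(zip(words, words[1:]))  # adjacent word pairs
    return (1 - lam) * score_unigram(words, y) + lam * score_bigram(bigrams, y)

def tune_lambda(dev_set, score_unigram, score_bigram, candidates=None):
    """Pick the lambda with the highest accuracy on the development set."""
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    best_lam, best_acc = None, -1.0
    for lam in candidates:
        correct = 0
        for words, label in dev_set:
            pred = max((0, 1), key=lambda y: mixture_score(words, y, lam,
                                                           score_unigram, score_bigram))
            correct += (pred == label)
        acc = correct / len(dev_set)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc
```

Keep in mind that a λ chosen on the development set may be slightly optimistic; the separate unseen test set is what your submission is ultimately graded on.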
To understand more about how to run the MP, run 'python3 mp3.py -h' in your terminal.

Code Submission

Submit naive_bayes.py on Gradescope. Gradescope will be running a leaderboard for extra credit submissions. You can optionally set up an alias to compete for a place on the leaderboard (submit a separate naive_bayes.py). Placing high on the leaderboard will not affect your grade, but it may make you feel good.