Updated 2021 By: Ayush Sarkar and Kiran Ramnath Updated 2020 By: Jaewook Yeom Updated 2019 By: Kedan Li and Weilin Zhang Updated 2018 By: Justin Lizama and Medhini Narasimhan Spam emails are a common issue we all encounter! In this assignment, you will use the Naive Bayes algorithm to train a Spam classifier with a dataset of emails that is provided to you. The task in Part 1 is to learn a bag of words (unigram) model that will classify an email as spam or ham (not spam) based on the words it contains. In Part 2, you will combine the unigram and bigram models to achieve better performance on email classification. Contents
General guidelinesBasic instructions are similar to MP 1:
Problem StatementYou are given a dataset consisting of Spam and Ham (not spam) emails. Using the training set, you will learn a Naive Bayes classifier that will predict the right class label given an unseen email. Use the development set to test the accuracy of your learned model, with the included function grade.py. We will have multiple hidden test sets that we will use to run your code after you turn it in. DatasetThe dataset that we have provided to you consists of 1500 ham and 1500 spam emails, a subset of the Enron-Spam dataset provided by Ion Androutsopoulos. This dataset is split into 2000 training examples and 1000 development examples. This dataset is located in the spam_data folder of the template code provided. You will also find another dataset located under the counter-data folder of the template code - this data is only used by our grade.py file to check whether your model implementation is correct.BackgroundThe bag of words model in NLP is a simple unigram model which considers a text to be represented as a bag of independent words. That is, we ignore the position the words appear in, and only pay attention to their frequency in the text. Here, each email consists of a group of words. Using Bayes theorem, you need to compute the probability that the label of an email (Y) should be ham (Y=ham) given the words in the email. Thus you need to estimate the posterior probabilities: P(Y=ham|words)=P(Y=ham)P(words)∏x∈words in the e−mailP(X=x|Y=ham) P(Y=spam|words)=P(Y=spam)P(words)∏x∈words in the e−mailP(X=x|Y=spam) You will need to use log of the probabilities to prevent underflow/precision issues; apply log to both sides of the equation. Notice that P(words) is the same in both formulas, so you can omit it (set term to 1).Part 1: Unigram ModelBefore starting: Make sure you install the nltk package with 'pip install nltk' and/or 'pip3 install nltk', depending on which Python version you plan on using (we suggest you use Python 3.8 or 3.9). Otherwise, the provided code will not run. More information about the package can be found here
Laplace Smoothing: You will need to make sure you smooth the likelihoods to prevent zero probabilities. In order to accomplish this task, use Laplace smoothing. This is where the Laplace smoothing parameter will come into play. You can use the following formula for Laplace smoothing when calculating the likelihood probabilities: P(X=x|Y=y)=Count(X=x|Y=y)+kN(Y=y)+k(1+|X|Y=y) where Count(X=x|Y=y) is the number of times that the word was x given that the label was y, k is the Laplace smoothing parameter, N(Y=y)=∑xCount(X=x|Y=y) is the total number of word tokens that exist in e-mails that are labeled Y=y, and |X|Y=y is the number of distinct word types that exist in e-mails labeled Y=y. A note on out of vocabulary (OOV) words: Suppose that the dev set includes any words that never occurred in the training set. In that case, the term Count(X=x|Y=y)=0, but all of the other terms in the above formula are nonzero, so P(X=x|Y=y)≠0. This should not be something you compute during your analysis of the training set, but it's a backoff value that you'll need to have available when you analyze the dev set. Part 2: Unigram and Bigram ModelsYou can choose to find the best parameter λ that gives the highest classification accuracy. There are additional parameters that you can tune (for more details, see here). However, I should note that you can get away with not tuning these parameters at all, since the autograder will choose these values for you when you grade. So feel free to play around with these parameters, but in the end, the hyperparameters you choose when you run on your local machine won't matter in grading. Some weird things happen when using the bigram model. One of them happens when you use Laplace smoothing to estimate P(Bigrami=bi|Y=y) from training set data. Obviously, Count(Bigrami=bi|Y=spam) is the number of times bigram bi occurred in spam e-mails, and k is the bigram Laplace smoothing coefficient. The weird thing is that N(Y=spam) should equal the total number of bigrams that occurred in spam emails, plus the total number of unigrams that occurred in spam emails. Similarly, |X|Y=spam should equal the total number of distinct bigrams that occurred in spam emails, plus the total number of distinct unigrams that occurred in spam emails. We're not sure why the bigram model gives higher accuracy with unigrams counted in the denominator, but for some reason, it does. Part 3: Extra creditFor the extra credit part of this MP, you can try a more informative measure of the importance of each word. tf-idf is used in information retrieval and text mining applications to determine how important a word is to a document within a collection of documents. The tf (term-frequency) term determines how frequently a word appears in a document (thus tf is kind of related to naive Bayes, though it's not exactly the same thing). The idf (inverse document frequency) term determines how rare (and thus perhaps more informative) a word is in the collection. A high tf-idf score might indicate that a word has high term frequency in a particular document, and also that it is has a low document frequency (low number of documents that contains the given word). Provided Code SkeletonWe have provided (tar, zip) all the code to get you started on your MP. Note that the only files you should need to modify are naive_bayes.py for Part 1 and tf-idf.py for Part 2. Those are the only files that you will upload to the autograder. Files Used for Part 1&2
mp1.pyHere is an example of how you would run your code for Part 1 from your terminal/shell:python3 mp1.py --training ../data/spam_data/train --development ../data/spam_data/dev --stemming False --lowercase False --laplace 1.0 --pos_prior 0.8 Here is an example of how you would run your code for Part 2 from your terminal/shell:python3 mp1.py --bigram True --training ../data/spam_data/train --development ../data/spam_data/dev --stemming False --lowercase False --bigram_laplace 1.0 --bigram_lambda 0.5 --pos_prior 0.8 Note that you can and should change the parameters as necessary. To understand more about how to run the MP for Part 1, run python3 mp1.py -h in your terminal.Optional: How to change your flagWe provide some flags for you to play around when running your code (for help on how to run the code, see here). For example, you can use the --lower_case flag to cast all words into lower case. You can also tune the laplace smoothing parameter by using the --laplace flag. You can also tune the --stemming and --pos_prior parameters. You should be able to boost the model performance up a bit by tuning these parameters. However, note that when we grade your MP, we will use our own flags (stemming, lower_case, pos_prior, laplace). grade.pyThis file, similar to MP1, is for you to be able to evaluate whether or not you have implemented the unigram and mixture models correctly. This file consists of three fundamental tests:
python3 grade.py
Files Used for Extra Credit
Do not modify the provided code. You will only have to modify tf_idf.py for the Extra Credit Part. Here is an example of how you would run your code for the Extra Credit Part from your terminal/shell: python3 run_tests_tfidf.py --training data/spam_data/train --development data/spam_data/dev What to SubmitSubmit only your naive_bayes.py file to Gradescope, under the assignment called MP1. |