
CS440/ECE448 Fall 2018

Assignment 4: Application of Naive Bayes

Due date: Monday October 22nd, 11:55pm

Updated By: Justin Lizama and Medhini Narasimhan

Spam emails are a common issue we all encounter! In this assignment, you will use the Naive Bayes algorithm to train a Spam classifier with a dataset of emails that is provided to you. The task is to learn a bag of words (unigram) model that will classify an email as Spam depending on the words present in it.

Contents

General guidelines

Basic instructions are the same as in MP 1. To summarize:

Your moodle submission should include:

  • Your report (formatted pdf).
  • A copy of naive_bayes.py containing all your new code
  • (Groups only) Statement of individual contribution.

Extra credit may be awarded for going beyond expectations or completing the suggestions below. Notice, however, that the score for each MP is capped at 110%.

Problem Statement

You are given a dataset consisting of Spam and Ham (not spam) emails. Using the training set, you will learn a Naive Bayes classifier that will predict the right class label given an unseen email. Use the development set to test the accuracy of your learned model. Report the accuracy, recall, and F1 score that you get on your development set. We will have a separate (unseen) test set that we will use to run your code after you turn it in. You may use NumPy in this MP to program your solution. Aside from that library, no other non-standard libraries may be used.
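For reference, the provided mp4.py computes these metrics for you, but the definitions are simple enough to sanity-check by hand. A minimal sketch (with hypothetical variable names, treating Spam = 1 as the positive class) might look like:

```python
def evaluate(predicted, actual):
    """Accuracy, precision, recall, and F1, treating spam (label 1) as the positive class."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))  # true positives
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))  # false positives
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))  # false negatives
    accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```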

Dataset

The dataset consists of 500 Spam and 500 Ham (not spam) emails. We have split this data set for you into 250 development examples and 750 training examples. Both the training set and development set have an even number of spam vs ham examples. That is, training set has 375 spam and 375 ham, and development set has 125 spam and 125 ham examples. The data set can be downloaded here: zip tar.

The dataset used for this assignment is a subset of the Enron-Spam dataset, provided by Ion Androutsopoulos.

Unigram Model

The bag of words model in NLP is a simple unigram model which considers a text to be represented as a bag of independent words. That is, we ignore the positions in which the words appear and only pay attention to their frequency in the text. Here each email consists of a group of words. Using Bayes' theorem, you need to compute the probability of an email being Spam given the words in the email. Thus you need to estimate the posterior probability:

$$P(\text{Type}=\text{Spam}\mid \text{Words}) = \frac{P(\text{Type}=\text{Spam})}{P(\text{Words})}\prod_{\text{All words}} P(\text{Word}\mid \text{Type}=\text{Spam})$$

It is standard practice to use log probabilities so as to avoid underflow. Also, $P(\text{Words})$ is just a constant, so it will not affect your computation. Lastly, we will be ignoring the prior $P(\text{Type}=\text{Spam})$ for this problem: Spam and Ham have the same prior probability because the data set is split evenly between the two. Therefore, we can ignore it and use MLE as opposed to MAP. With all of these facts in mind, you should instead compute:

$$\sum_{\text{All words}} \log P(\text{Word}\mid \text{Type}=\text{Spam})$$
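For concreteness, the resulting decision rule compares the summed log-likelihoods under each class. A minimal sketch follows; spam_log_likelihood, ham_log_likelihood, and the unk_* fallbacks are hypothetical names for quantities you estimate during training (see Part 1), not part of the provided code.

```python
def classify(email_words, spam_log_likelihood, ham_log_likelihood, unk_spam, unk_ham):
    """Label one email (a list of words) by comparing summed log-likelihoods.

    unk_spam / unk_ham are the smoothed log probabilities assigned to words
    that were never seen in the corresponding class during training.
    """
    spam_score = sum(spam_log_likelihood.get(w, unk_spam) for w in email_words)
    ham_score = sum(ham_log_likelihood.get(w, unk_ham) for w in email_words)
    return 1 if spam_score > ham_score else 0  # 1 = spam, 0 = ham
```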

Part 1: Training and Development

Before starting: Make sure you install the nltk package with 'pip install nltk'. Otherwise, the provided code will not run. More information about the package can be found here.
  • Training: To train the algorithm you are going to need to build a bag of words model using the emails. After you build the model, you will need to estimate the log-likelihoods log P(Word | Type=Spam). The variable Type can only take on two values, Spam or Ham. Additionally, you will need to smooth the likelihoods to prevent zero probabilities; use Laplace smoothing to accomplish this. (A minimal training sketch appears at the end of Part 1 below.)

  • Development: After you have computed the log-likelihoods, you will have your model predict whether or not the emails from the development set are spam. To do this, perform maximum likelihood classification using the sum of log probabilities shown above.

Use only the training set to learn the individual probabilities. Report the following results:

  1. Accuracy, recall, and F1 scores on both the training and development sets.
  2. The best Laplace smoothing parameter α.
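As referenced above, a minimal sketch of the training step follows. The helper name, the assumption that labels use 1 = spam / 0 = ham, and the particular way unseen words are handled are illustrative choices, not requirements of the provided skeleton.

```python
import math
from collections import Counter

def train_log_likelihoods(train_set, train_labels, alpha=1.0):
    """Estimate log P(Word | Type) per class with Laplace smoothing.

    train_set: list of emails, each a list of words (as produced by reader.py)
    train_labels: parallel list of labels, 1 = spam, 0 = ham
    alpha: Laplace smoothing parameter
    """
    spam_counts, ham_counts = Counter(), Counter()
    for words, label in zip(train_set, train_labels):
        (spam_counts if label == 1 else ham_counts).update(words)

    vocab = set(spam_counts) | set(ham_counts)

    def smoothed(counts):
        total = sum(counts.values())
        denom = total + alpha * (len(vocab) + 1)  # +1 reserves mass for unseen words
        log_likelihood = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
        unk = math.log(alpha / denom)  # log probability for words unseen at test time
        return log_likelihood, unk

    return smoothed(spam_counts), smoothed(ham_counts)
```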

Part 2: Stemming

Try using the NLTK Porter stemming package to stem the words as a pre-processing step. That is, when running mp4.py, add the --stemming flag. Report any changes you observe in the final accuracy, along with a possible explanation for these changes.
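To build intuition for why stemming changes the results, you can inspect what the Porter stemmer does to individual tokens (independent of the provided reader.py):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "ran", "easily", "offers"]:
    print(word, "->", stemmer.stem(word))
# "running" and "runs" both map to "run", so they are counted as one feature
```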

Extra Credit Suggestion

Implement the naive Bayes algorithm over a bigram model as opposed to the unigram model. Then combine the bigram model and the unigram model into a mixture model defined with parameter $\lambda$:

$$(1-\lambda)\,P(Y)\prod_{i=1}^{n}P(w_i\mid Y) \;+\; \lambda\,P(Y)\prod_{i=1}^{m}P(b_i\mid Y)$$

Find the best parameter $\lambda$ that gives the highest classification accuracy. Report the optimal $\lambda$ to the nearest hundredth, report your results on the bigram model and the optimal mixture model (a small numerical-stability sketch is given after the questions below), and answer the following questions:
  1. Running naive Bayes on the bigram model relaxes the naive assumption of the model a bit. However, is this always a good thing? Why or why not?
  2. What would happen if we did an N-gram model where N was a really large number?
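Note that the mixture above is a weighted sum of probabilities, so the two components cannot simply be added in log space. One numerically stable way to evaluate it, assuming you already have per-class unigram and bigram log scores, is NumPy's logaddexp (variable names are illustrative):

```python
import numpy as np

def mixture_log_score(uni_log_score, bi_log_score, lam):
    """log[(1 - lam) * exp(uni_log_score) + lam * exp(bi_log_score)],
    computed without leaving log space. Requires 0 < lam < 1."""
    return np.logaddexp(np.log1p(-lam) + uni_log_score,
                        np.log(lam) + bi_log_score)
```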

Provided Code Skeleton

We have provided (tar, zip) all the code to get you started on your MP, which means you will only have to implement the logic behind naive Bayes.

  • reader.py - This file is responsible for reading in the data set. It takes in all of the emails, splits each email into a list of words, stems them if you used the --stemming flag, and then stores all of the lists of words into a larger list, so that each individual list of words corresponds to a single email.

  • mp4.py - This is the main file that starts the program, and computes the accuracy, precision, recall, and F1-score using your implementation of naive Bayes.

  • naive_bayes.py - This is the file where you will be doing all of your work. The function naiveBayes() takes as input the training data, training labels, development set, and a smoothing parameter. The training data provided is the output from reader.py. The training labels are the list of labels corresponding to each email in the training data. The development set is the list of emails that you are going to test your implementation on. The smoothing parameter is the Laplace smoothing parameter you specified with --laplace (it is 1 by default). You will have naiveBayes() output the predicted labels for the development set from your naive Bayes model.
Do not modify the provided code. You will only have to modify naive_bayes.py.
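For orientation only, your naiveBayes() implementation might tie the earlier sketches together roughly as follows; the exact parameter names and order are defined by the provided skeleton, so check naive_bayes.py rather than relying on this signature.

```python
def naiveBayes(train_set, train_labels, dev_set, smoothing_parameter=1.0):
    """Return predicted labels (1 = spam, 0 = ham) for every email in dev_set,
    reusing the hypothetical training and classification helpers sketched above."""
    (spam_ll, spam_unk), (ham_ll, ham_unk) = train_log_likelihoods(
        train_set, train_labels, alpha=smoothing_parameter)
    return [classify(email, spam_ll, ham_ll, spam_unk, ham_unk)
            for email in dev_set]
```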

To understand more about how to run the MP, run python3 mp4.py -h in your terminal.

Report Checklist

Your report should briefly describe your implemented solution and fully answer the questions for every part of the assignment. Your description should focus on the most "interesting" aspects of your solution, i.e., any non-obvious implementation choices and parameter settings, and what you have found to be especially important for getting good performance. Feel free to include pseudocode or figures if they are needed to clarify your approach. Your report should be self-contained and it should (ideally) make it possible for us to understand your solution without having to run your source code.

WARNING: You will not get credit for any solutions that you have obtained, but not included in your report!

For example, if you implement the bigram model, and you do not mention it in your report, then you will not receive extra credit for it.

Your report must be a formatted pdf document. Pictures and/or example outputs should be incorporated into the document. Exception: items which are very large or unsuitable for inclusion in a pdf document (e.g. videos or animated gifs) may be put on the web and a URL included in your report.

For full credit, in addition to the algorithm descriptions, your report should include the following.

  1. State the accuracy, recall, and F1 scores on both the training and development sets.
  2. State your optimal Laplace smoothing parameter and how you chose it. Mention other possible parameter values you tried and the impact they had on your classification accuracy.
Additionally, provide answers to the following questions:
  1. We have said that this algorithm is called "naive" Bayes. What exactly is so naive about it?
  2. Naive Bayes can classify spam quite nicely, but can you imagine classification problems for which naive Bayes performs poorly? Give an example, and explain why naive Bayes may perform poorly.

Extra credit

We reserve the right to give bonus points for any challenging or creative solutions that you implement. This includes, but is not restricted to, the extra credit suggestion given above.