CS440/ECE448 Fall 2018
Assignment 4: Application of Naive Bayes
Due date: Monday October 22nd, 11:55pm
Updated By: Justin Lizama and Medhini Narasimhan
Spam emails are a common issue we all encounter! In this assignment, you will
use the Naive Bayes algorithm to train a Spam classifier with a dataset of
emails that is provided to you. The task is to learn a bag of words (unigram)
model that will classify an email as Spam depending on the words present in it.
General guidelines
Basic instructions are the same as in MP 1. To summarize:
Your moodle submission should include:
- Your report (formatted pdf).
- A copy of naive_bayes.py containing all your new code
- (Groups only) Statement of individual contribution.
Extra credit may be awarded for going beyond expectations or completing the suggestions below. Notice, however, that the score for each MP
is capped at 110%.
You are given a dataset consisting of Spam and Ham (not spam) emails.
Using the training set, you will learn a Naive Bayes classifier that will predict the right class
label given an unseen email. Use the development set to test the accuracy of your learned model.
Report the accuracy, recall, and F1-Score that you get on your development set.
We will have a separate (unseen) test set that we will use to run your code after you turn it in.
You may use Numpy in this MP to program your solution. Aside from that library,
no other non-standard libraries may be used.
The dataset consists of 500 Spam and 500 Ham (not spam) emails.
We have split this data set for you into 250 development examples and 750 training examples.
Both the training set and development set have an even number of spam vs ham examples.
That is, training set has 375 spam and 375 ham, and development set has 125 spam and 125 ham examples.
The data set can be downloaded here: zip, tar.
The dataset used for this assignment is a subset of the Enron-Spam dataset,
provided by Ion Androutsopoulos.
The bag of words model in NLP is a simple unigram model which considers a text to
be represented as a bag of independent words. That is, we ignore the position
the words appear in, and only pay attention to their frequency in the text.
Here each email consists of a group of words. Using Bayes theorem, you need to
compute the probability of an email being Spam given the words in the email.
Thus you need to estimate the posterior probabilities:
P(\text{Type}=\text{Spam} \mid \text{Words}) = \frac{P(\text{Type}=\text{Spam})}{P(\text{Words})} \prod_{\text{All words}} P(\text{Word} \mid \text{Type}=\text{Spam})
It is standard practice to use the log probabilities so as to avoid underflow.
Also, P(Words) is just a constant, so it will not affect your computation.
Lastly, we will be ignoring the prior P(Type=Spam) for this problem. This is because
Spam and Ham have the same prior probability, since the
data set is split evenly between the two. Therefore, we can ignore it and use MLE as opposed to MAP.
With all of these facts in mind, you should instead compute:
\sum_{\text{All words}} \log P(\text{Word} \mid \text{Type}=\text{Spam})
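For concreteness, here is a minimal sketch of the resulting decision rule in Python. The names (log_spam, log_ham, unk_spam, unk_ham) are illustrative, not part of the provided skeleton; they stand for dictionaries of smoothed log-likelihoods and the log-probabilities assigned to unseen words.

```python
def classify(words, log_spam, log_ham, unk_spam, unk_ham):
    """Label one email by comparing summed log-likelihoods under Spam and Ham.

    words: list of tokens for a single email.
    log_spam / log_ham: word -> smoothed log P(Word|Type) dictionaries.
    unk_spam / unk_ham: smoothed log-probability for words unseen in training.
    """
    spam_score = sum(log_spam.get(w, unk_spam) for w in words)
    ham_score = sum(log_ham.get(w, unk_ham) for w in words)
    return 1 if spam_score > ham_score else 0  # 1 = Spam, 0 = Ham (label convention assumed)
```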
Part 1: Training and Development
Before starting: Make sure you install the nltk package with 'pip install nltk'.
Otherwise, the provided code will not run. More information about the package
can be found here.
- Training: To train the algorithm you are going to need to build
a bag of words model using the emails. After you build the model, you will
need to estimate the log-likelihoods logP(Word|Type=Spam).
The variable Type can only take on two values, Spam or Ham.
Additionally, you will need to smooth the likelihoods to prevent
zero probabilities. To accomplish this, use Laplace smoothing (a code sketch of this step appears after this list).
- Development: After you have computed the log-likelihoods, you will
have your model predict whether or not the emails from the development set are spam.
In order to do this, you will do maximum likelihood estimation classification using
the sum of log probabilities shown above.
Use only the training set to learn the individual probabilities.
Report the following results:
- Accuracy, recall, and F1 scores on both the training and development sets.
- The best Laplace smoothing parameter α
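The sketch below shows one way to estimate the Laplace-smoothed log-likelihoods from the training data. It assumes spam emails are labeled 1 and ham 0, and reserves one extra smoothed slot for unseen words; the function name and return format are illustrative, not the required naiveBayes() interface.

```python
import math
from collections import Counter

def train_log_likelihoods(train_set, train_labels, smoothing=1.0):
    """Estimate Laplace-smoothed log P(Word | Type) for Spam and Ham.

    train_set: list of word lists (one per email), as produced by reader.py.
    train_labels: matching labels (assumed here: 1 = Spam, 0 = Ham).
    smoothing: the Laplace smoothing parameter.
    """
    spam_counts, ham_counts = Counter(), Counter()
    for words, label in zip(train_set, train_labels):
        (spam_counts if label == 1 else ham_counts).update(words)

    vocab = set(spam_counts) | set(ham_counts)

    def smoothed(counts):
        # Add-alpha smoothing; the extra "+ 1" in the denominator reserves
        # probability mass for words never seen in this class.
        denom = sum(counts.values()) + smoothing * (len(vocab) + 1)
        log_like = {w: math.log((counts[w] + smoothing) / denom) for w in vocab}
        unk = math.log(smoothing / denom)  # log-probability of an unseen word
        return log_like, unk

    return smoothed(spam_counts), smoothed(ham_counts)
```

One reasonable way to choose the smoothing parameter is to sweep a handful of values (for example 0.001, 0.01, 0.1, 1, 10), retrain, and keep whichever gives the best development-set accuracy.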
Part 2: Stemming
Try using the NLTK porter stemming package
to stem the words as a pre-processing step. That is, when running mp4.py add the
--stemming flag. Report any changes you observe in the final accuracy, along with
a possible explanation for these changes.
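reader.py already applies stemming for you when --stemming is passed, but if you want to see what the Porter stemmer does to individual words, a minimal NLTK example looks like this:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["winning", "prizes", "claimed"]:
    print(word, "->", stemmer.stem(word))  # e.g. winning -> win, prizes -> prize, claimed -> claim
```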
Extra Credit Suggestion
Implement the naive Bayes algorithm over a bigram model as opposed to the unigram model. Then combine the bigram model and the unigram model into a mixture model defined with parameter λ:
(1-\lambda)\, P(Y) \prod_{i=1}^{n} P(w_i \mid Y) + \lambda\, P(Y) \prod_{i=1}^{m} P(b_i \mid Y)
Find the parameter λ that gives the highest classification accuracy. Report the optimal λ to the nearest hundredth, report your results on the bigram model and the optimal mixture model, and answer the following questions:
- Running naive Bayes on the bigram model relaxes the naive assumption of the model a bit. However, is this always a good thing? Why or why not?
- What would happen if we did an N-gram model where N was a really large number?
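Because the mixture is a weighted sum of probabilities (not of log-probabilities), combining the two models in log space takes a little care. One hedged sketch, assuming you already have the summed unigram and bigram log-likelihoods for a class, uses a numerically stable log-sum-exp:

```python
import math

def mixture_log_score(uni_log, bi_log, lam):
    """Stable log of (1 - lam) * exp(uni_log) + lam * exp(bi_log).

    uni_log / bi_log: summed log-probabilities of one email under the unigram
    and bigram models for a single class. Assumes 0 < lam < 1; handle the
    endpoints separately if you sweep a full grid of lambda values.
    """
    a = math.log(1.0 - lam) + uni_log
    b = math.log(lam) + bi_log
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```

Compute this score for Spam and Ham separately and predict the larger one; sweep λ on the development set to find the best value.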
Provided Code Skeleton
We have provided
( tar
zip)
all the code to get you started on your MP, which
means you will only have to implement the logic behind naive Bayes.
- reader.py - This file is responsible for reading in the data set.
It takes in all of the emails, splits each email into a list of words,
stems them if you used the --stemming flag, and then stores all of the lists of
words in a larger list (so that each individual list of words corresponds to a single email).
- mp4.py - This is the main file that starts the program, and computes the
accuracy, precision, recall, and F1-score using your implementation of naive Bayes.
- naive_bayes.py - This is the file where you will be doing all of your work.
The function naiveBayes() takes as input the training data, training labels, development set,
and a smoothing parameter. The training data provided is the output from reader.py.
The training labels is the list of labels corresponding to each email in the training data.
The development set is the list of emails that you are going to test your implementation on.
The smoothing parameter is the Laplace smoothing parameter you specified with --laplace (it is 1 by default).
You will have naiveBayes() output the predicted labels for the development set from your naive Bayes model (see the stub sketched below).
Do not modify the provided code. You will only have to modify naive_bayes.py.
To understand more about how to run the MP, run python3 mp4.py -h in your terminal.
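For reference, the function you fill in is shaped roughly like the stub below. The parameter names are illustrative; keep the exact signature given in the provided naive_bayes.py.

```python
def naiveBayes(train_set, train_labels, dev_set, smoothing_parameter=1.0):
    """Return one predicted label per email in dev_set.

    train_set / dev_set: lists of word lists from reader.py.
    train_labels: labels for train_set.
    smoothing_parameter: Laplace smoothing value passed via --laplace (default 1).
    """
    # 1. Count word frequencies per class on train_set and Laplace-smooth them.
    # 2. For each dev email, sum log-likelihoods under Spam and under Ham.
    # 3. Predict whichever class scores higher.
    return [0 for _ in dev_set]  # placeholder: replace with real predictions
```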
Your report should briefly describe
your implemented solution and fully answer the questions for every part of the assignment. Your description
should focus on the most "interesting" aspects of your solution, i.e., any non-obvious
implementation choices and parameter settings, and what you have found to be especially
important for getting good performance. Feel free to include pseudocode or figures if
they are needed to clarify your approach. Your report should be self-contained and it should
(ideally) make it possible for us to understand your solution without having to run your source code.
WARNING: You will not get credit for any solutions that you have obtained,
but not included in your report!
For example, if you implement the bigram model, and you do not mention it in your report, then you will not receive extra credit for it.
Your report must be a formatted pdf document.
Pictures and/or example outputs
should be incorporated into the document.
Exception: items which are very large or unsuitable for inclusion in a pdf document
(e.g. videos or animated gifs) may be put on the web and a URL included in your report.
For full credit, in addition to the algorithm descriptions, your report should include the following.
- State the accuracy, recall, and F1 scores on both the training and development sets.
- State your optimal Laplace smoothing parameter, and how you chose it. Mention other parameter values you tried and the impact they had on your classification accuracy.
Additionally, provide answers to the following questions:
- We have said that this algorithm is called "naive" Bayes. What exactly is so naive about it?
- Naive Bayes can classify spam quite nicely, but can you imagine classification
problems for which naive Bayes performs poorly? Give an example, and explain why naive Bayes may perform poorly.
Extra credit
We reserve the right to give bonus points for any challenging or creative solutions that you implement.
This includes, but is not restricted to, the extra credit suggestion given above.