Assigned: 2020-03-25
Due Date: 2020-03-31 by 11:59PM U.S. Central Time
You can find a worked example of Naive Bayes classification, written by a member of staff, here.
Here is last year's documentation, which is more thorough: https://courses.physics.illinois.edu/cs126/sp2019/assignments/naivebayes.pdf.
Handwriting recognition is the ability of a computer to interpret hand written text as the characters. In this assignment we will be trying to recognize numbers from images using a Naive Bayes classifier.
Naive Bayes classifiers are a family of probabilistic classifiers that are based on Bayes’ theorem. These algorithms work by combining the probabilities that an instance belongs to a class based on the value of a set of features. In this case, we will be testing if an image belongs to the class of the digits 0 to 9 based on the state of the pixels in the images.
It's not a requirement that you understand the math to complete this assignment, but learning a bit more about it might help.
You will need to submit your GitHub repository to Gradescope. There is a linter which is run on each submission; the results of the linter will be viewable by your code moderator when grading.
Get your copy of the assignment repo here: https://classroom.github.com/a/CD4zg1Uo.
The rubric can be found here.
We've changed the format of the CMake project a bit:
gflags
to help you parse command-line arguments.
See the apps/classify.cc
file for an example.
Using this library is not required.nlohmann/json
to help you (de)serialize your models
if you decide to use JSON. Again, using this library is not required.test-bayes
. Make sure
you only define CATCH_CONFIG_MAIN
in one test file.For this assignment you will also need a set of pre-labeled data files that we will use for training and validation data, which you can download here.
The .zip
file contains a few text files:
readme.txt
: a description of the files in the .zip
testimages
: 1000 images for validation/testing purposestestlabels
: the correct classification for each testing imagetrainingimages
: 5000 images for model training purposestraininglabels
: the correct classification for each training
imageDo not commit these files to your Git repo! Instead, make sure
they're blacklisted using the
.gitignore
. If do you
accidentally commit them to your repo, you should remove them with
git rm -r --cached
.
You're going to build a Naive Bayes classification model that can
classify unknown images as a number 0-9
. Each image consists of 28
lines of 28 ASCII characters, in other words, they're 28x28. An
individual character in an image will either be ' '
(a single space)
for a white pixel, a '+'
for a gray pixel, or a '#'
for a black
pixel. For simplicity's sake we're going to assume that white pixels
are "not shaded" (0
) and gray or black pixels are "shaded" (1
).
Each image has a classification in the corresponding labels
file;
one classification per line.
Design hint: you might find your code from the previous assignment for reading in files to be useful, as well as using operator overloading for certain classes.
So, we have a set of 28x28 images, and a set of corresponding labels
that tell us what number 0-9
that image is. What do we do with this
information?
Bayes' theorem is a statistical formula that describes the conditional probability that an event A will happen, given that some other event B already happened. It looks like this:
Or, "the probability that A will happen given that B happened, is the probability that B will happen given that A happened, times the probability that A will happen (independently), all divided by the probability that B will happen (independently)."
The goal of the training stage is to teach the computer the likelihood
that a pixel is either shaded (1
) or not
shaded (0
) for each one of the classes . In other words, we want be able to answer "what is the
probability that a given pixel is shaded, given that it's supposed to
be 0-9"?
You'll need to make a model that can answer questions like, "what is
the probability that pixel (3, 17)
is shaded when we were only
looking at images that were a 7
?" We've given you some starter code
to help you out with this, though you're allowed (and encouraged!) to
delete/modify/rename/move this code as you wish.
You can calculate this number like this (f represents "is shaded" or "is not shaded" as necessary):
Given how we will use these probabilities we need to ensure that they are not zero, since if they are, it will cause them to cancel out any non-zero probability from other features. To do this we will use a technique called Laplace Smoothing. This technique works by adding a small positive value k to the numerator above, and k·V to the denominator (where V is the number of possible values (shaded or not shaded) our feature can take on, so 2). The higher the value of k, the stronger the smoothing is. So for our binary features that would give us:
You can experiment with different values of k (say, from 0.1
to
10
) and find the one that gives the highest classification accuracy.
You also need to compute the priors, or the probability of each
class 0-9
occurring independently of any features in the
images. This is a simple calculation:
In other words, being able to answer questions like, "if I grab a
random image, what's the probability that I'll get a 9
?"
With the training done you will have an empirically generated set of probabilities that can be referred to as the model, which we will use to do classification on unknown data. This model, once generated, can be used in the future without needing to be regenerated; this is why we will require you to be able to save and load a calculated model to keep the runtime costs low.
Note: Do not train on the testing images! They are for validating the accuracy of your model only.
Testing hint: We have provided you with a tiny sample set of data in
your tests/data/
folder that you test the behavior of your code!
To classify the unknown images using the trained model you will perform maximum a posteriori (MAP) classification of the test data using the trained model. This technique computes the posterior probability of each class for the particular image, and then classifies the image as belonging to the class which has the highest posterior probability. Since we assume that all the probabilities are all independent of each other (this is the "naive" part in Naive Bayes), we can compute the probability that the image belongs to the class by simply multiplying all the probabilities together and dividing by the probability of the feature set. Finally, since we don’t actually need the probability but merely to be able to compare the values we can ignore the probability of the feature set and thus compute the value as follows:
There's a problem with doing this, though. These probabilities are all
going to be less than 1.0
, which means if you multiply them all
together they're going to get really small really fast and you'll run
into the problem of Arithmetic
Underflow. Since
we don't actually care about the value (only which one is the
largest), we can get around this by taking the log
and adding
instead of multiplying, so instead, we get:
where is the feature value of the input data. You can use the priors and probabilities from your model to calculate these posterior probabilites (look at Bayes' theorem!)
Once you have a posterior probability for your image for each class
from 0-9
you should classify the image as the class with the highest
posterior probability. For example for an input P, if we had the
posterior probabilities:
class | posterior probability
------+----------------------
0 | 0.3141
1 | 0.432
2 | 0.0
3 | 0.0
4 | 0.4
5 | 0.004
6 | 0.1
7 | 0.2
8 | 0.7
9 | 0.5
we'd classify P as an 8, because that's the class with the highest posterior probability.
What good is a classifier if you don't know how accurate it is? We've
given you a set of testingimages
and testinglabels
; use your
program to classify each of the testingimages
with your generated
model and compare the result to the actual label, and print out the
accuracy of your Naive Bayes classifier.
Any non-trivial enhancements and/or additional classification algorithms will be awarded extra credit. Here are some ideas that might interest you:
k
-Nearest NeighborsFeel free to use Machine Learning libraries for the extra credit portion.
You must have an executable that can both:
Here's a made-up example of the kind of things we'd like to be able to do with your program:
# Compute a model from trainingimages ./classify --train \ --images=data/trainingimages \ --labels=data/traininglabels # Use the computed model to classify some new, unknown images and use the labels check the accuracy ./classify \ --model=data/<your-computed-model> \ --images=data/testingimages \ --labels=data/testinglabels
You should go to Run > Edit Configurations
in order to change the program
arguments.
These names are made up and the actual order/method of handling these command line arguments doesn't matter, just implement some kind of command-line interface that allows you to train and classify. The sky's the limit.
You must not commit the training and validation data to your repo
You must follow the Google C++ Style Guide with regards to naming and whitespace.
You must have a set of unit tests for basic features of your classifier (Does your classifier hit a certain percentage of accuracy? Does it behave as expected for small sets of test data? etc.)