Assigned: 2019-10-23
Due Date: 2019-10-29 by 11:59PM U.S. Central Time
Essentially, all models are wrong, but some are useful.
— Attributed to statistician George E. P. Box
You're not quite sure, but you're beginning to suspect that Minisoft is beginning to change from its original aspirations as an indie game development company. The deal with Macrohard was a huge success, and now your managers have started using trendy buzzwords in their conversations. "Hey, does our product synergize with the cloud?" "I don't know, let me ask my Agile team to build a blockchain with machine learning and we'll find out!"
It looks like machine learning is the future! So, to stay ahead of the curve, you'll make your own classifier and show your managers why you deserve the big bucks.
Handwriting recognition is the ability of a computer to interpret handwritten text as characters. In this assignment, we will recognize digits from images using a Naive Bayes classifier.
Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem. These algorithms work by combining the probabilities that an instance belongs to a class based on the values of a set of features. In this case, we will be testing whether an image belongs to one of the digit classes 0 to 9 based on the state of the pixels in the image.
It's not a requirement that you understand the math to complete this assignment, but learning a bit more about it might help.
Get your copy of the assignment repo here: https://classroom.github.com/a/hF78NrvE
We've changed the format of the CMake project a bit:

- There is now a library called `libbayes`, where most of your code will probably go. `libbayes/src` corresponds to the source code of this library, and `libbayes/test` corresponds to the tests for this library.
- `main.cpp` and `libbayes/test/main.cpp` both link to this library, so any code that gets added to `libbayes` will be included in both executables.
- You no longer need to add individual `.cpp` files to both executables if you add them to `libbayes`. You can still add individual `.cpp` files or headers to either executable as you please.

For this assignment you will also need a set of pre-labeled data files that we will use for training and validation data, which you can download here.
The `.zip` file contains a few text files:

- `readme.txt`: a description of the files in the `.zip`
- `testimages`: 1000 images for validation/testing purposes
- `testlabels`: the correct classification for each testing image
- `trainingimages`: 5000 images for model training purposes
- `traininglabels`: the correct classification for each training image

Do not commit these files to your Git repo! Instead, make sure they're blacklisted using the `.gitignore`. If you do accidentally commit them to your repo, you should remove them with `git rm -r --cached`.
You're going to build a Naive Bayes classification model that can classify unknown images as a digit 0-9. Each image consists of 28 lines of 28 ASCII characters; in other words, the images are 28x28. An individual character in an image will be either `' '` (a single space) for a white pixel, `'+'` for a gray pixel, or `'#'` for a black pixel. For simplicity's sake, we're going to assume that white pixels are "not shaded" (`0`) and gray or black pixels are "shaded" (`1`).

Each image has a classification in the corresponding labels file, one classification per line.
Design hint: you might find your code from the previous assignment for reading in files to be useful, as well as using operator overloading for certain classes.
You may assume that the given training/testing data is valid. A more robust program would probably handle the case where the given data is malformed, but that's not a requirement of this assignment.
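To make the format concrete, here's a minimal sketch of the parsing step. The `Image` struct, the `kImageSize` constant, and the use of `operator>>` are our own illustrative choices, not part of the starter code; adapt the idea to whatever structure you end up with.

```cpp
#include <array>
#include <istream>
#include <string>

constexpr int kImageSize = 28;

// One 28x28 image stored as binary features: true = shaded, false = not shaded.
struct Image {
  std::array<std::array<bool, kImageSize>, kImageSize> pixels;
};

// Reads one ASCII-art image from a stream. ' ' maps to not shaded (0);
// '+' and '#' both map to shaded (1). Assumes the input is well formed,
// per the assignment's guarantee about the data files.
std::istream& operator>>(std::istream& input, Image& image) {
  std::string line;
  for (int row = 0; row < kImageSize; ++row) {
    std::getline(input, line);
    for (int col = 0; col < kImageSize; ++col) {
      // Some lines may omit trailing spaces; treat missing characters as white.
      char pixel = col < static_cast<int>(line.size()) ? line[col] : ' ';
      image.pixels[row][col] = (pixel == '+' || pixel == '#');
    }
  }
  return input;
}
```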
So, we have a set of 28x28 images and a set of corresponding labels that tell us which digit 0-9 each image represents. What do we do with this information?
Bayes' theorem is a statistical formula that describes the conditional probability that an event A will happen, given that some other event B already happened. It looks like this:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Or, "the probability that A will happen given that B happened, is the probability that B will happen given that A happened, times the probability that A will happen (independently), all divided by the probability that B will happen (independently)."
The goal of the training stage is to teach the computer the likelihood that a pixel is either shaded (`1`) or not shaded (`0`) for each one of the classes 0-9. In other words, we want to be able to answer "what is the probability that a given pixel is shaded, given that the image is supposed to be a 0-9?"
You'll need to make a model that can answer questions like, "what is the probability that pixel (3, 17) is shaded when we only look at images that are a 7?" We've given you some starter code to help you out with this, though you're allowed (and encouraged!) to delete/modify/rename/move this code as you wish.
You can calculate this number like this (f represents "is shaded" or "is not shaded" as necessary):

$$P(F_{i,j} = f \mid \text{class} = c) = \frac{\#\{\text{images of class } c \text{ where } F_{i,j} = f\}}{\#\{\text{images of class } c\}}$$
Given how we will use these probabilities, we need to ensure that none of them is zero: a single zero probability would cancel out every non-zero probability it is multiplied with. To avoid this we will use a technique called Laplace smoothing, which adds a small positive value k to the numerator above and k·V to the denominator, where V is the number of possible values our feature can take on (shaded or not shaded, so V = 2). The higher the value of k, the stronger the smoothing. For our binary features, that gives us:

$$P(F_{i,j} = f \mid \text{class} = c) = \frac{k + \#\{\text{images of class } c \text{ where } F_{i,j} = f\}}{2k + \#\{\text{images of class } c\}}$$
You can experiment with different values of k (say, from 0.1 to 10) and find the one that gives the highest classification accuracy.
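Here's what that formula looks like as code. This is a minimal sketch: the function name and the idea of passing raw counts are our own choices, and where those counts come from is up to your design.

```cpp
#include <cstddef>

// Laplace-smoothed estimate of P(F = f | class = c) for one binary feature.
// count_f_given_c: number of images of class c where the feature equals f.
// count_c:         total number of images of class c.
// k:               smoothing constant; the denominator adds 2 * k because
//                  V = 2 for a binary (shaded / not shaded) feature.
double SmoothedProbability(std::size_t count_f_given_c, std::size_t count_c,
                           double k) {
  return (k + count_f_given_c) / (2 * k + count_c);
}
```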
You also need to compute the priors: the probability of each class 0-9 occurring independently of any features in the images. This is a simple calculation:

$$P(\text{class} = c) = \frac{\#\{\text{training images of class } c\}}{\#\{\text{training images}\}}$$
In other words, this lets you answer questions like, "if I grab a random image, what's the probability that I'll get a 9?"
With the training done, you will have an empirically generated set of probabilities, referred to as the model, which we will use to classify unknown data. Once generated, this model can be reused without being regenerated; this is why we require you to be able to save and load a calculated model, keeping the runtime costs low.
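The design hint about operator overloading pays off here: if your model overloads the stream operators, saving and loading fall out naturally. The `Model` layout below is hypothetical, and a plain whitespace-separated text format is just one reasonable choice:

```cpp
#include <array>
#include <istream>
#include <ostream>

// A hypothetical model layout: one prior per class, plus P(shaded | class)
// for each of the 28x28 pixels. Your own structure may differ.
struct Model {
  std::array<double, 10> priors;
  // probabilities[c][row][col] = P(pixel (row, col) is shaded | class c)
  std::array<std::array<std::array<double, 28>, 28>, 10> probabilities;
};

// Writes the model as whitespace-separated numbers...
std::ostream& operator<<(std::ostream& output, const Model& model) {
  output.precision(17);  // keep enough digits for a lossless round trip
  for (double prior : model.priors) output << prior << ' ';
  for (const auto& class_probs : model.probabilities)
    for (const auto& row : class_probs)
      for (double p : row) output << p << ' ';
  return output;
}

// ...and operator>> reads the same format back in.
std::istream& operator>>(std::istream& input, Model& model) {
  for (double& prior : model.priors) input >> prior;
  for (auto& class_probs : model.probabilities)
    for (auto& row : class_probs)
      for (double& p : row) input >> p;
  return input;
}
```

With this in place, saving is just `file << model` on an `std::ofstream`, and loading is `file >> model` on an `std::ifstream`.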
Note: Do not train on the testing images! They are for validating the accuracy of your model only.
Testing hint: We have provided you with a tiny sample set of data in
your libbayes/test/resources/
folder that you can use to make sure
your code is working!
To classify the unknown images, you will perform maximum a posteriori (MAP) classification of the test data using the trained model. This technique computes the posterior probability of each class for the particular image, and then classifies the image as belonging to the class with the highest posterior probability. Since we assume that the probabilities are all independent of each other (this is the "naive" part of Naive Bayes), we can compute the probability that the image belongs to a class by simply multiplying all the probabilities together and dividing by the probability of the feature set. Finally, since we don't actually need the probability itself, only to be able to compare the values, we can ignore the probability of the feature set and compute the value as follows:

$$\text{posterior}(c) = P(\text{class} = c) \prod_{i,j} P(F_{i,j} = f_{i,j} \mid \text{class} = c)$$
There's a problem with doing this, though. These probabilities are all going to be less than 1.0, which means if you multiply them all together they're going to get really small really fast, and you'll run into the problem of Arithmetic Underflow. Since we don't actually care about the value (only which one is the largest), we can get around this by taking the log of each probability and adding instead of multiplying, giving us:

$$\log \text{posterior}(c) = \log P(\text{class} = c) + \sum_{i,j} \log P(F_{i,j} = f_{i,j} \mid \text{class} = c)$$
where $f_{i,j}$ is the feature value of the input data at pixel (i, j). You can use the priors and probabilities from your model to calculate these posterior probabilities (look at Bayes' theorem!).
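Putting the pieces together, a MAP classifier can be a single argmax loop over the log posteriors. This sketch reuses the hypothetical `Model`, `Image`, and `kImageSize` names from the earlier sketches; your own structure may differ:

```cpp
#include <cmath>
#include <limits>

// Returns the class 0-9 with the highest log posterior for `image`.
int Classify(const Model& model, const Image& image) {
  int best_class = 0;
  double best_score = -std::numeric_limits<double>::infinity();
  for (int c = 0; c < 10; ++c) {
    // log P(class = c), plus one log term per pixel feature.
    double score = std::log(model.priors[c]);
    for (int row = 0; row < kImageSize; ++row) {
      for (int col = 0; col < kImageSize; ++col) {
        double p_shaded = model.probabilities[c][row][col];
        // For a binary feature, P(not shaded | c) = 1 - P(shaded | c).
        double p = image.pixels[row][col] ? p_shaded : 1.0 - p_shaded;
        score += std::log(p);
      }
    }
    if (score > best_score) {
      best_score = score;
      best_class = c;
    }
  }
  return best_class;
}
```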
Once you have a posterior probability for your image for each class from 0-9, you should classify the image as the class with the highest posterior probability. For example, for an input image P, if we had the posterior probabilities:
class | posterior probability
------+----------------------
0 | 0.3141
1 | 0.432
2 | 0.0
3 | 0.0
4 | 0.4
5 | 0.004
6 | 0.1
7 | 0.2
8 | 0.7
9 | 0.5
we'd classify P as an 8, because that's the class with the highest posterior probability.
What good is a classifier if you don't know how accurate it is? We've given you a set of `testimages` and `testlabels`; use your program to classify each of the `testimages` with your generated model, compare the results to the actual labels, and print out the accuracy of your Naive Bayes classifier.
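Accuracy is then just the fraction of correct predictions. A minimal sketch, again reusing the hypothetical types and `Classify` function from the earlier sketches:

```cpp
#include <cstddef>
#include <vector>

// Fraction of test images whose predicted class matches the true label.
double ComputeAccuracy(const Model& model, const std::vector<Image>& images,
                       const std::vector<int>& labels) {
  std::size_t correct = 0;
  for (std::size_t i = 0; i < images.size(); ++i) {
    if (Classify(model, images[i]) == labels[i]) {
      ++correct;
    }
  }
  return static_cast<double>(correct) / images.size();
}
```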
You must have a `main.cpp` that can train a model from labeled training data, save and load a trained model, and use it to classify unlabeled images.
Here's a made-up example of the kind of things we'd like to be able to do with your program:
```
# Compute a model from trainingimages
./naivebayes trainingimages traininglabels

# Use the computed model to classify some new, unknown images
# and use the labels to check the accuracy
./naivebayes mycomputedmodel testingimages testinglabels
```
These names are made up and the actual order/method of handling these command line arguments doesn't matter, just implement some kind of command-line interface that allows you to train and classify. You could even split up the training and classification steps into two separate executables if you wanted! The sky's the limit.
- You must not commit the training and validation data to your repo.
- You must follow the Google C++ Style Guide with regards to naming and whitespace.
- You must have a set of unit tests for the basic features of your classifier. (Does your classifier hit a certain percentage of accuracy? Does it behave as expected for small sets of test data? etc.)
Here's a worked example of Naive Bayes classification, written by a member of staff: https://courses.engr.illinois.edu/cs126/sp2019/assignments/naivebayes-example.pdf
Here is last semester's documentation, which is more thorough: https://courses.physics.illinois.edu/cs126/sp2019/assignments/naivebayes.pdf