Assigned: 2020-10-06
Due Date: 2020-10-13 by 11:59PM U.S. Central Time
Why did the naive Bayesian suddenly feel patriotic when he heard fireworks?
He assumed independence.
— Some dude on Quora, idk: https://www.quora.com/What-are-some-good-machine-learning-jokes
Handwriting recognition is the ability of a computer to interpret handwritten text as Unicode characters. In this assignment, we will be attempting to recognize numbers from images using a Naive Bayes classifier.
Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem. These algorithms work by combining the probabilities that an instance belongs to each class, based on the values of a set of features. In this case, we will be classifying images of handwritten digits; each image depicts a number between 0 and 9, inclusive.
There will be workshops helping you understand the assignment and various concepts related to it (such as operator overloading, command line arguments, and the math behind Naive Bayes), so we strongly suggest that you attend them.
You will need to submit your GitHub repository to Gradescope.
Get your copy of the assignment here.
Make sure that you have completed the Cinder setup instructions from this Campuswire post.
Create a directory called my-projects in your ~/Cinder folder. Navigate to the my-projects folder and clone your repository within it, and then open the repository using CLion. You should be able to run three different targets; make sure that all three of them work. Important: your naivebayes repo should be in the following location: /path/to/cinder/my-projects/naivebayes-your-username. Again, you should be cloning your repository within my-projects rather than renaming your repo to my-projects. If your naivebayes repo is one directory above or below where it should be, you will not be able to build your project and will most likely get an "include could not find load file" error.
MacOS users: It's possible that you may come across a Permission denied error. In this case, try doing: Edit configurations -> Executable -> Select Other -> Find my-projects/<repo_name>/cmake-build-debug/Debug/<target_name>/Contents/MacOS/<target_name>, and then click run again.
You should make a branch called week1 in your repository, and only commit to this branch. Nothing should be committed to master. When you’re finished with week 1, you can open a pull request to request that your code be merged into master, and this pull request will be reviewed by your code moderator. (Then you’ll make a branch called week2 in the second week of the assignment, and go through the same process.)
We've changed the format of the CMake project a bit:
We’ve provided starter code for you. Please rename the files we asked you to rename. Note that anything related to a sketchpad (the provided code) or Cinder will be used for week 2. You don’t have to worry about it this week, but we suggest taking a look at it when thinking about how to design your code for week 1. You should only be writing code in include/core, src/core, and tests during week 1.
For this assignment, you will need a set of pre-labeled data files that we will use for training, which you can download here. The .zip file contains a few text files:
Do not commit these files to your Git repo! Instead, make sure they're blacklisted using the .gitignore. If you do accidentally commit them to your repo, you should remove them with git rm -r --cached <filename>.
You're going to build a Naive Bayes classification model that can be used to classify unknown images as numbers from 0 to 9. In the file, each image is represented by 28 lines of 28 ASCII characters, so each image contains 28 x 28 pixels. An individual character in an image will either be ' ' (a single space) for a white pixel, a '+' for a gray pixel, or a '#' for a black pixel. For simplicity, we suggest assuming that white pixels are "not shaded" and gray or black pixels are "shaded"; in other words, we suggest treating the gray and black pixels in the same way. For extra credit, you can differentiate between gray and black pixels when building your classifier.
Each image has a classification in the corresponding labels file; for example, the fourth 28 x 28 image’s correct classification (from trainingimages) would be the fourth number in traininglabels.
Your model should be able to train on arbitrary image sizes. For example, if we gave you a file of 5 x 5 images, your model should be able to handle that without any reconfiguration. You may assume that all of the images you receive for a single training session are of the same size. Also, your model only needs to be able to classify images that are of the same size as those in the training data. This flexibility, with the provided caveats, will come in handy when you’re writing tests.
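For instance, the suggestion above about treating gray and black pixels the same way could be captured with a small helper like this (the function name is purely illustrative, not part of any required interface):

```cpp
// Treats gray ('+') and black ('#') pixels as shaded, and anything else
// (normally a space) as unshaded.
bool IsShaded(char pixel) {
  return pixel == '+' || pixel == '#';
}
```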
So how do you know if your model is doing what it’s supposed to? You have to test it! Remember that we emphasize test-driven development in this class: writing tests first can help you understand what code you should write and which edge cases you should consider.
For this assignment, testing is particularly crucial — since you have not yet implemented the final classification step of Naive Bayes, you will not have any feedback on the correctness of your code without thorough testing. Your tests should include (but should not be limited to):
In order to do this, you should create your own dataset containing a smaller number of smaller images. This allows you to compute the expected outputs by hand and utilize those in your test cases.
Another helpful tip: comparing two floating point numbers for exact equality is unreliable and likely to fail. To deal with this, you may find this Catch2 feature useful: https://github.com/catchorg/Catch2/blob/master/docs/assertions.md#floating-point-comparisons
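For example, here is a minimal sketch using Approx, assuming the Catch2 v2 single header (and that CATCH_CONFIG_MAIN is defined in another test file):

```cpp
#include <catch2/catch.hpp>

TEST_CASE("Floating point values are compared with a tolerance") {
  double computed_prior = 1.0 / 3.0;

  // Approx avoids exact floating-point equality; margin() sets an
  // absolute tolerance (epsilon() would set a relative one instead).
  REQUIRE(computed_prior == Approx(0.3333).margin(0.0001));
}
```

In your own tests, the value on the left would come from your trained model (e.g. a prior or feature probability), and the value inside Approx would be the number you computed by hand from your small test dataset.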
In this assignment, you'll have your first exposure to operator overloading in C++; specifically, you’ll be overloading the >> operator to support reading in data.
You will be using file streams to read from the provided data files, and you will need to overload std::istream &operator>> to support reading in the set of training images. The reason why we prefer overloading >> rather than just creating a method that accepts the filename string as a parameter is that it's more flexible. With this design, you should be able to support receiving data from arbitrary sources, like network connections, although that's not something you have to worry about for this assignment.
When implementing this section, keep in mind that you’ll need to be able to support arbitrary image sizes.
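Here is a rough sketch of what such an overload could look like, assuming a hypothetical Image class that stores shaded/unshaded flags, and assuming square images whose lines are padded to the full image width (both assumptions; your design may well differ):

```cpp
#include <istream>
#include <string>
#include <vector>

namespace naivebayes {

// Hypothetical class holding one training image as a grid of
// shaded/unshaded flags.
class Image {
 public:
  friend std::istream& operator>>(std::istream& input, Image& image);

 private:
  std::vector<std::vector<bool>> pixels_;
};

std::istream& operator>>(std::istream& input, Image& image) {
  image.pixels_.clear();

  std::string line;
  if (!std::getline(input, line)) {
    return input;  // nothing left to read
  }

  // For square images whose lines are padded to full width, the length
  // of the first line tells us how many rows to expect.
  size_t image_size = line.size();
  for (size_t row = 0; row < image_size; ++row) {
    std::vector<bool> row_pixels;
    for (size_t col = 0; col < image_size; ++col) {
      char pixel = col < line.size() ? line[col] : ' ';
      // Gray ('+') and black ('#') are both treated as shaded.
      row_pixels.push_back(pixel == '+' || pixel == '#');
    }
    image.pixels_.push_back(row_pixels);

    // Read the next row, unless this was the last one.
    if (row + 1 < image_size && !std::getline(input, line)) {
      break;  // fewer rows than expected; stop early
    }
  }
  return input;
}

}  // namespace naivebayes
```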
Note: If you need more help understanding operator overloading, we strongly recommend attending the C++ workshops.
For the math behind this assignment, please refer to this document.
We expect you to develop a model that stores the feature and prior probabilities computed from our provided training files. Please note that this is only one of your deliverables; see the end of this document for a complete list.
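As a quick orientation, the two kinds of quantities your model stores typically look like the following. This is the standard Laplace-smoothed formulation of Naive Bayes with smoothing constant k and a binary shaded/unshaded feature per pixel -- an assumption on our part, so confirm the exact definitions against the linked math document.

```latex
% Prior probability of class c, with Laplace smoothing constant k:
\[
P(\text{class} = c) =
  \frac{k + \#\{\text{training images labeled } c\}}
       {\#\{\text{classes}\}\cdot k + \#\{\text{training images}\}}
\]

% Probability that pixel (i, j) is shaded given class c
% (the denominator uses 2k because the feature is binary):
\[
P(F_{i,j} = \text{shaded} \mid \text{class} = c) =
  \frac{k + \#\{\text{images labeled } c \text{ with pixel } (i,j) \text{ shaded}\}}
       {2k + \#\{\text{images labeled } c\}}
\]
```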
During the training stage, your program will have to read and process a total of 28 x 28 x 5000 = 3.92 million characters; in other words, there's a large computational cost associated with training. Regenerating our model every time we wanted to classify something is inefficient. To keep the runtime costs low, we will require you to be able to save a trained model to a file, and load that file back into a model at a later time -- hint: you may want to review how operator overloading works.
Here's some general advice: it'll be useful to ask yourself: what data is essential to the state of my model, and how can I represent it? Then, if you just write all of this information into a file, you should be able to read it back at a later time. Remember that the file containing the model does not have to be human-readable -- it just needs to be parseable by your program when loading the model back in from the file.
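A minimal save/load sketch, assuming a hypothetical Model class that keeps its prior probabilities in a vector (the member names and the whitespace-separated file format are illustrative choices, not requirements):

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <vector>

namespace naivebayes {

class Model {
 public:
  friend std::ostream& operator<<(std::ostream& output, const Model& model);
  friend std::istream& operator>>(std::istream& input, Model& model);

 private:
  std::vector<double> prior_probabilities_;  // one entry per class
  // Feature probabilities would be written/read the same way (omitted).
};

// Saves the model as plain whitespace-separated numbers. The format only
// needs to be understood by the operator>> below, not by humans.
std::ostream& operator<<(std::ostream& output, const Model& model) {
  output << model.prior_probabilities_.size() << '\n';
  for (double prior : model.prior_probabilities_) {
    output << prior << '\n';
  }
  return output;
}

// Loads a model by reading back exactly what operator<< wrote, in order.
std::istream& operator>>(std::istream& input, Model& model) {
  std::size_t num_classes = 0;
  input >> num_classes;
  model.prior_probabilities_.assign(num_classes, 0.0);
  for (std::size_t i = 0; i < num_classes; ++i) {
    input >> model.prior_probabilities_[i];
  }
  return input;
}

}  // namespace naivebayes
```

The key property is symmetry: operator>> reads back exactly what operator<< wrote, so round-tripping a model through a file leaves it unchanged -- which is also an easy thing to test.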
Instead of hardcoding the file paths of the training images, training labels, and location to save the model, you should allow the user to specify these file paths via command-line arguments. (In CLion, you can go to Run > Edit Configurations in order to change the program arguments).
You will need to process command line arguments and fail gracefully if the user inputs an incorrect number of command line arguments. We expect your code to handle the following using command line parsing:
For example, you could implement an interface that supports the following command line arguments for training the model on the given training image and label files and saving the resulting model to a specific file:
./train_model train data/mnistdatatraining/trainingimages data/mnistdatatraining/traininglabels save data/saved_model
You may use any command line parsing library you want or implement your own parsing library (implementation hint: you may want to think about what your code should do if the user provides incorrect command line arguments). Notice that the example we have above is quite simple and requires the filenames to be in a specific order; you may want to look into the following command line parsing libraries if you want a more robust interface.
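If you roll your own parsing, a bare-bones sketch for the positional interface shown above might look like the following (the argument count, positions, and keywords are assumptions taken from that example command; a real implementation should validate more thoroughly):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  // Expecting: train <trainingimages> <traininglabels> save <saved_model>
  // i.e. five arguments after the program name, as in the example above.
  constexpr int kExpectedArgCount = 6;
  if (argc != kExpectedArgCount || std::string(argv[1]) != "train" ||
      std::string(argv[4]) != "save") {
    std::cerr << "Usage: ./train_model train <trainingimages> "
                 "<traininglabels> save <saved_model>" << std::endl;
    return EXIT_FAILURE;  // fail gracefully instead of crashing
  }

  std::string images_path = argv[2];
  std::string labels_path = argv[3];
  std::string model_path = argv[5];

  // ... train a model from images_path and labels_path, then write it to
  // model_path using the overloaded stream operators ...

  return EXIT_SUCCESS;
}
```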
This rubric is not a comprehensive checklist. Please make sure to go over the feedback you received on your previous MPs and ask your moderator/post on Campuswire if you have any questions.
Also, be sure to read over the Google C++ Style Guide here -- some parts, especially with respect to naming conventions, are quite different from what you're used to, and you will be marked off if you don't follow the style guide.
Your grades for each section of the rubric will be weighted as follows: