Assigned: 2020-10-06
Due Date: 2020-10-13 by 11:59PM U.S. Central Time
Why did the naive Bayesian suddenly feel patriotic when he heard fireworks?
He assumed independence.
— Some dude on Quora, idk: https://www.quora.com/What-are-some-good-machine-learning-jokes
Handwriting recognition is the ability of a computer to interpret handwritten text as Unicode characters. In this assignment, we will be attempting to recognize numbers from images using a Naive Bayes classifier.
Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem. These algorithms work by combining the probabilities that an instance belongs to each class, based on the values of a set of features. In this case, we will be classifying images of handwritten digits; each image depicts a number between 0 and 9, inclusive.
There will be workshops helping you understand the assignment and various concepts related to it (such as operator overloading, command line arguments, and the math behind Naive Bayes), so we strongly suggest that you attend them.
You will need to submit your GitHub repository to Gradescope.
Get your copy of the assignment here.
Make sure that you have completed the Cinder setup instructions from this Campuswire post.
Create a directory called my-projects in your ~/Cinder folder. Navigate to the my-projects folder and clone your repository within it, and then open the repository using CLion. You should be able to run three different targets; make sure that all three of them work. Important: your naivebayes repo should be in the following location: /path/to/cinder/my-projects/naivebayes-your-username. Again, you should be cloning your repository within my-projects rather than renaming your repo to my-projects. If your naivebayes repo is one directory above or below where it should be, you will not be able to build your project and will most likely get an "include could not find load file" error.
MacOS users: It's possible that you may come across a Permission denied error. In this case, try doing: Edit configurations -> Executable -> Select Other -> Find my-projects/<repo_name>/cmake-build-debug/Debug/<target_name>/Contents/MacOS/<target_name>, and then click run again.
You should make a branch called week1 in your repository, and only commit to this branch. Nothing should be committed to master. When you’re finished with week 1, you can open a pull request to request that your code be merged into master, and this pull request will be reviewed by your code moderator. (Then you’ll make a branch called week2 in the second week of the assignment, and go through the same process.)
We've changed the format of the CMake project a bit:
We’ve provided starter code for you. Please rename the files we asked you to rename. Note that anything related to a sketchpad (the provided code) or Cinder will be used for week 2. You don’t have to worry about it this week, but we suggest taking a look at it when thinking about how to design your code for week 1. You should only be writing code in include/core, src/core, and tests during week 1.
For this assignment, you will need a set of pre-labeled data files that we will use for training, which you can download here. The .zip file contains a few text files:
Do not commit these files to your Git repo! Instead, make sure they're blacklisted using the .gitignore. If you do accidentally commit them to your repo, you should remove them with git rm -r --cached <filename>.
You're going to build a Naive Bayes classification model that can be used to classify unknown images as numbers from 0 to 9. In the file, each image is represented by 28 lines of 28 ASCII characters, so each image contains 28 x 28 pixels. An individual character in an image will either be ' ' (a single space) for a white pixel, a '+' for a gray pixel, or a '#' for a black pixel. For simplicity, we suggest assuming that white pixels are "not shaded" and gray or black pixels are "shaded"; in other words, we suggest treating the gray and black pixels in the same way. For extra credit, you can differentiate between gray and black pixels when building your classifier.
Each image has a classification in the corresponding labels file; for example, the fourth 28 x 28 image’s correct classification (from trainingimages) would be the fourth number in traininglabels.
Your model should be able to train on arbitrary image sizes. For example, if we gave you a file of 5 x 5 images, your model should be able to handle that without any reconfiguration. You may assume that all of the images you receive for a single training session are of the same size. Also, your model only needs to be able to classify images that are of the same size as those in the training data. This flexibility, with the provided caveats, will come in handy when you’re writing tests.
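For instance, the suggestion above about treating gray and black pixels the same way could be captured with a small helper like this (the function name is purely illustrative, not part of any required interface):

```cpp
// Treats gray ('+') and black ('#') pixels as shaded, and anything else
// (normally a space) as unshaded.
bool IsShaded(char pixel) {
  return pixel == '+' || pixel == '#';
}
```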
So how do you know if your model is doing what it’s supposed to? You have to test it! Remember that we emphasize test-driven development in this class: writing tests first can help you understand what code you should write and which edge cases you should consider.
For this assignment, testing is particularly crucial — since you have not yet implemented the final classification step of Naive Bayes, you will not have any feedback on the correctness of your code without thorough testing. Your tests should include (but should not be limited to):
In order to do this, you should create your own dataset containing a smaller number of smaller images. This allows you to compute the expected outputs by hand and utilize those in your test cases.
Another helpful tip: comparing two floating point numbers for exact equality is unreliable and likely to fail. To deal with this, you may find this Catch2 feature useful: https://github.com/catchorg/Catch2/blob/master/docs/assertions.md#floating-point-comparisons
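For example, here is a minimal sketch using Approx, assuming the Catch2 v2 single header (and that CATCH_CONFIG_MAIN is defined in another test file):

```cpp
#include <catch2/catch.hpp>

TEST_CASE("Floating point values are compared with a tolerance") {
  double computed_prior = 1.0 / 3.0;

  // Approx avoids exact floating-point equality; margin() sets an
  // absolute tolerance (epsilon() would set a relative one instead).
  REQUIRE(computed_prior == Approx(0.3333).margin(0.0001));
}
```

In your own tests, the value on the left would come from your trained model (e.g. a prior or feature probability), and the value inside Approx would be the number you computed by hand from your small test dataset.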
In this assignment, you'll have your first exposure to operator overloading in C++; specifically, you’ll be overloading the >> operator to support reading in data.
You will be using file streams to read from the provided data files, and you will need to overload std::istream &operator>> to support reading in the set of training images. The reason why we prefer overloading >> rather than just creating a method that accepts the filename string as a parameter is that it's more flexible. With this design, you should be able to support receiving data from arbitrary sources, like network connections, although that's not something you have to worry about for this assignment.
When implementing this section, keep in mind that you’ll need to be able to support arbitrary image sizes.
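Here is a rough sketch of what such an overload could look like, assuming a hypothetical Image class that stores shaded/unshaded flags, and assuming square images whose lines are padded to the full image width (both assumptions; your design may well differ):

```cpp
#include <istream>
#include <string>
#include <vector>

namespace naivebayes {

// Hypothetical class holding one training image as a grid of
// shaded/unshaded flags.
class Image {
 public:
  friend std::istream& operator>>(std::istream& input, Image& image);

 private:
  std::vector<std::vector<bool>> pixels_;
};

std::istream& operator>>(std::istream& input, Image& image) {
  image.pixels_.clear();

  std::string line;
  if (!std::getline(input, line)) {
    return input;  // nothing left to read
  }

  // For square images whose lines are padded to full width, the length
  // of the first line tells us how many rows to expect.
  size_t image_size = line.size();
  for (size_t row = 0; row < image_size; ++row) {
    std::vector<bool> row_pixels;
    for (size_t col = 0; col < image_size; ++col) {
      char pixel = col < line.size() ? line[col] : ' ';
      // Gray ('+') and black ('#') are both treated as shaded.
      row_pixels.push_back(pixel == '+' || pixel == '#');
    }
    image.pixels_.push_back(row_pixels);

    // Read the next row, unless this was the last one.
    if (row + 1 < image_size && !std::getline(input, line)) {
      break;  // fewer rows than expected; stop early
    }
  }
  return input;
}

}  // namespace naivebayes
```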
Note: If you need more help understanding operator overloading, we strongly recommend attending the C++ workshops.
For the math behind this assignment, please refer to this document.
We expect you to develop a model that stores the feature and prior probabilities computed from our provided training files. Please note that this is only one of your deliverables; see the end of this document for a complete list.
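As a quick orientation, the two kinds of quantities your model stores typically look like the following. This is the standard Laplace-smoothed formulation of Naive Bayes with smoothing constant k and a binary shaded/unshaded feature per pixel -- an assumption on our part, so confirm the exact definitions against the linked math document.

```latex
% Prior probability of class c, with Laplace smoothing constant k:
\[
P(\text{class} = c) =
  \frac{k + \#\{\text{training images labeled } c\}}
       {\#\{\text{classes}\}\cdot k + \#\{\text{training images}\}}
\]

% Probability that pixel (i, j) is shaded given class c
% (the denominator uses 2k because the feature is binary):
\[
P(F_{i,j} = \text{shaded} \mid \text{class} = c) =
  \frac{k + \#\{\text{images labeled } c \text{ with pixel } (i,j) \text{ shaded}\}}
       {2k + \#\{\text{images labeled } c\}}
\]
```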
During the training stage, your program will have to read and process a total of 28 x 28 x 5000 = 3.92 million characters; in other words, there's a large computational cost associated with training. Regenerating our model every time we wanted to classify something is inefficient. To keep the runtime costs low, we will require you to be able to save a trained model to a file, and load that file back into a model at a later time -- hint: you may want to review how operator overloading works.
Here's some general advice: it'll be useful to ask yourself: what data is essential to the state of my model, and how can I represent it? Then, if you just write all of this information into a file, you should be able to read it back at a later time. Remember that the file containing the model does not have to be human-readable -- it just needs to be parseable by your program when loading the model back in from the file.
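A minimal save/load sketch, assuming a hypothetical Model class that keeps its prior probabilities in a vector (the member names and the whitespace-separated file format are illustrative choices, not requirements):

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <vector>

namespace naivebayes {

class Model {
 public:
  friend std::ostream& operator<<(std::ostream& output, const Model& model);
  friend std::istream& operator>>(std::istream& input, Model& model);

 private:
  std::vector<double> prior_probabilities_;  // one entry per class
  // Feature probabilities would be written/read the same way (omitted).
};

// Saves the model as plain whitespace-separated numbers. The format only
// needs to be understood by the operator>> below, not by humans.
std::ostream& operator<<(std::ostream& output, const Model& model) {
  output << model.prior_probabilities_.size() << '\n';
  for (double prior : model.prior_probabilities_) {
    output << prior << '\n';
  }
  return output;
}

// Loads a model by reading back exactly what operator<< wrote, in order.
std::istream& operator>>(std::istream& input, Model& model) {
  std::size_t num_classes = 0;
  input >> num_classes;
  model.prior_probabilities_.assign(num_classes, 0.0);
  for (std::size_t i = 0; i < num_classes; ++i) {
    input >> model.prior_probabilities_[i];
  }
  return input;
}

}  // namespace naivebayes
```

The key property is symmetry: operator>> reads back exactly what operator<< wrote, so round-tripping a model through a file leaves it unchanged -- which is also an easy thing to test.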
Instead of hardcoding the file paths of the training images, training labels, and location to save the model, you should allow the user to specify these file paths via command-line arguments. (In CLion, you can go to Run > Edit Configurations in order to change the program arguments).
You will need to process command line arguments and fail gracefully if the user inputs an incorrect number of command line arguments. We expect your code to handle the following using command line parsing:
For example, you could implement an interface that supports the following command line arguments for training the model on the given training image and label files and saving the resulting model to a specific file:
./train_model train data/mnistdatatraining/trainingimages data/mnistdatatraining/traininglabels save data/saved_model
You may use any command line parsing library you want or implement your own parsing library (implementation hint: you may want to think about what your code should do if the user provides incorrect command line arguments). Notice that the example we have above is quite simple and requires the filenames to be in a specific order; you may want to look into the following command line parsing libraries if you want a more robust interface.
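If you roll your own parsing, a bare-bones sketch for the positional interface shown above might look like the following (the argument count, positions, and keywords are assumptions taken from that example command; a real implementation should validate more thoroughly):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  // Expecting: train <trainingimages> <traininglabels> save <saved_model>
  // i.e. five arguments after the program name, as in the example above.
  constexpr int kExpectedArgCount = 6;
  if (argc != kExpectedArgCount || std::string(argv[1]) != "train" ||
      std::string(argv[4]) != "save") {
    std::cerr << "Usage: ./train_model train <trainingimages> "
                 "<traininglabels> save <saved_model>" << std::endl;
    return EXIT_FAILURE;  // fail gracefully instead of crashing
  }

  std::string images_path = argv[2];
  std::string labels_path = argv[3];
  std::string model_path = argv[5];

  // ... train a model from images_path and labels_path, then write it to
  // model_path using the overloaded stream operators ...

  return EXIT_SUCCESS;
}
```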
This rubric is not a comprehensive checklist. Please make sure to go over the feedback you received on your previous MPs and ask your moderator/post on Campuswire if you have any questions.
Also, be sure to read over the Google C++ Style Guide here -- some parts, especially with respect to naming conventions, are quite different from what you're used to, and you will be marked off if you don't follow the style guide.
Your grades for each section of the rubric will be weighted as follows: