
Naive Bayes

Assigned: 2020-10-06
Due Date: 2020-10-13 by 11:59PM U.S. Central Time

Why did the naive Bayesian suddenly feel patriotic when he heard fireworks?
He assumed independence.

— Some dude on Quora, idk: https://www.quora.com/What-are-some-good-machine-learning-jokes

Goals

Background

Handwriting recognition is the ability of a computer to interpret handwritten text as Unicode characters. In this assignment, we will be attempting to recognize numbers from images using a Naive Bayes classifier.

Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes’ theorem. These algorithms work by combining the probabilities that an instance belongs to a class based on the values of a set of features. In this case, we will be classifying images of handwritten digits; each image depicts a number between 0 and 9, inclusive.
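As a rough sketch of the idea (your implementation should follow the math document linked under Part 2), a Naive Bayes classifier picks the class c with the largest posterior probability given the observed features. Under the "naive" assumption that the features f_1, ..., f_n are independent given the class, that posterior is proportional to:

P(c \mid f_1, \ldots, f_n) \;\propto\; P(c) \prod_{i=1}^{n} P(f_i \mid c)

Here P(c) is the prior probability of class c (a digit), and each P(f_i | c) is a feature probability (roughly, how likely pixel i is to be shaded for images of that digit); both are estimated from the training data.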

There will be workshops helping you understand the assignment and various concepts related to it (such as operator overloading, command line arguments, and the math behind Naive Bayes), so we strongly suggest that you attend them.

Gradescope

You will need to submit your GitHub repository to Gradescope.

Getting Started

Get your copy of the assignment here.

Make sure that you have completed the Cinder setup instructions from this Campuswire post.

Create a directory called my-projects in your ~/Cinder folder. Navigate to the my-projects folder, clone your repository within it, and then open the repository using CLion. You should be able to run three different targets; make sure that all three of them work. Important: your naivebayes repo should be in the following location: /path/to/cinder/my-projects/naivebayes-your-username. Again, you should clone your repository within my-projects rather than renaming your repo to my-projects. If your naivebayes repo is one directory above or below where it should be, you will not be able to build your project and will most likely get an "include could not find load file" error.

macOS users: It’s possible that you may come across a Permission denied error. In this case, try doing: Edit configurations -> Executable -> Select Other -> Find my-projects/<repo_name>/cmake-build-debug/Debug/<target_name>/Contents/MacOS/<target_name>, and then click run again.

You should make a branch called week1 in your repository, and only commit to this branch. Nothing should be committed to master. When you’re finished with week 1, you can open a pull request to request that your code be merged into master, and this pull request will be reviewed by your code moderator. (Then you’ll make a branch called week2 in the second week of the assignment, and go through the same process.)

We've changed the format of the CMake project a bit:

We’ve provided starter code for you. Please rename the files we asked you to rename. Note that anything related to a sketchpad (the provided code) or Cinder will be used for week 2. You don’t have to worry about it this week, but we suggest taking a look at it when thinking about how to design your code for week 1. You should only be writing code in include/core, src/core, and tests during week 1.

The Data Files

For this assignment, you will need a set of pre-labeled data files that we will use for training, which you can download here. The .zip file contains a few text files:

Do not commit these files to your Git repo! Instead, make sure they're blacklisted using the .gitignore. If you do accidentally commit them to your repo, you should remove them with git rm -r --cached <filename>.
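For example, if you extract the .zip into a data/ directory inside your repo (the directory name is just an assumption here; use whatever layout you prefer, as long as it is ignored), your .gitignore could include entries along these lines:

# Keep the provided training data and any saved models out of the repo
data/
*.zip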

You're going to build a Naive Bayes classification model that can be used to classify unknown images as numbers from 0 to 9. In the images file, each image is represented by 28 lines of 28 ASCII characters, so each image contains 28 x 28 pixels. An individual character in an image will either be ' ' (a single space) for a white pixel, a '+' for a gray pixel, or a '#' for a black pixel. For simplicity, we suggest assuming that white pixels are "not shaded" and gray or black pixels are "shaded"; in other words, we suggest treating the gray and black pixels in the same way. For extra credit, you can differentiate between gray and black pixels when building your classifier.
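If you go with the suggested two-category scheme, a small helper along these lines keeps the character handling in one place (the function name and namespace are placeholders, not required names):

namespace naivebayes {

// Treats gray ('+') and black ('#') pixels as shaded and the white pixel
// (' ') as not shaded, per the suggested simplification.
bool IsShaded(char pixel) {
  return pixel == '+' || pixel == '#';
}

}  // namespace naivebayes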

Each image has a classification in the corresponding labels file; for example, the fourth 28 x 28 image’s correct classification (from trainingimages) would be the fourth number in traininglabels.

Flexibility in Image Size

Your model should be able to train on arbitrary image sizes. For example, if we gave you a file of 5 x 5 images, your model should be able to handle that without any reconfiguration. You may assume that all of the images you receive for a single training session are of the same size. Also, your model only needs to be able to classify images that are of the same size as those in the training data. This flexibility, with the provided caveats, will come in handy when you’re writing tests.

Part 0: Testing

So how do you know if your model is doing what it’s supposed to? You have to test it! Remember that we emphasize test-driven development in this class: writing tests first can help you understand what code you should write and which edge cases you should consider.

For this assignment, testing is particularly crucial — since you have not yet implemented the final classification step of Naive Bayes, you will not have any feedback on the correctness of your code without thorough testing. Your tests should include (but should not be limited to):

In order to do this, you should create your own dataset containing a smaller number of smaller images. This allows you to compute the expected outputs by hand and utilize those in your test cases.
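For example, a hand-made file of two 3 x 3 "images" (the size and pixel values here are made up purely for illustration; each line is exactly 3 characters wide, including any trailing spaces) might look like this:

+#+
# #
+#+
 # 
 # 
 # 

with a corresponding labels file containing one digit per line, such as:

0
1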

Another helpful tip: comparing two floating point numbers for exact equality is unreliable and likely to fail. To deal with this, you may find this Catch2 feature useful: https://github.com/catchorg/Catch2/blob/master/docs/assertions.md#floating-point-comparisons
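For instance, assuming your tests use Catch2 v2's single header (the include path below may differ in your project), an approximate comparison looks like this:

#include <catch2/catch.hpp>

TEST_CASE("Approx allows a tolerance instead of exact equality") {
  double computed = 1.0 / 3.0;
  // Passes as long as computed is within 0.0001 of the expected value.
  REQUIRE(computed == Approx(0.333333).margin(0.0001));
}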

Part 1: Operator Overloading

In this assignment, you'll have your first exposure to operator overloading in C++; specifically, you’ll be overloading the >> operator to support reading in data.

You will be using file streams to read from the provided data files, and you will need to overload std::istream &operator>> to support reading in the set of training images. The reason we prefer overloading >> over a method that accepts a filename string as a parameter is flexibility: with this design, you can support receiving data from arbitrary sources, like network connections, although that's not something you have to worry about for this assignment.

When implementing this section, keep in mind that you’ll need to be able to support arbitrary image sizes.
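As a rough sketch of what this can look like (the Image class and its member are assumptions, not required names), the overload is a free function that reads from any std::istream. Rather than hardcoding 28, this version assumes images are square and stops once it has read as many rows as the first row has characters:

#include <istream>
#include <string>
#include <vector>

namespace naivebayes {

class Image {
 public:
  friend std::istream& operator>>(std::istream& input, Image& image);

 private:
  std::vector<std::string> rows_;  // one string of pixel characters per row
};

// Reads a single image's rows from the stream; returns the stream so that
// reads can be chained, e.g. while (input >> image) { ... }.
std::istream& operator>>(std::istream& input, Image& image) {
  image.rows_.clear();
  std::string line;
  while (std::getline(input, line)) {
    image.rows_.push_back(line);
    // Assumes square images: stop once the row count matches the first
    // row's width.
    if (image.rows_.size() == image.rows_.front().size()) {
      break;
    }
  }
  return input;
}

}  // namespace naivebayes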

Note: If you need more help understanding operator overloading, we strongly recommend attending the C++ workshops.

Part 2: Training the Model

For the math behind this assignment, please refer to this document.

We expect you to develop a model that stores the feature and prior probabilities computed from our provided training files. Please note that this is only one of your deliverables; see the end of this document for a complete list.

Part 3: Saving and loading data

During the training stage, your program will have to read and process a total of 28 x 28 x 5000 = 3.92 million characters; in other words, there’s a large computational cost associated with training. Regenerating the model every time we want to classify something would be inefficient. To keep the runtime costs low, we will require you to be able to save a trained model to a file, and load that file back into a model at a later time -- hint: you may want to review how operator overloading works.

Here’s some general advice: it’ll be useful to ask yourself: what data is essential to the state of my model, and how can I represent it? Then, if you write all of this information into a file, you should be able to read it back from that file at a later time. Remember that the file containing the model does not have to be human-readable -- it just needs to be parseable by your program when loading the model back in.
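One way to approach this (again just a sketch; the class name, member variable, and file layout are assumptions) is to overload operator<< to write the model's state to a stream and operator>> to read it back in the same order:

#include <cstddef>
#include <istream>
#include <ostream>
#include <vector>

namespace naivebayes {

class Model {
 public:
  friend std::ostream& operator<<(std::ostream& output, const Model& model);
  friend std::istream& operator>>(std::istream& input, Model& model);

 private:
  std::vector<double> prior_probabilities_;  // one entry per digit class
};

// Writes the number of classes followed by each prior probability. Feature
// probabilities would be saved the same way; they are omitted to keep the
// sketch short.
std::ostream& operator<<(std::ostream& output, const Model& model) {
  output << model.prior_probabilities_.size() << '\n';
  for (double prior : model.prior_probabilities_) {
    output << prior << '\n';
  }
  return output;
}

// Reads the data back in exactly the order operator<< wrote it.
std::istream& operator>>(std::istream& input, Model& model) {
  size_t class_count = 0;
  input >> class_count;
  model.prior_probabilities_.assign(class_count, 0.0);
  for (size_t i = 0; i < class_count; ++i) {
    input >> model.prior_probabilities_[i];
  }
  return input;
}

}  // namespace naivebayes

You can then save with a std::ofstream and load with a std::ifstream, reusing the same overloads in both your executable and your tests.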

Part 4: Creating an executable that accepts command line arguments

Instead of hardcoding the file paths of the training images, training labels, and location to save the model, you should allow the user to specify these file paths via command-line arguments. (In CLion, you can go to Run > Edit Configurations in order to change the program arguments).

You will need to process command line arguments and fail gracefully if the user inputs an incorrect number of command line arguments. We expect your code to handle the following using command line parsing:

For example, you could implement an interface that supports the following command line invocation, which trains the model using the given training image and training label files and saves the model to a specific file:

./train_model train data/mnistdatatraining/trainingimages data/mnistdatatraining/traininglabels save data/saved_model

You may use any command line parsing library you want or implement your own parsing library (implementation hint: you may want to think about what your code should do if the user provides incorrect command line arguments). Notice that the example we have above is quite simple and requires the filenames to be in a specific order; you may want to look into the following command line parsing libraries if you want a more robust interface.
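As an illustration of the simple positional approach shown above (the argument order matches the example command; the actual training and saving calls are left as comments because they depend on your own classes), a minimal main could look like this:

#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  // Expected usage, matching the example above:
  //   ./train_model train <images_path> <labels_path> save <model_path>
  if (argc != 6 || std::string(argv[1]) != "train" ||
      std::string(argv[4]) != "save") {
    std::cerr << "Usage: ./train_model train <images> <labels> save <model>"
              << std::endl;
    return 1;  // fail gracefully on an incorrect number or order of arguments
  }

  std::string images_path = argv[2];
  std::string labels_path = argv[3];
  std::string model_path = argv[5];

  // Train the model on images_path/labels_path and write it to model_path
  // using your Model class and its stream overloads (not shown here).
  std::cout << "Training on " << images_path << " and " << labels_path
            << ", saving to " << model_path << std::endl;
  return 0;
}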

Grading and Deliverables

Assignment Rubric

This rubric is not a comprehensive checklist. Please make sure to go over the feedback you received on your previous MPs and ask your moderator/post on Campuswire if you have any questions.

Also, be sure to read over the Google C++ Style Guide here -- some parts, especially with respect to naming conventions, are quite different from what you’re used to and you will be marked off if you don’t follow the style guide.


C++

  • Headers have inclusion guards (#ifndef or #pragma once)
  • All method specifications are in the header file, and operational line comments are left in the source file.
  • Code has appropriate namespacing
  • Usage of const and references
    • Instance methods marked as const where appropriate
    • Parameters passed as const reference where appropriate
    • For-each loop variables declared as const reference where appropriate
  • size_t used in lieu of int where appropriate
  • Avoid using namespace std to avoid naming collisions. Instead, only bring in specific names, e.g. using std::string
  • Proper memory management
    • If you do decide to use new, remember that every call to new must be matched by a call to delete. Remember that non-built-in types can be allocated on the stack (without using ‘new’ in C++)
    • Avoid returning a pointer/reference to memory on the stack in a function (this is called a “dangling pointer” memory error)
    • Initialize built-in types allocated on the stack (ex: int x = 0; rather than int x;) before using them
  • Appropriate usage of structs, classes, and namespaces

Readability and flexibility of code

  • Modularity: each method should perform one distinct task
  • It should be easy to read through each method, follow its control flow, and verify its correctness
  • The code should be flexible/ready for change (no magic numbers, no violations of DRY)

Object decomposition

  • Member variables stored by each class
    • Classes should store their data using the most intuitive data structure
    • No "missing" member variables
    • No member variables which should be local variables
    • No redundancy / storing multiple copies of the same data in different formats
  • Encapsulation
    • Appropriate access modifiers
    • Member variables should generally only be modified by member functions in the same class
    • The interface of a class should be intuitive/abstract, and external code should only interact with the class via the interface
      • By intuitive, we mean that it should be easy to understand and use the class, and there shouldn’t be any hidden assumptions about how the class should be used
      • By abstract, we mean that an external client shouldn’t need to worry about the internal details of the class
  • No unnecessary getters/setters (exception: you may make getters and setters for testing purposes)

Documentation

  • Specifications
    • Specifications are required for all functions which are part of the public interface of a class
    • Specifications should precisely describe the inputs and outputs of a function, and should also describe what the function does (e.g. mutating state of object)
    • Specifications should also be formatted properly
  • Inline comments should not describe things which are obvious from the code, and should describe things which need clarification

Naming

  • Semantics: names should effectively describe the entities they represent; they should be unambiguous and leave no potential for misinterpretation. However, they should not be too verbose.
  • Style: names should follow the Google C++ Style Guide

Layout

  • Spacing should be readable and consistent; your code should look professional
  • Vertical whitespace should be meaningful
  • Vertical whitespace can help create paragraphs
  • Having 2+ empty lines in a row, or empty lines at the beginning or end of files, is usually a waste of space and looks inconsistent
  • Horizontal whitespace should be present where required by the Google Style Guide
  • Lines should all be under 100 characters; no horizontal scrolling should be necessary

Testing

  • You should make sure all classes of inputs and outputs are tested.
  • Boundary/edge cases often cause different/unexpected behavior, and thus, they should be tested
  • Your tests should cover all of the functionality that you’ve implemented. In other words, every line of code should be exercised by some test case, unless the assignment documentation says otherwise
    • You should be testing for correctness. Testing whether your model has an accuracy over a certain threshold does not guarantee that your model is correct.
    • Test every non-trivial public method with all possible classes of input
  • Each individual test case should only serve one coherent purpose. Individual test cases should not have assertions testing unrelated things
  • Your tests, like your code, should be organized and easy to understand. This includes:
    • Easy to verify thoroughness / all possibilities covered
    • Easy to verify the correctness of each test case
    • Clear categories of test cases, where similar tests are grouped together
  • Test case descriptions make the purpose of each test case clear
  • Appropriate usage of SECTION and TEST_CASE to organize your code

Process

  • Commit modularity
    • Code should be checked-in periodically/progressively in logical chunks
    • Unrelated changes should not be bundled in the same commit
    • Do not treat commits as save points
  • Commit messages
    • Should concisely and accurately describe the changes made
    • Should have a consistent style and look professional
    • First word of the message should be a verb, and it should be capitalized
    • Commit message header should be no longer than 50 characters; in general, if you find the need to use “and” to group multiple unrelated descriptions together, you should break your commit message up

Presentation

  • Arrived on time with all necessary materials and ready to go
  • Good selection of topics to focus on
  • Logical order of presentation
  • Appropriate pacing and engagement of the fellow students
  • Speaking loud enough and enunciating clearly

Participation

  • Each student should contribute at least one meaningful comment or question for every other student who presents in their code review
  • Students must behave respectfully to moderator and other students

Weightings

Your grades for each section of the rubric will be weighted as follows:

  • C++ (10%)
  • Readability and flexibility of code (15%)
  • Object decomposition (20%)
  • Documentation (7.5%)
  • Naming (5%)
  • Layout (5%)
  • Testing (20%)
  • Process (7.5%)
  • Presentation (5%)
  • Participation (5%)