CS 126 - Software Design Studio

Naive Bayes Week 2

Assigned: 2022-03-30
Due Date: 2022-04-05 by 11:59PM U.S. Central Time

Goals

Implement the classification portion of Naive Bayes
Connect your implementation of Naive Bayes to a sketchpad implemented in Cinder

Background

In this assignment you will refactor and improve your code based on feedback given to you in code review. You will implement the classification portion of this assignment on top of your existing code for training, loading, and saving a model.

Next, you will install Cinder and learn how to use this library. We have provided an implementation of a sketchpad using Cinder, and we expect you to fully understand our code.

Finally, you will connect our sketchpad to your implementation of Naive Bayes so that a digit from 0 to 9 drawn using the sketchpad can be classified.

Gradescope

You will need to submit your GitHub repository to Gradescope. There is a linter which is run on each submission; the results of the linter will be viewable by your code moderator when grading.

Getting Started

You will develop this assignment in the same repository as last week, using your existing code. This assignment will use the Cinder framework -- you should have set this up in your Cinder directory, but if not, make sure to do so before starting this week's assignment. Although we will be providing you Cinder starter code, we suggest taking a look at this Cinder tutorial and the Cinder documentation if you need a refresher.

You should make a branch called week2 on top of your week1 branch (ex: git checkout -b week2 week1) and only commit to this branch. Nothing should be committed to master. When you’re finished with week2, you can open a pull request to request that your code be merged into master, and this pull request will be reviewed by your code moderator.

The Data Files

For this assignment you will need a set of pre-labeled data files for validating your model, which you can download here.

The .zip file contains a few text files:

testimagesandlabels: 1000 images for validation/testing and their labels (in the same format as week one)

Like last week, we expect you not to commit these files to your git repository. Make sure they are excluded via your .gitignore. If you accidentally committed them to your repo, you should remove them with git rm -r --cached <filename>.

The specification of the content inside the images and labels files is the same as the one provided last week.

Part 0 - Testing

Similar to last week, you should test the mathematical correctness of your classifier by using a small dataset where you can cross-check the answers by hand. This includes the likelihood scores and the actual prediction made by the classifier.

Remember that we want to test for mathematical correctness; for example, the following test is not sufficient:

REQUIRE(accuracy >= 0.7);

Testing whether the accuracy of your classifier is over 70% works as a sanity check, but you need to test whether the math behind it is correct -- for example, there might be a bug that mixes up the classifications for classes 0 and 9 but classifies everything else correctly.
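As an illustration of a hand-checkable test, here is a minimal sketch. It assumes the likelihood score is computed in log space as log(prior) plus the sum of the per-pixel log probabilities; check the math document for the exact formula you are expected to use. LogLikelihoodScore and NearlyEqual are hypothetical names, not part of the starter code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for your classifier's scoring routine: returns
// log(prior) + sum of log(probability) over the per-pixel probabilities.
double LogLikelihoodScore(double prior, const std::vector<double>& pixel_probs) {
  assert(prior > 0.0);
  double score = std::log(prior);
  for (const double prob : pixel_probs) {
    score += std::log(prob);
  }
  return score;
}

// Floating-point results should be compared with a tolerance, not with ==.
bool NearlyEqual(double actual, double expected, double tolerance = 1e-9) {
  return std::fabs(actual - expected) <= tolerance;
}
```

With a prior of 0.5 and pixel probabilities 0.5 and 0.25, the score can be checked by hand: 0.5 * 0.5 * 0.25 = 0.0625, so the expected value is log(0.0625). In a Catch2 test you would likely express the same comparison with REQUIRE and the framework's Approx helper rather than a hand-written tolerance function.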

Part 1 - Improving your Code

Review the feedback on your code for the previous week in Gradescope and implement the changes that your moderator suggested. You will be evaluated on improvement this week, so make sure to click through each section of the Gradescope rubric to see comments specific to each topic and to implement verbal feedback from code review.

Part 2 - Classification

For the math behind this portion of the assignment, please refer to this document.

Deliverables (the deliverables for this part of the assignment are also listed in the blue box at the end of section 2.3 in the above document):

Given a trained model and a new image that doesn’t belong to the training dataset, you should be able to calculate the “likelihood scores” for each of the digits 0-9.
You should be able to determine which digit has the highest likelihood score, and classify the image as that digit.
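The two deliverables above can be sketched roughly as follows. This is only one possible shape, under several assumptions: pixels are treated as binary (shaded or unshaded), probabilities were smoothed during training so none is exactly 0 or 1, and scores are summed in log space to avoid underflow. The Model struct and the function names are hypothetical; your own class design from week 1 will differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical model layout: shaded_prob[c][i] is P(pixel i is shaded | class c).
struct Model {
  std::vector<double> priors;                    // P(class = c)
  std::vector<std::vector<double>> shaded_prob;  // indexed [class][pixel]
};

// Returns one log-space likelihood score per class for a flattened image,
// where image[i] is true if pixel i is shaded.
std::vector<double> LikelihoodScores(const Model& model, const std::vector<bool>& image) {
  assert(model.shaded_prob.size() == model.priors.size());
  std::vector<double> scores;
  scores.reserve(model.priors.size());
  for (size_t c = 0; c < model.priors.size(); ++c) {
    assert(model.shaded_prob[c].size() == image.size());
    double score = std::log(model.priors[c]);
    for (size_t i = 0; i < image.size(); ++i) {
      const double p = model.shaded_prob[c][i];
      score += std::log(image[i] ? p : 1.0 - p);
    }
    scores.push_back(score);
  }
  return scores;
}

// Returns the index of the class with the highest likelihood score.
size_t Classify(const Model& model, const std::vector<bool>& image) {
  const std::vector<double> scores = LikelihoodScores(model, image);
  assert(!scores.empty());
  const auto best = std::max_element(scores.begin(), scores.end());
  return static_cast<size_t>(std::distance(scores.begin(), best));
}
```

Keeping the score computation separate from the argmax step makes it possible to test the scores themselves, not just the final prediction.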

Part 3 - Validation

What good is a classifier if you don't know how accurate it is? We've given you a set of images and labels inside the .zip file which you should use to validate your model’s accuracy. This file follows the same format as the training images and labels described in week 1’s documentation, and you should be able to parse it in a similar manner. Remember that we do not want to test our model on our training images and labels, which is why we provided you with separate data for testing.

You should incorporate the following functionality into your existing project:

classify each of the images in testimagesandlabels
compare the result of your classifier to the actual labels given in the same file
print out the accuracy of your Naive Bayes classifier
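The comparison step can be sketched as a small function that takes the predicted and actual labels. ComputeAccuracy is a hypothetical name, and returning 0 for an empty dataset is an assumption; you may prefer to treat that case as an error.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the fraction of predictions that match the true labels, in [0, 1].
double ComputeAccuracy(const std::vector<int>& predicted, const std::vector<int>& actual) {
  assert(predicted.size() == actual.size());
  if (actual.empty()) {
    return 0.0;  // Assumption: report 0 rather than dividing by zero.
  }
  size_t num_correct = 0;
  for (size_t i = 0; i < actual.size(); ++i) {
    if (predicted[i] == actual[i]) {
      ++num_correct;
    }
  }
  return static_cast<double>(num_correct) / static_cast<double>(actual.size());
}
```

For example, three correct predictions out of four give an accuracy of 0.75. Because the function only sees labels, it can be unit tested without loading any image files.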

Remember to make your code flexible, so it should be easy to change/modify the following:

Decide whether to save/load the model from a file, or to train a new model
Differentiate training the model from testing the model (classification)
Change the filenames corresponding to the files containing test images and labels

Part 4 - Visualization

Finally, you will want to see your classifier in action as it performs real-time classification of sketches using Cinder. We have provided you with some starter code, but it is your job to fill in the blanks and get the application up and running. When you are finished, you should be able to draw an image in the sketchpad and classify it by pressing the enter key. You can clear the image drawn by pressing the delete key (or FN-Delete if you have a Mac). Don’t worry if the sketchpad doesn’t classify all the sketches correctly. After all, Naive Bayes is a pretty naive model that makes some sketch-y assumptions. If you aren’t happy with the performance, it might be a cool final project idea to implement a more sophisticated machine learning algorithm!

Command line arguments - More EC

Last week, you may have implemented functionality to parse command line arguments for extra credit. This week, you can extend that functionality to allow the user to choose whether to test their model.

Note that this means that the command line parsing you implemented last week should still work and that you must consider logical combinations of the options provided: for example, you would need to handle the following cases (note that this isn’t a complete list of all the cases you must handle):

Allow the user to train the model only (logically, they would need to then save the model to a file or there wouldn’t be any point in training the model -- but we will leave how to handle cases like this to you)
Allow the user to test a model they’ve loaded in
Allow the user to train a model and test it
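One way to handle combinations like these is to parse the arguments into a plain struct and validate the combination in one place. The sketch below is illustrative only: the flag names (--train, --load, --save, --test), the Options struct, and the rule that exactly one of --train or --load must be given are all assumptions, and your week 1 parser may already use a different scheme.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical option set; an empty string means the flag was not given.
struct Options {
  std::string train_path;  // --train <file>: train a new model from this data
  std::string load_path;   // --load <file>: load a previously saved model
  std::string save_path;   // --save <file>: save the model to this file
  std::string test_path;   // --test <file>: report accuracy on this data
};

// Parses "--flag value" pairs and rejects combinations that make no sense.
Options ParseArgs(const std::vector<std::string>& args) {
  Options options;
  for (size_t i = 0; i < args.size(); i += 2) {
    const std::string& flag = args[i];
    if (i + 1 >= args.size()) {
      throw std::invalid_argument("Missing value for " + flag);
    }
    const std::string& value = args[i + 1];
    if (flag == "--train") {
      options.train_path = value;
    } else if (flag == "--load") {
      options.load_path = value;
    } else if (flag == "--save") {
      options.save_path = value;
    } else if (flag == "--test") {
      options.test_path = value;
    } else {
      throw std::invalid_argument("Unknown flag " + flag);
    }
  }
  // Assumed policy: the model must come from exactly one source.
  if (options.train_path.empty() == options.load_path.empty()) {
    throw std::invalid_argument("Specify exactly one of --train or --load");
  }
  return options;
}
```

In main, the arguments could be collected with std::vector<std::string> args(argv + 1, argv + argc) before calling the parser. Taking a vector instead of argc/argv directly also makes the parser easy to unit test.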

Other Extra Credit Opportunities

Important: you should focus on finishing your assignment before working on any of the suggested extra credit features. Furthermore, try to gauge the amount of time you have left: if you start an extra credit feature, you should finish it. Any non-trivial enhancements and/or additional classification algorithms will be awarded extra credit. Here are some ideas that might interest you (if you already implemented one of these last week, you will have to implement a different one this week):

k-Nearest Neighbors
Gaussian Naive Bayes
Decision Trees
Voting/Boosting
Artificial Neural Networks (difficult!)
Output a confusion matrix generated by your classifier (simple!)

Feel free to use Machine Learning libraries for the extra credit portion.
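For the confusion matrix idea, a minimal sketch might look like the following. It assumes labels are stored as indices in the range [0, num_classes); BuildConfusionMatrix and the ConfusionMatrix alias are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ConfusionMatrix = std::vector<std::vector<size_t>>;

// matrix[a][p] counts the images whose true label is a and whose predicted label is p.
ConfusionMatrix BuildConfusionMatrix(const std::vector<size_t>& actual,
                                     const std::vector<size_t>& predicted,
                                     size_t num_classes) {
  assert(actual.size() == predicted.size());
  ConfusionMatrix matrix(num_classes, std::vector<size_t>(num_classes, 0));
  for (size_t i = 0; i < actual.size(); ++i) {
    assert(actual[i] < num_classes && predicted[i] < num_classes);
    ++matrix[actual[i]][predicted[i]];
  }
  return matrix;
}
```

Correct classifications land on the diagonal, so large off-diagonal entries show which pairs of digits the classifier confuses with each other.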

Grading and Deliverables

Improve your code by reading and implementing the changes your moderator suggested. Your implementation must still support all the features from last week’s assignment (proper use of operator overloading, training the model accurately, etc.).
Classify images from a file based on your model.
Use Cinder to create a visual representation of your classifier. Note that you’ll need to adapt the starter code we’ve provided.
You must test your classifier for correctness using unit tests (Does your classifier hit a certain percentage of accuracy? Does it behave as expected for small sets of test data? Does it work for different image sizes, given training and testing on the same image size? Is the math behind each step of classification correct?)
Lastly, you must follow the Google C++ Style Guide with regard to naming and whitespace.

Hints

It might help to review the workshops conducted on this assignment to become familiar with the Cinder framework. Also, be sure to review the documents linked from this documentation; they contain important information that will be required for the assignment.

Assignment Rubric

This rubric is not a comprehensive checklist. Please make sure to go over the feedback you received on your previous MPs and ask your moderator/post on Campuswire if you have any questions.

Similar to API Adventures, we expect you to take your feedback from Naive Bayes Part 1 into account -- you will lose points for not changing your code in accordance with your moderator’s feedback.

C++

Headers have inclusion guards (#ifndef or #pragma once)
All method specifications are in the header file, and operational line comments are left in the source file.
Code has appropriate namespacing
Usage of const and references
- Instance methods marked as const where appropriate
- Parameters passed as const reference where appropriate
- For-each loop variables declared as const reference where appropriate
size_t used in lieu of int where appropriate
Do not write using namespace std; it invites naming collisions. Instead, import only the specific names you need, e.g. using std::string
Proper memory management
- If you do decide to use new, remember that every call to new must be matched by a call to delete. Remember that non-built-in types can be allocated on the stack (without using ‘new’ in C++)
- Avoid returning a pointer or reference to a function's local (stack) variable; the memory is no longer valid once the function returns (this is called a “dangling pointer” memory error)
- Initialize built-in types allocated on the stack (ex: int x = 0; rather than int x;) before using them
Appropriate usage of structs, classes, and namespaces
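Several of the points above can be seen together in one short, purely illustrative class. LabelCounter is a hypothetical example and is not something the assignment asks you to write.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using std::string;  // Pull in only the names you need, not all of std.

class LabelCounter {
 public:
  // The parameter is passed by const reference: no copy is made, and the
  // caller's vector cannot be modified.
  void AddAll(const std::vector<string>& labels) {
    for (const string& label : labels) {  // const reference loop variable
      if (!label.empty()) {
        ++count_;
      }
    }
  }

  // This accessor does not modify the object, so it is marked const.
  size_t count() const { return count_; }

 private:
  size_t count_ = 0;  // Built-in type initialized before use.
};
```

An instance can live on the stack (LabelCounter counter;), so no call to new or delete is needed.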

Readability and flexibility of code

Modularity: each method should perform one distinct task
It should be easy to read through each method, follow its control flow, and verify its correctness
The code should be flexible/ready for change (no magic numbers, no violations of DRY)

Object decomposition

Member variables stored by each class
- Classes should store their data using the most intuitive data structure
- No "missing" member variables
- No member variables which should be local variables
- No redundancy / storing multiple copies of the same data in different formats
Encapsulation
- Appropriate access modifiers
- Member variables should generally only be modified by member functions in the same class
- The interface of a class should be intuitive/abstract, and external code should only interact with the class via the interface
  - By intuitive, we mean that it should be easy to understand and use the class, and there shouldn’t be any hidden assumptions about how the class should be used
  - By abstract, we mean that an external client shouldn’t need to worry about the internal details of the class
No unnecessary getters/setters (exception: you may make getters and setters for testing purposes)

Documentation

Specifications
- Specifications are required for all functions which are part of the public interface of a class
- Specifications should precisely describe the inputs and outputs of a function, and should also describe what the function does (e.g. mutating state of object)
- Specifications should also be formatted properly
Inline comments should not describe things which are obvious from the code, and should describe things which need clarification

Naming

Semantics: names should effectively describe the entities they represent; they should be unambiguous and leave no potential for misinterpretation. However, they should not be too verbose.
Style: names should follow the Google C++ Style Guide

Layout

Spacing should be readable and consistent; your code should look professional
Vertical whitespace should be meaningful
- Vertical whitespace can help create paragraphs
- Having 2+ empty lines in a row, or empty lines at the beginning or end of files, is usually a waste of space and looks inconsistent
Horizontal whitespace should be present where required by the Google Style Guide
Lines should all be under 100 characters; no horizontal scrolling should be necessary

Testing

You should make sure all classes of inputs and outputs are tested.
Boundary/edge cases often cause different/unexpected behavior, and thus, they should be tested
Your tests should cover all of the functionality that you’ve implemented. In other words, every line of code should be exercised by some test case, unless the assignment documentation says otherwise
- You should be testing for correctness. Testing whether your model has an accuracy over a certain threshold does not guarantee that your model is correct.
- Test every non-trivial public method with all possible classes of input
Each individual test case should only serve one coherent purpose. Individual test cases should not have assertions testing unrelated things
Your tests, like your code, should be organized and easy to understand. This includes:
- Easy to verify thoroughness / all possibilities covered
- Easy to verify the correctness of each test case
- Clear categories of test cases, where similar tests are grouped together
Test case descriptions make the purpose of each test case clear
Appropriate usage of SECTION and TEST_CASE to organize your code

Process

Commit modularity
- Code should be checked-in periodically/progressively in logical chunks
- Unrelated changes should not be bundled in the same commit
- Do not treat commits as save points
Commit messages
- Should concisely and accurately describe the changes made
- Should have a consistent style and look professional
- First word of the message should be a verb, and it should be capitalized
- Commit message header should be no longer than 50 characters; in general, if you find the need to use “and” to group multiple unrelated descriptions together, you should break your commit message up

Presentation

Arrived on time with all necessary materials and ready to go
Good selection of topics to focus on
Logical order of presentation
Appropriate pacing and engagement of the fellow students
Speaking loud enough and enunciating clearly

Participation

Each student should contribute at least one meaningful comment or question for every other student who presents in their code review
Students must behave respectfully toward the moderator and other students

Weightings

Your grades for each section of the rubric will be weighted as follows:

C++ (10%)
Readability and flexibility of code (15%)
Object decomposition (20%)
Documentation (7.5%)
Naming (5%)
Layout (5%)
Testing (20%)
Process (7.5%)
Presentation (5%)
Participation (5%)