Goals
- Parse command line arguments
- Implement the classification portion of Naive Bayes
- Connect your implementation of Naive Bayes to a sketchpad implemented in Cinder
Background
In this assignment you will refactor and improve your code based on feedback given to you in code review. You will implement the classification portion of this assignment on top of your existing code for training, loading, and saving a model.
Next, you will install Cinder and learn how to use this library. We have provided an implementation of a sketchpad using Cinder, and we expect you to fully understand our code.
Next, you will connect our sketchpad to your implementation of Naive Bayes such that a number from 0 - 9 drawn using the sketchpad can be classified.
Gradescope
You will need to submit your GitHub repository to Gradescope. There is a linter which is run on each submission; the results of the linter will be viewable by your code moderator when grading.
Getting Started
You will develop this assignment in the same repository as last week, using your existing code. This assignment will use the Cinder framework -- you should have set up Cinder as a part of the C++ Tic Tac Toe assignment, but if not, make sure to download the latest version for your platform here. Although we will be providing you Cinder starter code, we suggest taking a look at this basic Cinder tutorial and the Cinder documentation. There will also be a workshop tutorial discussing how to use Cinder this week.
You should make a branch called week2 on top of your week1 branch (ex: git checkout -b week2 week1) and only commit to this branch. Nothing should be committed to master. When you’re finished with week2, you can open a pull request to request that your code be merged into master, and this pull request will be reviewed by your code moderator.
The Data Files
For this assignment you will need a set of pre-labeled data files that we will use for validating data, which you can download here.
The .zip file contains a few text files:
- testimages: 1000 images for validation/testing purposes
- testlabels: The correct classification for each testing image
Like last week, we expect you not to commit these files to your git repository. Make sure they're excluded using the .gitignore. If you accidentally committed them to your repo, you should remove them with git rm -r --cached <filename>.
The specification of the content inside the images and labels files is the same as the one provided last week.
Part 0 - Testing
Similar to last week, you should test the mathematical correctness of your classifier by using a small dataset where you can cross-check the answers by hand. This includes the likelihood scores and the actual prediction made by the classifier.
Remember that we want to test for mathematical correctness; for example, the following test is not sufficient:
REQUIRE(accuracy >= 0.7);
Testing whether the accuracy of your classifier is over 70% works as a sanity check, but you need to test whether the math behind it is correct -- for example, there might be a bug that mixes up the classifications for classes 0 and 9 but classifies everything else correctly.
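To make the point above concrete, here is a minimal sketch of a confusion matrix, which breaks results down per class instead of collapsing them into one accuracy number. The function name BuildConfusionMatrix and its signature are hypothetical, not part of any starter code. A per-class breakdown like this complements, rather than replaces, tests that compare likelihood scores against values computed by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: matrix[a][p] counts how many images whose true label
// is `a` were classified as `p`. Correct predictions land on the diagonal.
std::vector<std::vector<size_t>> BuildConfusionMatrix(
    const std::vector<size_t>& actual, const std::vector<size_t>& predicted,
    size_t num_classes) {
  assert(actual.size() == predicted.size());
  std::vector<std::vector<size_t>> matrix(
      num_classes, std::vector<size_t>(num_classes, 0));
  for (size_t i = 0; i < actual.size(); ++i) {
    assert(actual[i] < num_classes && predicted[i] < num_classes);
    ++matrix[actual[i]][predicted[i]];
  }
  return matrix;
}
```

For example, with one made-up image per digit and a classifier that swaps only 0 and 9, eight of ten predictions are correct, so REQUIRE(accuracy >= 0.7) passes. The matrix still shows the bug: the entries at [0][0] and [9][9] are zero while [0][9] and [9][0] are one.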
Part 1 - Improving your Code
Review the feedback on your code for the previous week in Gradescope and implement the changes that your moderator suggested. You will be evaluated on improvement this week, so make sure to click through each section of the Gradescope rubric to see comments specific to each topic and to implement verbal feedback from code review.
Part 2 - Command line arguments
Last week, you added functionality to parse command line arguments. This week, you will be extending your functionality to allow the user to choose whether to test their model.
Note that this means that the command line parsing you implemented last week must still work and that you must consider logical combinations of the options provided: for example, you need to handle the following cases (note that this isn’t a complete list of all the cases you must handle):
- Allow the user to train the model only (logically, they would need to then save the model to a file or there wouldn’t be any point in training the model -- but we will leave how to handle cases like this to you)
- Allow the user to test a model they’ve loaded in
- Allow the user to train a model and test it
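The combinations listed above can be sketched as a small parser. This is only an illustration under stated assumptions: the flag names (--train, --load, --save, --test), the Options struct, and the ParseOptions function are all hypothetical, and the rule that every run needs a model source is just one possible policy. It requires C++17 for std::optional.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical option set: every flag name here is illustrative only.
struct Options {
  std::optional<std::string> train_images;  // --train <images> <labels>
  std::optional<std::string> train_labels;
  std::optional<std::string> load_path;     // --load <model file>
  std::optional<std::string> save_path;     // --save <model file>
  bool test = false;                        // --test
};

// Parses the arguments that follow the program name (argv[1] onward).
Options ParseOptions(const std::vector<std::string>& args) {
  Options options;
  for (size_t i = 0; i < args.size(); ++i) {
    const std::string& flag = args[i];
    if (flag == "--train") {
      if (i + 2 >= args.size()) {
        throw std::invalid_argument("--train needs images and labels files");
      }
      options.train_images = args[++i];
      options.train_labels = args[++i];
    } else if (flag == "--load" || flag == "--save") {
      if (i + 1 >= args.size()) {
        throw std::invalid_argument(flag + " needs a model file");
      }
      std::optional<std::string>& target =
          (flag == "--load") ? options.load_path : options.save_path;
      target = args[++i];
    } else if (flag == "--test") {
      options.test = true;
    } else {
      throw std::invalid_argument("unknown option: " + flag);
    }
  }
  // One possible policy: a model must come from training or from a file.
  if (!options.train_images.has_value() && !options.load_path.has_value()) {
    throw std::invalid_argument("a model must be trained or loaded");
  }
  return options;
}
```

From main, this could be called as ParseOptions(std::vector<std::string>(argv + 1, argv + argc)). Separating parsing from acting on the options keeps the combination rules in one place, which also makes them easy to unit test.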
Part 3 - Classification
For the math behind this portion of the assignment, please refer to this document.
Deliverables (the deliverables for this part of the assignment are also listed in the blue box at the end of section 2.3 in the above document):
- Given a trained model and a new image that doesn’t belong to the training dataset, you should be able to calculate the “likelihood scores” for each of the digits 0-9.
- You should be able to determine which digit has the highest likelihood score, and classify the image as that digit.
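The two deliverables above can be sketched as follows. This is not the required design: the Model struct, its field names, and both function names are hypothetical, and the sketch assumes binary (shaded/unshaded) pixel features with scores computed as sums of logarithms. Check the linked document for the exact formula your implementation must follow.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical trained model for binary pixel features.
struct Model {
  std::vector<double> priors;  // priors[c] = P(class = c)
  // shaded_probability[c][i] = P(pixel i is shaded | class = c)
  std::vector<std::vector<double>> shaded_probability;
};

// Returns one score per class: log(prior) plus the sum of the log
// probabilities of each observed pixel value (assumed scoring scheme).
std::vector<double> ComputeLikelihoodScores(const Model& model,
                                            const std::vector<bool>& image) {
  std::vector<double> scores;
  scores.reserve(model.priors.size());
  for (size_t c = 0; c < model.priors.size(); ++c) {
    assert(model.shaded_probability[c].size() == image.size());
    double score = std::log(model.priors[c]);
    for (size_t i = 0; i < image.size(); ++i) {
      const double p = model.shaded_probability[c][i];
      score += std::log(image[i] ? p : 1.0 - p);
    }
    scores.push_back(score);
  }
  return scores;
}

// Returns the index of the class with the highest likelihood score.
size_t Classify(const Model& model, const std::vector<bool>& image) {
  const std::vector<double> scores = ComputeLikelihoodScores(model, image);
  assert(!scores.empty());
  return static_cast<size_t>(std::distance(
      scores.begin(), std::max_element(scores.begin(), scores.end())));
}
```

Summing logarithms instead of multiplying raw probabilities avoids floating-point underflow when an image has many pixels, and it does not change which class has the highest score. Note that a probability of exactly 0 or 1 would produce an infinite score, so the model's probabilities are assumed to lie strictly between the two.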
Part 4 - Validation
What good is a classifier if you don't know how accurate it is? We've given you a set of images and labels inside the .zip file (testimages, testlabels) which you should use to validate your model’s accuracy. These 2 files follow the same format as the training images and labels described in week 1’s documentation and you should be able to parse them in a similar manner. Remember that we do not want to test our model on our training images and labels, which is why we provided you with two new files for testing.
You should incorporate the following functionality to your existing project:
- classify each of the images in testimages
- compare the result of your classifier to the actual labels (in testlabels)
- print out the accuracy of your Naive Bayes classifier
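The comparison step above reduces to counting matches between two label sequences. Here is a minimal sketch, assuming the predictions and the true labels have already been collected into vectors; the name ComputeAccuracy and the choice to throw on bad input are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the fraction of predictions that match the
// true labels, as a value between 0 and 1.
double ComputeAccuracy(const std::vector<size_t>& predicted,
                       const std::vector<size_t>& actual) {
  if (predicted.size() != actual.size()) {
    throw std::invalid_argument("predictions and labels differ in length");
  }
  if (actual.empty()) {
    throw std::invalid_argument("cannot compute accuracy of zero images");
  }
  size_t correct = 0;
  for (size_t i = 0; i < actual.size(); ++i) {
    if (predicted[i] == actual[i]) {
      ++correct;
    }
  }
  return static_cast<double>(correct) / static_cast<double>(actual.size());
}
```

Rejecting mismatched or empty input up front avoids silently dividing by zero or reporting an accuracy for a truncated labels file.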
Similar to Week 1, we expect you to utilize command line argument parsing so your executable can both train a model and classify numbers. We expect your executable to be able to
- Differentiate training the model from testing the model (classification)
- Take in filenames corresponding to the files containing test images and test labels, if the user wishes to specify them
Part 5 - Visualization
Finally, you will want to see your classifier in action as it performs real-time classification of sketches using Cinder. We have provided you with some starter code, but it is your job to fill in the blanks and get the application up and running. When you are finished, you should be able to draw an image in the sketchpad and classify it by pressing the enter key. You can clear the image drawn by pressing the delete key (or FN-Delete if you have a Mac). Don’t worry if the sketchpad doesn’t classify all the sketches correctly. After all, Naive Bayes is a pretty naive model that makes some sketch-y assumptions. If you aren’t happy with the performance, it might be a cool final project idea to implement a more sophisticated machine learning algorithm!
Important: you should focus on finishing your assignment before working on any of the suggested extra credit features. Furthermore, try to gauge the amount of time you have left: if you start an extra credit feature, you should finish it. Any non-trivial enhancements and/or additional classification algorithms will be awarded extra credit. Here are some ideas that might interest you (You would have to implement something else if you already implemented one of the below the previous week):
Feel free to use Machine Learning libraries for the extra credit portion.
Grading and Deliverables
- Improve your code by reading and implementing the changes your moderator suggested. Your implementation must still support all the features from last week’s assignment (proper use of operator overloading, training the model accurately, etc)
- Classify images from a file based on your model.
- Use Cinder to create a visual representation of your classifier. Note that you’ll need to adapt the starter code we’ve provided.
- You must test your classifier for correctness using unit tests (Does your classifier hit a certain percentage of accuracy? Does it behave as expected for small sets of test data? Does it work for different image sizes, given training and testing on the same image size? Is the math behind each step of classification correct?)
- Lastly, you must follow the Google C++ Style Guide with regard to naming and whitespace.
Hints
It might help to review the workshops conducted on this assignment to become familiar with the Cinder framework. Also, be sure to review the documents hyperlinked in the documentation; they contain important information that will be required for the assignment.
Assignment Rubric
This rubric is not a comprehensive checklist. Please make sure to go over the feedback you received on your previous MPs and ask your moderator/post on Campuswire if you have any questions.
Similar to API Adventures, we expect you to take your feedback from Naive Bayes Part 1 into account -- you will lose points for not changing your code in accordance with your moderator’s feedback.
C++
- Headers have inclusion guards (#ifndef or #pragma once)
- All method specifications are in the header file, and operational line comments are left in the source file.
- Code has appropriate namespacing
- Usage of const and references
  - Instance methods marked as const where appropriate
  - Parameters passed as const reference where appropriate
  - For-each loop variables declared as const reference where appropriate
- size_t used in lieu of int where appropriate
- Avoid using namespace std to avoid naming collisions. Instead, only use specific things, e.g. using std::string
- Proper memory management
  - If you do decide to use new, remember that every call to new must be matched by a call to delete. Remember that non-built-in types can be allocated on the stack (without using ‘new’ in C++)
  - Avoid returning a pointer/reference to memory on the stack in a function (this is called a “dangling pointer” memory error)
  - Initialize built-in types allocated on the stack (ex: int x = 0; rather than int x;) before using them
- Appropriate usage of structs, classes, and namespaces
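Several of the C++ items above can be seen together in one short sketch. The namespace naivebayes and the class ImageSet are hypothetical names chosen for illustration; they are not part of the starter code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

namespace naivebayes {  // hypothetical project namespace

// Hypothetical class that owns a set of binary images.
class ImageSet {
 public:
  // Takes ownership of the images. No `new` is needed: the vector manages
  // its own memory and releases it when the ImageSet is destroyed.
  explicit ImageSet(std::vector<std::vector<bool>> images)
      : images_(std::move(images)) {}

  // Marked const because it only reads member data.
  size_t CountShadedPixels() const {
    size_t shaded = 0;  // built-in type initialized before use
    // const reference: each image is read in place instead of being copied.
    for (const std::vector<bool>& image : images_) {
      for (bool pixel : image) {  // built-in types are cheap to copy
        if (pixel) {
          ++shaded;
        }
      }
    }
    return shaded;
  }

 private:
  std::vector<std::vector<bool>> images_;
};

}  // namespace naivebayes
```

Because CountShadedPixels is const, it can be called on a const ImageSet or through a const reference, which is what lets other functions accept the class as a const reference parameter.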
Readability and flexibility of code
- Modularity: each method should perform one distinct task
- It should be easy to read through each method, follow its control flow, and verify its correctness
- The code should be flexible/ready for change (no magic numbers, no violations of DRY)
Object decomposition
- Member variables stored by each class
  - Classes should store their data using the most intuitive data structure
  - No "missing" member variables
  - No member variables which should be local variables
  - No redundancy / storing multiple copies of the same data in different formats
- Encapsulation
  - Appropriate access modifiers
  - Member variables should generally only be modified by member functions in the same class
  - The interface of a class should be intuitive/abstract, and external code should only interact with the class via the interface
    - By intuitive, we mean that it should be easy to understand and use the class, and there shouldn’t be any hidden assumptions about how the class should be used
    - By abstract, we mean that an external client shouldn’t need to worry about the internal details of the class
  - No unnecessary getters/setters (exception: you may make getters and setters for testing purposes)
Documentation
- Specifications
  - Specifications are required for all functions which are part of the public interface of a class
  - Specifications should precisely describe the inputs and outputs of a function, and should also describe what the function does (e.g. mutating state of object)
  - Specifications should also be formatted properly
- Inline comments should not describe things which are obvious from the code, and should describe things which need clarification
Naming
- Semantics: names should effectively describe the entities they represent; they should be unambiguous and leave no potential for misinterpretation. However, they should not be too verbose.
- Style: names should follow the Google C++ Style Guide
Layout
- Spacing should be readable and consistent; your code should look professional
- Vertical whitespace should be meaningful
  - Vertical whitespace can help create paragraphs
  - Having 2+ empty lines in a row, or empty lines at the beginning or end of files, is usually a waste of space and looks inconsistent
- Horizontal whitespace should be present where required by the Google Style Guide
- Lines should all be under 100 characters; no horizontal scrolling should be necessary
Testing
- You should make sure all classes of inputs and outputs are tested.
- Boundary/edge cases often cause different/unexpected behavior, and thus, they should be tested
- Your tests should cover all of the functionality that you’ve implemented. In other words, every line of code should be exercised by some test case, unless the assignment documentation says otherwise
- You should be testing for correctness. Testing whether your model has an accuracy over a certain threshold does not guarantee that your model is correct.
- Test every non-trivial public method with all possible classes of input
- Each individual test case should only serve one coherent purpose. Individual test cases should not have assertions testing unrelated things
- Your tests, like your code, should be organized and easy to understand. This includes:
  - Easy to verify thoroughness / all possibilities covered
  - Easy to verify the correctness of each test case
  - Clear categories of test cases, where similar tests are grouped together
  - Test case descriptions make the purpose of each test case clear
- Appropriate usage of SECTION and TEST_CASE to organize your code
Process
- Commit modularity
  - Code should be checked-in periodically/progressively in logical chunks
  - Unrelated changes should not be bundled in the same commit
  - Do not treat commits as save points
- Commit messages
  - Should concisely and accurately describe the changes made
  - Should have a consistent style and look professional
  - First word of the message should be a verb, and it should be capitalized
  - Commit message header should be no longer than 50 characters; in general, if you find the need to use “and” to group multiple unrelated descriptions together, you should break your commit message up
Presentation
- Arrived on time with all necessary materials and ready to go
- Good selection of topics to focus on
- Logical order of presentation
- Appropriate pacing and engagement of the fellow students
- Speaking loud enough and enunciating clearly
Participation
- Each student should contribute at least one meaningful comment or question for every other student who presents in their code review
- Students must behave respectfully to the moderator and other students
Weightings
Your grades for each section of the rubric will be weighted as follows:
- C++ (10%)
- Readability and flexibility of code (15%)
- Object decomposition (20%)
- Documentation (7.5%)
- Naming (5%)
- Layout (5%)
- Testing (20%)
- Process (7.5%)
- Presentation (5%)
- Participation (5%)