
Naive Bayes Week 2

Assigned: 2022-03-30
Due Date: 2022-04-05 by 11:59PM U.S. Central Time

Goals

Background

In this assignment you will refactor and improve your code based on feedback given to you in code review. You will implement the classification portion of this assignment on top of your existing code for training, loading, and saving a model.

Next, you will install Cinder and learn how to use this library. We have provided an implementation of a sketchpad using Cinder, and we expect you to fully understand our code.

Finally, you will connect our sketchpad to your implementation of Naive Bayes so that a digit from 0 to 9 drawn on the sketchpad can be classified.

Gradescope

You will need to submit your GitHub repository to Gradescope. There is a linter which is run on each submission; the results of the linter will be viewable by your code moderator when grading.

Getting Started

You will develop this assignment in the same repository as last week, using your existing code. This assignment will use the Cinder framework -- you should have set this up in your Cinder directory, but if not, make sure to do so before starting this week's assignment. Although we will be providing you Cinder starter code, we suggest taking a look at this Cinder tutorial and the Cinder documentation if you need a refresher.

You should make a branch called week2 on top of your week1 branch (ex: git checkout -b week2 week1) and only commit to this branch. Nothing should be committed to master. When you’re finished with week2, you can open a pull request to request that your code be merged into master, and this pull request will be reviewed by your code moderator.

The Data Files

For this assignment you will need a set of pre-labeled data files that we will use for validating data, which you can download here.

The .zip file contains a few text files:

Like last week, we expect you not to commit these files to your git repository. Make sure they are listed in your .gitignore. If you accidentally committed them to your repo, you should remove them with git rm -r --cached <filename>.

The specification of the content inside the images and labels files is the same as the one provided last week.

Part 0 - Testing

Similar to last week, you should test the mathematical correctness of your classifier by using a small dataset where you can cross-check the answers by hand. This includes the likelihood scores and the actual prediction made by the classifier.

Remember that we want to test for mathematical correctness; for example, the following test is not sufficient:

REQUIRE(accuracy >= 0.7);

Testing whether the accuracy of your classifier is over 70% works as a sanity check, but you need to test whether the math behind it is correct -- for example, there might be a bug that mixes up the classifications for classes 0 and 9 but classifies everything else correctly.
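As a sketch of what a hand-checkable test could look like, the block below works through a toy dataset of three 1x1 images. It assumes binary (shaded/unshaded) pixels and Laplace smoothing with k = 1. The helper names SmoothedProbability, ApproxEqual, and ToyScoreForShadedImage are hypothetical and not part of the assignment, and your own model's formulas may differ, so treat the numbers as an illustration rather than as expected values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: Laplace-smoothed estimate
//   (k + count) / (k * options + total)
// where `options` is the number of possible outcomes being smoothed over.
double SmoothedProbability(std::size_t count, std::size_t total,
                           std::size_t options, double k) {
  assert(options > 0 && count <= total);
  return (k + static_cast<double>(count)) /
         (k * static_cast<double>(options) + static_cast<double>(total));
}

// Floating-point results are compared with a tolerance rather than ==.
bool ApproxEqual(double a, double b, double epsilon = 1e-9) {
  return std::fabs(a - b) < epsilon;
}

// Toy training set of 1x1 images (assumed, for illustration only):
//   label 0: one shaded image, one unshaded image
//   label 1: one shaded image
// Returns the log-likelihood score of a shaded test image for `label`.
double ToyScoreForShadedImage(int label) {
  const double k = 1.0;
  if (label == 0) {
    const double prior = SmoothedProbability(2, 3, 2, k);   // (1+2)/(2+3) = 3/5
    const double shaded = SmoothedProbability(1, 2, 2, k);  // (1+1)/(2+2) = 1/2
    return std::log(prior) + std::log(shaded);
  }
  const double prior = SmoothedProbability(1, 3, 2, k);   // (1+1)/(2+3) = 2/5
  const double shaded = SmoothedProbability(1, 1, 2, k);  // (1+1)/(2+1) = 2/3
  return std::log(prior) + std::log(shaded);
}
```

By hand, the score for class 0 is log(3/5 * 1/2) = log(0.3) and the score for class 1 is log(2/5 * 2/3) = log(4/15), so a shaded image should be classified as 0. In a Catch2 test, each of these values would get its own check, e.g. REQUIRE(ApproxEqual(ToyScoreForShadedImage(0), std::log(0.3)));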

Part 1 - Improving your Code

Review the feedback on your code for the previous week in Gradescope and implement the changes that your moderator suggested. You will be evaluated on improvement this week, so make sure to click through each section of the Gradescope rubric to see comments specific to each topic and to implement verbal feedback from code review.

Part 2 - Classification

For the math behind this portion of the assignment, please refer to this document.
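The linked document is the authoritative source for the math. Purely as an illustration of the usual shape of the computation, the sketch below assumes binary pixels and a per-class model that holds a prior and a shaded probability for each pixel. It sums log probabilities into a likelihood score and picks the class with the highest score. ClassModel, LikelihoodScore, and Classify are hypothetical names, not an interface the assignment requires.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified model for one class.
struct ClassModel {
  double prior;                              // P(class = c)
  std::vector<double> shaded_probabilities;  // P(pixel i is shaded | class = c)
};

// Sums logs instead of multiplying probabilities to avoid underflow.
double LikelihoodScore(const ClassModel& model, const std::vector<bool>& pixels) {
  if (pixels.size() != model.shaded_probabilities.size()) {
    throw std::invalid_argument("Image size does not match the model");
  }
  double score = std::log(model.prior);
  for (std::size_t i = 0; i < pixels.size(); ++i) {
    const double p = model.shaded_probabilities[i];
    // Smoothing is assumed to keep every probability strictly inside (0, 1).
    assert(p > 0.0 && p < 1.0);
    score += std::log(pixels[i] ? p : 1.0 - p);
  }
  return score;
}

// Returns the index of the class with the highest likelihood score.
std::size_t Classify(const std::vector<ClassModel>& models,
                     const std::vector<bool>& pixels) {
  if (models.empty()) {
    throw std::invalid_argument("Model has no classes");
  }
  std::size_t best_class = 0;
  double best_score = LikelihoodScore(models[0], pixels);
  for (std::size_t c = 1; c < models.size(); ++c) {
    const double score = LikelihoodScore(models[c], pixels);
    if (score > best_score) {
      best_score = score;
      best_class = c;
    }
  }
  return best_class;
}
```

Returning the scores alongside the predicted class makes it easier to test the intermediate values as well as the final prediction.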

Deliverables (the deliverables for this part of the assignment are also listed in the blue box at the end of section 2.3 in the above document):

Part 3 - Validation

What good is a classifier if you don't know how accurate it is? We've given you a set of images and labels inside the .zip file, which you should use to validate your model’s accuracy. These files follow the same format as the training images and labels described in week 1’s documentation, and you should be able to parse them in a similar manner. Remember that we do not want to test our model on the images and labels it was trained on, which is why we provided you with two new files for testing.
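As a minimal sketch of the accuracy computation, assuming the predicted and actual labels have already been collected into two vectors of equal length (ComputeAccuracy is a hypothetical name):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Fraction of validation images whose predicted label matches the true label.
double ComputeAccuracy(const std::vector<int>& predicted,
                       const std::vector<int>& actual) {
  if (actual.empty() || predicted.size() != actual.size()) {
    throw std::invalid_argument("Label lists must be non-empty and equal in size");
  }
  std::size_t correct = 0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (predicted[i] == actual[i]) {
      ++correct;
    }
  }
  assert(correct <= actual.size());
  return static_cast<double>(correct) / static_cast<double>(actual.size());
}
```

For example, three correct predictions out of four give an accuracy of 0.75. A single overall number can hide systematic mix-ups between two classes, so a per-class breakdown may also be worth reporting.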

You should incorporate the following functionality into your existing project:

Remember to make your code flexible, so it should be easy to change/modify the following:

Part 4 - Visualization

Finally, you will want to see your classifier in action as it performs real-time classification of sketches using Cinder. We have provided you with some starter code, but it is your job to fill in the blanks and get the application up and running. When you are finished, you should be able to draw an image in the sketchpad and classify it by pressing the enter key. You can clear the drawn image by pressing the delete key (or FN-Delete if you have a Mac). Don’t worry if the sketchpad doesn’t classify all the sketches correctly. After all, Naive Bayes is a pretty naive model that makes some sketch-y assumptions. If you aren’t happy with the performance, it might be a cool final project idea to implement a more sophisticated machine learning algorithm!

Command line arguments - More EC

Last week, you may have implemented command line argument parsing for extra credit. This week, you can extend it to let the user choose whether to test their model.

Note that this means that the command line parsing you implemented last week should still work and that you must consider logical combinations of the options provided: for example, you would need to handle the following cases (note that this isn’t a complete list of all the cases you must handle):
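The actual flag names depend on what you implemented last week. As one possible sketch of validating combinations of options, the block below uses four hypothetical flags (--train, --load, --save, --test), each followed by a file path, and rejects combinations that make no sense, such as testing without any model. Options and ParseOptions are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical options; an empty path means the flag was not given.
struct Options {
  std::string train_path;  // --train <path>: train a model from this data
  std::string load_path;   // --load <path>: load a saved model
  std::string save_path;   // --save <path>: save the model
  std::string test_path;   // --test <path>: validate the model on this data
};

// Parses "flag value" pairs and checks that the combination is logical.
Options ParseOptions(const std::vector<std::string>& args) {
  Options options;
  for (std::size_t i = 0; i < args.size(); i += 2) {
    const std::string& flag = args[i];
    if (i + 1 >= args.size()) {
      throw std::invalid_argument("Missing value for " + flag);
    }
    const std::string& value = args[i + 1];
    if (flag == "--train") {
      options.train_path = value;
    } else if (flag == "--load") {
      options.load_path = value;
    } else if (flag == "--save") {
      options.save_path = value;
    } else if (flag == "--test") {
      options.test_path = value;
    } else {
      throw std::invalid_argument("Unknown flag: " + flag);
    }
  }

  const bool trains = !options.train_path.empty();
  const bool loads = !options.load_path.empty();
  if (trains && loads) {
    throw std::invalid_argument("Choose either --train or --load, not both");
  }
  if (!trains && !loads) {
    throw std::invalid_argument("A model is required: pass --train or --load");
  }
  assert(trains != loads);  // exactly one source for the model
  return options;
}
```

From main, the arguments could be collected with std::vector<std::string> args(argv + 1, argv + argc); before calling ParseOptions(args). Keeping parsing separate from training and testing makes each combination straightforward to unit test.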

Other Extra Credit Opportunities

Important: you should focus on finishing your assignment before working on any of the suggested extra credit features. Furthermore, try to gauge the amount of time you have left: if you start an extra credit feature, you should finish it. Any non-trivial enhancements and/or additional classification algorithms will be awarded extra credit. Here are some ideas that might interest you (if you already implemented one of these last week, you will need to implement a different one):

Feel free to use Machine Learning libraries for the extra credit portion.

Grading and Deliverables

Hints

It might help to review the workshops conducted on this assignment to become familiar with the Cinder framework. Also, be sure to review the documents hyperlinked in this documentation; they contain important information that will be required for the assignment.

Assignment Rubric

This rubric is not a comprehensive checklist. Please make sure to go over the feedback you received on your previous MPs and ask your moderator/post on Campuswire if you have any questions.

Similar to API Adventures, we expect you to take your feedback from Naive Bayes Part 1 into account -- you will lose points for not changing your code in accordance with your moderator’s feedback.


C++

  • Headers have inclusion guards (#ifndef or #pragma once)
  • All method specifications are in the header file, and operational line comments are left in the source file.
  • Code has appropriate namespacing
  • Usage of const and references
    • Instance methods marked as const where appropriate
    • Parameters passed as const reference where appropriate
    • For-each loop variables declared as const reference where appropriate
  • size_t used in lieu of int where appropriate
  • Do not write using namespace std, since it can cause naming collisions. Instead, import only the specific names you need, e.g. using std::string
  • Proper memory management
    • If you do decide to use new, remember that every call to new must be matched by a call to delete. Remember that non-built-in types can be allocated on the stack (without using ‘new’ in C++)
    • Avoid returning a pointer/reference to memory on the stack in a function (this is called a “dangling pointer” memory error)
    • Initialize built-in types allocated on the stack (ex: int x = 0; rather than int x;) before using them
  • Appropriate usage of structs, classes, and namespaces
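Several of the points above can be seen together in a short sketch. The TextImage class and CountInSample function below are hypothetical and exist only to illustrate the conventions; in a real project the class declaration would sit in a header with an inclusion guard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace naivebayes {

// Import only the specific names that are needed.
using std::size_t;
using std::string;
using std::vector;

class TextImage {
 public:
  // Parameter passed by const reference: no copy at the call site.
  explicit TextImage(const vector<string>& rows) : rows_(rows) {
    assert(!rows_.empty());  // an image needs at least one row
  }

  // Marked const because it does not modify the object.
  size_t CountNonSpace() const {
    size_t count = 0;  // built-in type initialized before use
    for (const string& row : rows_) {  // const reference avoids copying rows
      for (char c : row) {
        if (c != ' ') {
          ++count;
        }
      }
    }
    return count;
  }

 private:
  vector<string> rows_;
};

// Objects are allocated on the stack, so no new/delete is needed.
size_t CountInSample() {
  const vector<string> rows = {" # ", "###"};
  const TextImage image(rows);
  return image.CountNonSpace();
}

}  // namespace naivebayes
```

The result is returned by value, so nothing refers to stack memory after the function returns.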

Readability and flexibility of code

  • Modularity: each method should perform one distinct task
  • It should be easy to read through each method, follow its control flow, and verify its correctness
  • The code should be flexible/ready for change (no magic numbers, no violations of DRY)

Object decomposition

  • Member variables stored by each class
    • Classes should store their data using the most intuitive data structure
    • No "missing" member variables
    • No member variables which should be local variables
    • No redundancy / storing multiple copies of the same data in different formats
  • Encapsulation
    • Appropriate access modifiers
    • Member variables should generally only be modified by member functions in the same class
    • The interface of a class should be intuitive/abstract, and external code should only interact with the class via the interface
      • By intuitive, we mean that it should be easy to understand and use the class, and there shouldn’t be any hidden assumptions about how the class should be used
      • By abstract, we mean that an external client shouldn’t need to worry about the internal details of the class
  • No unnecessary getters/setters (exception: you may make getters and setters for testing purposes)

Documentation

  • Specifications
    • Specifications are required for all functions which are part of the public interface of a class
    • Specifications should precisely describe the inputs and outputs of a function, and should also describe what the function does (e.g. mutating state of object)
    • Specifications should also be formatted properly
  • Inline comments should not describe things which are obvious from the code, and should describe things which need clarification

Naming

  • Semantics: names should effectively describe the entities they represent; they should be unambiguous and leave no potential for misinterpretation. However, they should not be too verbose.
  • Style: names should follow the Google C++ Style Guide

Layout

  • Spacing should be readable and consistent; your code should look professional
  • Vertical whitespace should be meaningful
    • Vertical whitespace can help create paragraphs
    • Having 2+ empty lines in a row, or empty lines at the beginning or end of files, is usually a waste of space and looks inconsistent
  • Horizontal whitespace should be present where required by the Google Style Guide
  • Lines should all be under 100 characters; no horizontal scrolling should be necessary

Testing

  • You should make sure all classes of inputs and outputs are tested.
  • Boundary/edge cases often cause different/unexpected behavior, and thus, they should be tested
  • Your tests should cover all of the functionality that you’ve implemented. In other words, every line of code should be exercised by some test case, unless the assignment documentation says otherwise
    • You should be testing for correctness. Testing whether your model has an accuracy over a certain threshold does not guarantee that your model is correct.
    • Test every non-trivial public method with all possible classes of input
  • Each individual test case should only serve one coherent purpose. Individual test cases should not have assertions testing unrelated things
  • Your tests, like your code, should be organized and easy to understand. This includes:
    • Easy to verify thoroughness / all possibilities covered
    • Easy to verify the correctness of each test case
    • Clear categories of test cases, where similar tests are grouped together
  • Test case descriptions make the purpose of each test case clear
  • Appropriate usage of SECTION and TEST_CASE to organize your code

Process

  • Commit modularity
    • Code should be checked-in periodically/progressively in logical chunks
    • Unrelated changes should not be bundled in the same commit
    • Do not treat commits as save points
  • Commit messages
    • Should concisely and accurately describe the changes made
    • Should have a consistent style and look professional
    • First word of the message should be a verb, and it should be capitalized
    • Commit message header should be no longer than 50 characters; in general, if you find the need to use “and” to group multiple unrelated descriptions together, you should break your commit message up

Presentation

  • Arrived on time with all necessary materials and ready to go
  • Good selection of topics to focus on
  • Logical order of presentation
  • Appropriate pacing and engagement of the fellow students
  • Speaking loud enough and enunciating clearly

Participation

  • Each student should contribute at least one meaningful comment or question for every other student who presents in their code review
  • Students must behave respectfully to moderator and other students

Weightings

Your grades for each section of the rubric will be weighted as follows:

  • C++ (10%)
  • Readability and flexibility of code (15%)
  • Object decomposition (20%)
  • Documentation (7.5%)
  • Naming (5%)
  • Layout (5%)
  • Testing (20%)
  • Process (7.5%)
  • Presentation (5%)
  • Participation (5%)