Bold Bloom Filters

Due: Apr 28, 23:59 PM

Learning Objectives

  • Implement the bloom filter using C++ built-in data structures
  • Understand bitvectors and fundamental bitvector operations
  • Measure the accuracy of a bloom filter over its multi-dimensional parameter space

Assignment Description

In this lab we will be implementing bloom filters and building up our understanding of probabilistic data structures as well as fundamental bit operations.

Lab Insight

Bloom filters are a great example of a probablistic data structure – a data structure whose accuracy is not guaranteed but is usually bounded in some meaningful way (or one-sided in error). Additionally data structures in this class generally have great performance with speeds or overall efficiency that are hard to beat. Among the data structures in this classification, the bloom filter is the most straightforward as it directly extends from what we now know about hashing.

In addition to the basics of implementation (which you are coding up here), in class we also discussed that the optimally accurate bloom filter can be determined if you know any two of the following three properties: (1) Size of bloom filter (2) Number of hashes (3) Expected number of items. You will explore that relationship by implementing an integer-hash bloom filter with variable hashes, sizes, and input items and observing how the filter compares to the underlying ground truth.

Checking Out the Code

All assignments will be distributed via our release repo on github this semester. You will need to have set up your git directory to have our release as a remote repo as described in our git set up.

You can merge the assignments as they are released into your personal repo with

git pull --no-edit --no-rebase release main --allow-unrelated-histories
git push

The first git command will fetch and merge changes from the main branch on your remote repository named release into your personal. The --no-edit flag automatically generates a commit message for you, and the--no-rebase flag will merge the upstream branch into the current branch. Generally, these two flags shouldn’t be used, but are included for ease of merging assignments into your repo.

The second command will push to origin (your personal), which will allow it to track the new changes from release.

You will need to run these commands for every assignment that is released.

All the files for this lab are in the lab_bloom directory.

Part 1: The Bloom Filter Class

As every provided hash function takes an integer as input (and returns an integer as output), unlike most the assignments in this class our bloom filter is not templated! For the first part of this assignment, your task is to finish implementing this integer-hash bloom filter. The specific functions you should fill in are:

  • BF::BF() (Both constructor and copy constructor)
  • BF::add()
  • BF::contains()
  • BF::bit_union()
  • BF::intersect()

To help you debug the assignment, the BF class has an override to print the content of a bloom filter when passed to cout as well as five ‘hash’ functions of varying performance so you can explore the space of possible implementations. While you should read the doxygen to understand the specific details of each function, some tips to help you out are below.

  1. Remember that functions can be stored as variables and passed as function arguments! In this assignment we actually go one step further and pass a vector of hashFunctions, which we have type defined for you in bloom.h to be any function that takes in an integer as input and returns an integer as output.

  2. Don’t overthink add() or contains()! If you understand how a bloom filter handles collisions and the probabilistic errors possible in a BF, they should be very straightforward.

  3. The lab slides briefly review what a bit vector is and the fundamentals of bitwise union and intersection. Depending on your background you may find it easier to consider them as logical gates or the basic connectives of predicate logic. As a simple truth table:

Bit A Bit B Union(A, B) Intersection(A, B)
0 0 0 0
1 0 1 0
0 1 1 0
1 1 1 1

Part 2: False Positive Rate Experiment

Once you have a functional BF class, its time to evaluate how the accuracy in a BF is a function of its size, the number of hashes, and the number of items inserted. As the bloom filter has one-sided error, the only mistakes it can make are when it identifies an item as being ‘present’ when it was not actually inserted into the bloom filter. Accordingly, the only measure of accuracy we need to track is the False Positive Rate. For a refresher on what that means conceptually, see the following table:

  Labeled True Labeled False
Actual True True Positive (TP) False Negative (FN)
Actual False False Positive (FP) True Negative (TN)

As seen in class (and re-iterated in the slides), \(FPR=\frac{FP}{(FP+TN)}\). From a coding point of view, we can estimate this by keeping track of the raw number of false positives and the raw number of true negatives. Then it is a simple matter of querying every number and tracking how many numbers not in our set were identified as being part of our set (FP) as well as the number of items we correctly identify as not present (TN). However in order to do this correctly, we need one additional piece of information – the size of the universe! Given this information, you can query the full range and get the correct estimate.

Warning: You should always assume the universe starts from 0 and ends at (but not including) the universe max. In other words, the universe parameter defines the size of the universe, not the final item in the universe.

Ex: A universe of 100 is the items 0 through 99.

The specific functions you should fill in are:

  • measureFPR()

Part 3: Bit Mask Tutorial

When using vector<bool>, you can use standard array indexing [] to look up the value of any bit. But when using actual bit vectors, it is much harder to access a single bit. Here you are tasked with looking up specific bits in a vector<char>. For the purposes of the lab, you will first implement this for a single character and then extend that logic to work for an array of characters:

  • getBitFromByte()
  • getBitFromArray()

Figuring out how to actually accomplish this is the point of the lab exercise, but as a hint you should be familiar with the fundamental bit-wise operators discussed in class. A brief summary of them can be found in the lecture slides.

Part 4: (Optional) Data Exploration

Given the completed functions in parts 1 and 2, you have the ability to get the FPR for any combination of hashes, bit vector sizes, and input items (as well as the size of the universe). You are encouraged to play around with different parameter sets to see if the empirically derived accuracy has an optimal accuracy that matches the equation we saw in class with k hashes, m bits, and n items inserted:

\[k = ln(2)\frac{m}{n}\]

If the accuracy you are seeing doesnt match, why do you think this is? What might you change to restore the optimality?

Submitting Your Work

The following files are used in grading:

  • bloom.cpp
  • bloom.h

All other files including any testing files you have added will not be used for grading.