Learning Objectives

Practice practical hashing in Python
Use set similarity metrics for real world data
Explore sketching techniques like the Minhash Sketch and Bloom Filter

Getting set up

While most of the mini-project is autograded (and should be submitted like any regular assignment), Prairielearn does not have the physical memory or speed to do meaningful data science. Accordingly, when you submit part 2 of this assignment, you will do so by saving your dataset, your modified code, and your output to a github repository.

Guide: How to setup and use Github in CS 277

For part 1, you will only need to submit the following on Prairielearn:

sketching.ipynb

You will only need to edit the following functions:

minHash()
bloomFilter()
jaccard()
bf_jaccard()
bf_find()

For part 2, you will need to submit:

A short proposal detailing your dataset and expected output tables
A link to your git repo

Assignment Description

Hashing (and hash functions) are some of the most useful tools in computer science. They allow for very fast and efficient data storage structures and are simple to code and use (as long as you are not writing your own hash function!). They are also a key component in an entire other field of data structures – that of lossy storage structures. Here the hash function is used as a way to systematically reduce the input dataset, producing a ‘sketch’ of the original dataset. These ‘sketches’ are non-reversible reductions of the original dataset; in other words they do not store the full dataset and cannot be used to recover the full dataset. However they store an accurate approximation of the original data and can be used (with some loss of accuracy) as a way of estimating things like set similarity or the presence or absence of a specific value.

In this mini-project, you will create two sketches: the Minhash sketch and the Bloom Filter and compare how they can be used to estimate the similarity between collections of sets. (Accordingly your dataset for this mini-project must be a group of objects that can each be defined by a set of overlapping values)

NOTE: This assignment requires two new packages to run and uses a non-standard workspace with both the mmh3 hash hash function package and the BitVector package installed. If you want to use your code locally, you must install both by running the following commands in your terminal / command prompt:

pip3 install mmh3
pip3 install BitVector

Please be sure you are using an updated version of Python when using this install! BitVector requires Python 3.9 or higher.

Part 1: Building sketches with mmh3 (45 Points)

The first part of the assignment is implementing the two required sketches. To do this, use the provided seededHash() function:

#INPUT:
# A string value storing the value being hashed
# An optional argument seed, which sets the random seed used to generate the hash function
#OUTPUT:
# An integer corresponding to the hash of the input string 'value'
int seededHash(string value, int seed=None):

When testing your code (or the hash function) you will likely not need to ever fill in the ‘seed’ value. However when coding both minHash() and bloomFilter() make sure you pass along the optional seed argument. If you don’t, you will not pass the autograder!

For those interested, by providing a seed to the hash function we ensure that every time we run the code it will use the same hash and return the same values. Hashes such as Python’s built-in hash (hash()) don’t have this property. To be clear, Python will generate a new hash every time you run it but that hash will still be deterministic for a single run (otherwise your built-in dictionaries and sets which use hash() wouldnt work!). However it makes providing meaningful autograding feedback using such built-ins basically impossible as you will never be able to exactly reproduce the hash functions the autograder is using.

NOTE: Most of your functions will use seededHash and are described as taking in ‘hashable values’. All of the tests used by the autograder will use strings however you may choose to do some other hashable type for part 2. As long as your data type works for mmh3, this code should work out of the box for your individual data science projects.

minHash()

#INPUT:
# A set (inSet) storing the collection of hashable values that define a single object
# An integer (k) storing the number of hash minima you need to keep track of
# An optional argument (seed) which sets the random seed used to generate the hash function
#OUTPUT:
# A set of integers storing the k-minimum hash values produced by passing each item in inSet through seededHash
set(int) minHash(set(string) inSet, int k, int seed=None):

A k-minhash sketch is constructed by hashing every object in an input set and storing the minimum k elements. For example:

Input: minHash(set(["A", "B", "C", "D", "E", "F", "G", "Z"]), k=2, seed=10)

Output: {-2073038243, -1959351497}

This is because for seed=10, the hash values for each string is as follows:

A -1428171452
E -1313617005
F 1511673451
D -399054952
G 1249406318
Z -2073038243
B -1959351497
C -904532356

So the minimum two hash values is our sketch of the original input set.

HINT: A good implementation of minhash would only use the memory required to track the k-minimum values. However you are not required to be efficient, merely correct. Use whatever logic (or Python built-ins) makes sense to you to keep track of and return the minimum values.

bloomFilter()

#INPUT:
# A set (inSet) storing the collection of hashable values that define a single object
# An integer (m) storing the size of the bloom filter being constructed
# Two optional integer arguments (hs1 and hs2) storing hash seeds for the BF's first and second hash
#OUTPUT:
# A BitVector of size m containing all input items hashed using both hs1 and hs2 seeds
BitVector bloomFilter(set(string) inSet, int m, int hs1=0, int hs2=10):

A bloom filter is constructed from an input set by hashing each object in the input set. For example:

Input: bf1 = bloomFilter(set(["1", "2", "3", "4", "5", "6", "7", "8"]), 50, 3, 7)

Output: 00010010000000000000010000001100010111010000101101

Note that we inserted 8 objects using two ‘hashes’ (different seeds lead to different hash functions) but there are only 14 1’s in the bloom filter – two of our 16 total insertions were hash collisions with a previously inserted item. That’s okay (as is the fact that we don’t know exactly how each item hashed by looking at it)!

HINT: The BitVector data structure is very user friendly and has most of the same functionalities as a Python list. That is to say, you can get its length, look up (and set values at) individual indices, and many other useful functions using identical methods that you’ve already seen in a list. It also is much smaller than the equivalent list would be – very much an authentic bloom filter in Python!

jaccard()

#INPUT:
# Two input sets (inSet1 and inSet2) containing sets of the same hashable data type
#OUTPUT:
# A float value between 0 and 1 corresponding to the Jaccard similarity between input sets
# NOTE: You can use this function to calculate the exact similarity between raw datasets
# and also the Minhash similarity (since a Minhash sketch is *literally* just a set of integers)
float jaccard(set(string) inSet1, set(string) inSet2):

Input: sim = jaccard(set(["1", "2", "3", "4", "5", "6", "7", "8"]), \
 set(["1", "2", "3", "4", "9"]))

Output: 0.4444444444444444

Mathematically this can also be found by noting that there are four objects intersecting and 9 objects in the union (or 4/9).

NOTE: This function will work for all sets, including those produced by your Minhash functions.

bf_jaccard()

#INPUT:
# Two input BitVectors (bf1 and bf2) containing two equal length bloom filters
#OUTPUT:
# A float value between 0 and 1 corresponding to the Jaccard similarity between input sets
float bf_jaccard(BitVector bf1, BitVector bf2):

The Jaccard coefficient (Jaccard similarity) is calculated by finding the total number of items in the intersection of two sets and dividing that number by the total number of items in the union of two sets. When it relates to a bloom filter, there is no way of recovering the original set values used to construct it. Instead you can think of the bloom filter as a set of indices where an index i is in the ‘bloom filter set’ if the bit at i is set to one.

Using this interpretation, a bloom filter version of Jaccard is possible: the intersection count is the number of bits in which both bf1 and bf2 have values of 1 and the union can be estimated by the combined total number of 1 bits that exist in either filter.

Input: 
bv1 = bv.BitVector(bitstring = '10101')
bv2 = bv.BitVector(bitstring = '01110')     
sim = bf_jaccard(bv1, bv2)

Output: 0.2

Input: 
bv1 = bv.BitVector(bitstring = '00101')
bv2 = bv.BitVector(bitstring = '00110')     
sim = bf_jaccard(bv1, bv2)

Output: 0.3333333...

The top example there are five total one bits in the union filter (11111).

The bottom example there are only three total one bits in the union filter (00111).

In both there is only a single bit in the intersection (00100).

bf_find()

#INPUT:
# A BitVector object bf, containing a bloom filter
# A hashable data type (probably string) val, containing the value being searched for
# Two optional integer arguments storing hash seeds for the BF's first and second hash
#OUTPUT:
# A bool (True or False) based on whether or not val is present in BF.
# NOTE: The Bloom Filter is probabilistic and some of the time will return 'True' 
# even if the item is not present.
bool bf_find(BitVector bf, string val, int hs1=0, int hs2=10):

As you will undoubtedly see, the bloom filter is a significantly less accurate measure of overall similarity than the Minhash sketch. This makes sense since the bloom filter is designed to enable search for specific items in a dataset not for direct comparisons between sets. Accordingly while this function won’t have much of a role in your actual data analysis, write a short function that can actually do bloom filter lookup. As a reminder, this is an O(1) operation consisting of hashing some input value with each hash in the bloom filter and checking to see if every one of those hash locations are set to one.

REMINDER: The bloom filter is probabilistic! You should make sure that your find will always return True if you added an object to a dataset but you may return True even if the item was not actually inserted. Test your function using the autograder or small scale bloom filters and simple hash functions so you can manually work out when you may encounter an error.

Input: 
bv1 = bv.BitVector(bitstring = '10101')

Output:
print(bf_find(bv1, "4")) #returns False
print(bf_find(bv1, "5")) #returns True

Part 2: Set (the data type) Dataset Exploration (30 points)

If you complete part 1, congratulations! You now have a Python script capable of creating meaningful sketches of very large input datasets. These scripts are perfect for comparing very complex objects that can be represented by a collection of values!

Your next task is to find such a dataset and use your existing code base to visualize it. For this mini-project, you will need to find multiple datasets (or one dataset with clear partitionings) and directly compare the similarity between meaningful objects. Your visualization for this mini-project must be three tables containing all possible pairwise combinations of similarity found using raw set similarity, minhash similarity, and bloom filter similarity.

Find and describe a dataset

To describe a dataset for this mini-project, you must provide the following information as directly and succinctly as possible:

Dataset Link: Where did you find your dataset? Provide a direct link to its source but do not give the download link directly. (The website you link should ideally include some information about the dataset you have selected). Your dataset must contain at least five unique objects, clearly defined and meaningfully distinct.

Dataset Format: What is the format of your input file? At minimum, you must explain what the features or values are for each object. There should be some overlap in features between objects (don’t submit five objects with 0 similarity).

Dataset Visualization: You can choose how to display your tables as long as they have clearly labeled rows and columns and directly show the similarity value for every possible pair in your dataset. Consider carefully what values of k and m you plan to use for your final data analysis. You will need to justify it!

A quick example:

Link: I plan to visualize the similarity between the top 5 books downloaded yesterday on Project Gutenberg as of Monday February 20th. They are:

Romeo and Juliet (https://www.gutenberg.org/ebooks/1513)
A Room with a View (https://www.gutenberg.org/ebooks/2641)
Middlemarch (https://www.gutenberg.org/ebooks/145)
Little Women; Or, Meg, Jo, Beth, and Amy (https://www.gutenberg.org/ebooks/37106)
The Enchanted April (https://www.gutenberg.org/ebooks/16389)

Format: I will treat each book as its own unique object where the collection of words in the book are the ‘features’ I will be comparing. The download links have several optional formats for downloading; I will be using plaintext format for easy parsing of words (using all whitespace characters are seperators). I will count the total number of words in each book to estimate the size of bloom filters and I will try five different values of k for minhash sketching (5, 50, 100, 500, and 1000).

Visualization: I don’t have to describe a visualization for this mini-project but I will visualize this using a LaTeX table and saving as a PDF because that’s how I make manual tables that are human-readable.

Submit your data visualization (and code)

While the code you produced in part 1 may very well work without any modifications for your chosen datasets, you will likely have to write additional code to parse your input data and visualize your output. With that in mind, you will need to submit your full processing pipeline (including a copy of your own implementation of part 1) as well as your final visualization separately. WARNING: Be very careful not to modify your submission to part 1 when completing part 2! Even minor changes to the functions can break the autograder and cost you significant points.

Your final visualization should be one PDF containing the Jaccard similarity metric comparisons in three ways (raw data, Minhash, and Bloom Filter) organized as tables. Each table must be clearly labeled with a title and every entry in the table (both row and column) must be labeled and human-readable.

Your final code submission should be directly runnable so be sure your github repo includes either your full dataset or a small subset of your dataset (if the full dataset is too large for github).

Finally, you must submit a short justification for your choice in values for k (the number of minima to keep track of for Minhash) and m (the number of bits in your bit vector). This justification can be experimentally derived (you must provide evidence of how your accuracy changes as you adjust both k and m) or can be mathematically / logically described.

mp_sketching

Stunningly Stylish Sketches