mp_sketching
Supreme sketching
sketching.h File Reference

An exploration of sketching techniques on sets and PNGs. More...

Classes

class  MM
 The MinHash Matrix (MM) class converts PNGs into an indexable matrix of MinHash sketches It is built using BWPixels already implemented for you (treat pixels as luminance). More...
 

Typedefs

typedef uint64_t(* hashFunction) (int)
 

Functions

std::vector< uint64_t > kminhash (std::vector< int > inList, unsigned k, hashFunction h)
 Given a list of integers (possibly with repeated values) and a hash function, return a list containing the k-minimum hash values. More...
 
std::vector< uint64_t > khash_minhash (std::vector< int > inList, std::vector< hashFunction > hv)
 Given a list of integers (possibly with repeated values) and a vector of hash functions, return the min hash value for each hash The values should be stored in order corresponding to the input hash vector hv. More...
 
std::vector< uint64_t > kpartition_minhash (std::vector< int > inList, int part_bits, hashFunction h)
 Given a list of integers (possibly with repeated values), a partition number (an integer number of bits), and a hash function return a list containing the min value for each partition. More...
 
float minhash_jaccard (std::vector< uint64_t > mh1, std::vector< uint64_t > mh2)
 Given two minhash sketches, return the estimate of the Jaccard similarity between the two sketches. More...
 
int minhash_cardinality (std::vector< uint64_t > mh, unsigned k)
 Given a MinHash sketch, estimate the cardinality using the k-th minimum value. More...
 
float exact_jaccard (std::vector< int > raw1, std::vector< int > raw2)
 Given two raw datasets containing a vector of integers, compute the exact Jaccard similarity between datasets. More...
 
int exact_cardinality (std::vector< int > raw)
 Given a raw dataset containing a vector of integers, compute the exact cardinality of this set. More...
 
std::vector< std::tuple< int, int, int > > build_minhash_graph (std::vector< std::string > flist, unsigned numTiles, unsigned k, hashFunction h, float threshold)
 Given a list of PNGs as a vector of files, perform an all-pairs comparison and store the results as a weighted edge list graph The vertices of your graphs are files and the edge weight between each is the number of matching tiles between them Your graph will always be an undirected weighted complete graph (with no self-edges). More...
 

Variables

const uint64_t GLOBAL_MAX_INT = ~(0)
 

Detailed Description

An exploration of sketching techniques on sets and PNGs.

Function Documentation

◆ build_minhash_graph()

std::vector< std::tuple< int, int, int > > build_minhash_graph ( std::vector< std::string >  flist,
unsigned  numTiles,
unsigned  k,
hashFunction  h,
float  threshold 
)

Given a list of PNGs as a vector of files, perform an all-pairs comparison and store the results as a weighted edge list graph The vertices of your graphs are files and the edge weight between each is the number of matching tiles between them Your graph will always be an undirected weighted complete graph (with no self-edges).

NOTE: Because file locations are a pain to autograde (and store), the 'edge list' output is a vector of INTEGERS not strings. Accordingly each file needs to be represented by an integer. To do this, assign a numeric value based on the index in the input list

So the first file in flist will have the label '0' (regardless of what the string really is)

Each tuple should be organized by giving the two vertices first and then the weight between them. Ex: If file 0 and file 1 have a countMatchTiles output of 5, this would be {0, 1, 5}. Depending on your strategy, you may also output {1, 0, 5}

You should NEVER output something like {0, 0, 1}!

Parameters
flistA vector storing a collection of files to be processed into MinHashes and compared
numTilesA vector containing a list of hash functions to use in the BF
kAn unsigned integer describing the number of minima that need to be tracked (and returned)
hAn hash function that takes in an integer and returns an uint64_t
thresholdA float value between 0 and 1. A 'match' between tiles must be equal to or greater than this value.
Returns
A vector of size 3 integer tuples defining an edge list.

◆ exact_cardinality()

int exact_cardinality ( std::vector< int >  raw)

Given a raw dataset containing a vector of integers, compute the exact cardinality of this set.

Your cardinality should be a precise measure of the number of unique values in the dataset.

Parameters
rawA dataset of possibly repeating integers
Returns
The exact count of unique items in the dataset

◆ exact_jaccard()

float exact_jaccard ( std::vector< int >  raw1,
std::vector< int >  raw2 
)

Given two raw datasets containing a vector of integers, compute the exact Jaccard similarity between datasets.

Your similarity metric must ignore duplicates present in the dataset when computing the relevant cardinalities. In other words, treat your input as two sets and do a set comparison.

Parameters
raw1The first dataset being compared
raw2The second dataset being compared
Returns
A float between 0 and 1 (inclusive) storing the actual similarity between datasets

◆ khash_minhash()

std::vector< uint64_t > khash_minhash ( std::vector< int >  inList,
std::vector< hashFunction >  hv 
)

Given a list of integers (possibly with repeated values) and a vector of hash functions, return the min hash value for each hash The values should be stored in order corresponding to the input hash vector hv.

In other words, the min value of the hash at index 0 in hv should be stored at index 0 of the returning vector

NOTE: This is similar to k-minhash but using k hash functions instead of k min values.

Parameters
inListA vector of uint64_t to be run through minhash
hvA vector of hash function that each take in an integer and returns a uint64_t
Returns
A vector of uint64_t containing the min value of k distinct hashes (order matching hv)

◆ kminhash()

std::vector< uint64_t > kminhash ( std::vector< int >  inList,
unsigned  k,
hashFunction  h 
)

Given a list of integers (possibly with repeated values) and a hash function, return a list containing the k-minimum hash values.

If you generate the same hash value twice, you should only track it once for the purposes of recording minima. Your final return vector should be ordered from smallest to largest starting with the global min. In other words, the 1st minimum value should be stored at index 0 (and the k-th minimum at index k-1).

HINT: You are free to use your previous assignments to build a way of tracking k minimum values.

NOTE: If there arent enough unique hash values in the input set, you should use GLOBAL_MAX_INT to fill in the remaining positions. This is the only allowable duplicate.

Parameters
inListA vector of integers to be run through minhash
kAn unsigned integer describing the number of minima that need to be tracked (and returned)
hAn hash function that takes in an integer and returns a uint64_t
Returns
A vector of uint64_t containing the k-min hash values using one hash

◆ kpartition_minhash()

std::vector< uint64_t > kpartition_minhash ( std::vector< int >  inList,
int  part_bits,
hashFunction  h 
)

Given a list of integers (possibly with repeated values), a partition number (an integer number of bits), and a hash function return a list containing the min value for each partition.

For example, if we use one bit for our partition, we would have two partitions total corresponding to 0 / 1. If we had two bits, our partitions would be 00, 01, 10, 11. If we use n bits for our partition, we would have 2^n total partitions.

The partitions should always be based on the highest order bits.

NOTE: These are bit encodings of real values NOT a vector<bool> or equivalent. The rightmost bit is the LOWEST order bit.

If there are no items which hash within a partition, as a placeholder store the global maximum value at that location. For your convenience, GLOBAL_MAX_INT is already defined in sketching.h

Please also note that the 'partition' is purely conceptual for grouping and not an actual modification of the minhash values themselves. The return should be the full 64-bit hash value (uint64_t), NOT a truncation of the lower order bits. In other words, please do not remove the partition bits from your minhash value.

Parameters
inListA vector of integers to be run through minhash
part_bitsAn integer between 1 and 63 (inclusive) describing how many partition bits are used.
hAn hash function that takes in (and returns) an integer to be used by minhash
Returns
A vector of uint64_t containing the min hash values using one hash and k partitions

◆ minhash_cardinality()

int minhash_cardinality ( std::vector< uint64_t >  mh,
unsigned  k 
)

Given a MinHash sketch, estimate the cardinality using the k-th minimum value.

Note that k is not the same thing as the index in your vector: The 1st minimum value is the 0th index.

You should use the k-min value equation seen in class. However that equation was normalized for [0, 1] hashing. It is up to you to figure out how to modify the equation appropriately.

WARNING: Your division may result in a float value. The return value MUST ROUND UP.

TIP: Using the required method, both k-hash and k-partition minhash are only accurate if k=1 – why? Can you think of a way to improve the accuracy of these methods using ALL hash values?

Parameters
mhA kminhash storing k minimum hash values (generated from one hash or k hashes).
kThe value of k for our k-th minima value approach to estimating cardinality
Returns
An estimate of the total number of unique items in the hash

◆ minhash_jaccard()

float minhash_jaccard ( std::vector< uint64_t >  mh1,
std::vector< uint64_t >  mh2 
)

Given two minhash sketches, return the estimate of the Jaccard similarity between the two sketches.

This should be calculated directly as seen in lecture:

The intersection is the numeric count of the matches between mh1 and mh2 (the order doesnt matter)!

The union is the numeric count of the total number of unique values found when combining mh1 and mh2. That is to say if |mh1|=|mh2|=4 (k=4), and they intersect with 2 items, the union total is 6.

NOTE: In the slides we calculated Jaccard multiple ways. This is the ONLY allowed way for the assignment.

WARNING: You MUST ignore any instances of GLOBAL_MAX_INT when computing this similarity! Remember, GLOBAL_MAX_INT implies that there wasnt enough unique hashing items to finish the sketch. This is ok but shouldnt be counted as similarity, since GLOBAL_MAX_INT doesnt correspond to a real value.

Parameters
mh1The first minhash sketch being compared
mh2The second minhash sketch being compared
Returns
A float between 0 and 1 (inclusive) storing the estimated similarity between sketches