Learning Objectives
- Implement multiple forms of hashing
- Determine collision frequency and run-time differences in hashing approaches
Submission Instructions
Using the Prairielearn workspace, test and save your solution to the following exercises. (You are welcome to download and work on the files locally but they must be re-uploaded or copied over to the workspace for submission).
You will only need to modify the following files:
lab_hash.ipynb
Assignment Description
Hashing is one of the most important concepts in computer science (and my personal favorite topic). A well-defined general hash function converts any object in the universe into a clear numeric address that can then be used to store the relevant details of that object. And – if this is done well – all of this can be done in near-constant time!
In this lab, you will be tasked with implementing insert and find for three different hashing strategies discussed in class. You are strongly encouraged to play around with the size of the hash table you are building, the hash functions you are using, as well as the contents of what you are passing in (the provided code contains multiple examples).
Part 1: Separate Chaining
# INPUT:
# inList, A list of strings
# hash, A hash function that will always be h1, h2, or h3 (given hash functions)
# m, An integer defining the size of the output list
# OUTPUT:
# A hash table (as a list of lists) hashed with separate chaining
# That is to say it is a list where every index is a list storing the strings
# that hash to that specific index
def hash_chain(inList, hash, m):
# INPUT:
# ht, A hash table of size m (produced by hash_chain)
# query, The string being searched for
# hash, A hash function (provided as h1, h2, or h3)
# m, An integer defining the size of the output list
# OUTPUT:
# An integer denoting the number of steps it took to find the query
# OR -1 if query was not found.
def find_chain(ht, query, hash, m):
The first (and most straightforward) hashing strategy is separate chaining. Given an input hash function and array size, hash_chain()
should insert every object in the inputList into a two-dimensional list of lists representing the array itself and the corresponding chains. find_chain()
should return the number of steps it took to find the query using separate chaining (or -1 if the query is not found). To match the autograder, you must insert from the front (as though our list was a linked list).
To be clear on the dimensions of the hash table, the table should be initialized to m
empty lists and then each element is inserted. Thus you could have a single list of n
elements or a better distribution of values based on the hash function in question. More directly: Your output should have m
‘rows’ but is not guaranteed or even likely to be a square matrix.
Consider: What relationship exists between the distribution of our inputs at different hash values and the overall runtime?
NOTE: In all of these functions, one of the input itself is a function. To go full circle to the earliest days of the class, everything in Python is an object – including functions! We haven’t seen this in a while so this is a nice review.
Part 2: Linear Probing
# INPUT:
# inList, A list of strings
# hash, A hash function that will always be h1, h2, or h3 (given hash functions)
# m, An integer defining the size of the output list
# OUTPUT:
# A hash table as a list hashed with linear probing
# That is to say it is a list where every index is a string that hashed (with linear probing)
# to that specific index. If there is no string stored at that index, it has the value None
def hash_linear(inList, hash, m):
# INPUT:
# ht, A hash table of size m (produced by hash_linear)
# query, The string being searched for
# hash, A hash function (provided as h1, h2, or h3)
# m, An integer defining the size of the output list
# OUTPUT:
# An integer denoting the number of steps it took to find the query
# OR -1 if query was not found.
def find_linear(ht, query, hash, m):
Linear probing resolves a hash collision at an index i by attempting to store the value at position i+1 (and repeating until an open space has been found). You should implement both hash_linear()
and find_linear()
to take an arbitrary hash function as input and resolve hash collisions using linear probing.
HINT: You will need to have TWO break conditions for when it is clear that the object isn’t present – one for when the linear probing search identifies that the object was not inserted and one for the edge case where the entire table is full (but the object also isn’t present). How can you detect the former? The latter?
Part 3: Double Hashing
# INPUT:
# inList, A list of strings
# hash1, A hash function that will always be h1, h2, or h3 (given hash functions)
# hash2, A hash function that will always be h1, h2, or h3 (but different from h1)
# m, An integer defining the size of the output list
# OUTPUT:
# A hash table as a list hashed with linear probing
# That is to say it is a list where every index is a string that hashed (with linear probing)
# to that specific index. If there is no string stored at that index, it has the value None
def hash_double(inList, hash1, hash2, m):
# INPUT:
# inList, A list of strings
# query, The string being searched for
# hash1, A hash function that will always be h1, h2, or h3 (given hash functions)
# hash2, A hash function that will always be h1, h2, or h3 (but different from h1)
# m, An integer defining the size of the output list
# OUTPUT:
# An integer denoting the number of steps it took to find the query
# OR -1 if query was not found.
def find_double(ht, query, hash1, hash2, m):
Double hashing attempts to reduce the frequency of repeated hash collisions by giving each unique object a different pattern for resolving collisions based on a second hash h2. Unlike linear probing, it is entirely possible for an object to be impossible to insert despite having open positions still in the hash table. When inserting and finding into the hash table, you should break any operation after attempting to insert or find m
times.
To be clear here: double hashing using the provided hash functions can occassionally result in inserts that are impossible. For example, if the second hash value is 0 it doesnt matter how many times you try to rehash, it will always produce the same number. Your solution should be aware of this limitation and – when it cannot insert after m
iterations – stop. Failure to do this will likely break the autograder (which has random tests that will unpredictably result in this behavior).