Learning Objectives
- Practice a real-world application for binary trees
- Experience using data structures for data compression
- Store a text encoding in a dictionary (and use it)
Submission Instructions
Using the Prairielearn workspace, test and save your solution to the following exercises. (You are welcome to download and work on the files locally but they must be re-uploaded or copied over to the workspace for submission).
You will only need to modify the following files:
lab_huffman.ipynb
Assignment Description
In this lab, you will be exploring a different tree application (Huffman Trees), which allow for efficient lossless compression of files (also known as Huffman encoding). More generally, compressing data is a very powerful tool in data science as it allows us to more efficiently store and transmit datasets. You often use compression without even realizing it – for example PNGs are lossless images but JPGs are compressed!
The Huffman Encoding
As a fun trivia fact Huffman encoding was developed as part of a class assignment: In 1951, while taking an Information Theory class as a student at MIT, David A. Huffman was given a choice of writing a term paper on finding the most efficient binary code or taking a final exam. Huffman’s solution to the problem turned out to be provably the most efficient binary code possible, surpassing his own professor’s previous work on the topic.
To understand Huffman coding, lets imagine the following scenario. We have a four letter alphabet (A, B, C, D) and we want to store a large message written in this alphabet using the fewest possible bits. If every letter has an equal frequency, the best we can do is store every character with two bits (A=00, B=01, C=10, D=11). But what if 80% of my message is the letter A and only 1% of my message is the letter D? If I can somehow store A as a single bit (0), I can greatly reduce the storage cost of my message! As it turns out, we can’t do this without increasing the bit cost of a different character, but if we have a very infrequent character (like D!) then we can just make sure that is the character with the highest cost.
Lets see some examples of this in action below:
Building the Huffman tree
Input: “feed me more food”
Step 1: Calculate frequency of every character in the text, and order by increasing frequency. In our case we will be storing these in a sorted list of treeNodes()
called SINGLE
, where we store both (key, value)
pairs as ('character', 'count')
.
r : 1 | d : 2 | f : 2 | m : 2 | o : 3 | 'SPACE' : 3 | e : 4
Step 2: Build the tree from the bottom up! To do this, we take the smallest frequency treeNode and remove it from the list! We then do this a second time to find the second smallest frequency treeNode and merging the two treeNode into one ‘new’ treeNode by concatenating their keys (characters), summing their values (frequency), and setting the left child to the smallest frequency treeNode and the right to the second smallest. See below for an example of the merge of ‘r’:1 and ‘d’:2 into ‘rd’:3! This new treeNode is then added to a new list – the MERGED
list.
SINGLE: f : 2 | m : 2 | o : 3 | 'SPACE' : 3 | e : 4
MERGED: rd : 3
Step 3: Repeat Step 2 this time also considering the elements in both SINGLE
and MERGED
lists. If implemented correctly the smallest value will always be either the front item in the SINGLE
list or the MERGED
lists and new items will always be added to the back. Hang on – isn’t a list that only removes from FRONT and only adds at the BACK called something special? This is also why you must get the first smallest treeNode and THEN the second smallest, because the act of finding (and removing) the first smallest may change the second smallest value. In the event of a tie between SINGLE
and MERGED
, you should always pick the SINGLE
character.
In the context of our example, ‘f’ and ‘m’ this time are the two elements with the least frequency, so we merge them:
SINGLE: o : 3 | 'SPACE' : 3 | e : 4
MERGED: rd : 3 | fm : 4
Step 4: Repeat Step 3 until there are no more elements in the SINGLE queue, and only one element in the MERGED queue. At this point we have a full Huffman Tree:
SINGLE: e : 4
MERGED: rd : 3 | fm : 4 | o+SPACE : 6
SINGLE:
MERGED: fm : 4 | o+SPACE : 6 | rde: 7
SINGLE:
MERGED: rde: 7 | fmo+SPACE: 10
SINGLE:
MERGED: rdefmo+SPACE: 17
From Text to Binary
Now that we built our Huffman tree, its time to see how to encode our original message “feed me more food” into binary code.
Step 1: Label the branches of the Huffman tree with a ‘0’ or ‘1’. BE CONSISTENT: in this example we chose to label all left branches with ‘0’ and all right branches with ‘1’. The tool used to generate visualizations of trees may sometimes invert the order but not the labeling – any path with label ‘1’ is intended to be on the right side, and any path with label ‘0’ is intended to be on the left side. For the purposes of autograding, remember that ‘0’ (left) should be the smaller of the two nodes being merged.
Step 2: Taking one character at a time from our message, traverse the Huffman tree to find the leaf node for that character. The binary code for the character is the string of 0’s and 1’s in the path from the root to the leaf node for that character. For example: ‘f’ has the binary code: 100
So our message “feed me more food” becomes 10001010011111111010111111110111000001111111100110110001
Efficiency of Huffman Encoding Notice that in our Huffman tree, the more frequent a character is, the closer it is to the root, and as a result the shorter its binary code is. Can you see how this will result in compressing the encoded text?
From Binary Code to Text
We can also decode strings of 0’s and 1’s into text using our Huffman tree. What word does the code 00001001
translate to?
What About the Rest of the Alphabet?
Notice that in our example above, the Huffman tree that we built does not have all the alphabet’s letters; so while we can encode our message and some other words like “door” or “deer”, it won’t help us if we need to send a message containing a letter that’s not in the tree. For our Huffman encoding to be applicable in the real world we need to build a huffman tree that contains all the letters of the alphabet; which means instead of using “feed me more food” to build our tree, we should use a text document that contains all letters of the alphabet to build our Huffman tree. As a fun example, here is the Huffman tree that results when we use the text of the Declaration of Independence to build it.
Lab Assignment
In this lab you are tasked with completing the remaining code blocks necessary to build a Huffman Tree and encode messages. Decoding a Huffman code is left as an optional exercise but is well worth considering if you want to challenge yourself.
getSmallest()
### getSmallest()
# Input:
# Two (potentially empty) lists of bstNodes
# The lists are sorted based on the values of the bstNodes (frequencies)
# Output
# A bstNode containing the smallest frequency in either list
bstNode getSmallest(list[bstNode] single, list[bstNode] merge):
The actual build process for the Huffman Tree has been broken into several parts with the first part build_frequency()
implemented for you. The output of build_frequency()
is a sorted list of treeNodes()
where every node has a single character. However as characters are merged together, they get placed in the second list merge. Accordingly, write a function which takes in two sorted lists and returns the minimum frequency node in either list.
Note: You must account for edge cases very carefully in this assignment. The autograder for this assessment is very detailed in what it is checking. Make sure you understand and pass each of the hardcoded edge cases. This includes:
- Make sure that you remove the smallest item from the list while returning it. Like a dequeue or pop!
- Make sure that your code can handle when
SINGLE
orMERGED
are empty. - Make sure that your ties favor items in
SINGLE
.
buildHuffman()
# Input:
# A string storing the input string being encoded (instring)
# Output
# A bstNode storing the root of the full Huffman Tree
bstNode buildHuffman(string instring):
The Huffman Tree is the first tree you’ve seen which is built from the bottom up (starting from the leaves) and a great example of a binary tree where the structure has significant meaning. Use getSmallest()
and your understanding of Huffman Trees to progressively build the Huffman Tree by merging pairs of treeNodes into one new treeNode. Remember when doing so to:
- Concatenate the characters together with the characters from the ‘smallest’ treeNode coming first.
- Sum the frequencies from each child treeNode into the parent.
- Set the new node’s children equal to the left and right treeNode used to produce the new merged treeNode.
Doing this repeatedly should result in a single merged treeNode that is the root of the full tree (the node containing every character and the total frequency of all characters).
Consider: What is the Big O of building the Huffman Tree?
Tie Breaking
To make sure that there is a unique correct answer (and to help facilitate grading), make sure that when building internal nodes, the left child has the smallest frequency and the right child has the second smallest frequency. In other words, the first answer to getSmallest()
should always be on the left.
In getSmallest()
, break ties by taking the front of the singleQueue
!
buildEncoder()
# INPUT:
# a bstNode storing the root of the huffman tree (node)
# a string storing the current encoding for the current node (code)
# a dictionary storing all binary encodings for each character (outDict)
# Output:
# None
# Instead outDict gets modified every time you reach a leaf
# (So after completing the function the dictionary has a key for every character)
None buildEncoder(bstNode node, string code, map(string, string) outDict):
The Huffman Tree defines the encoding of every character based on the path from root to leaf (as a series of 0s (left) and 1s (right)). We can certainly search through the entire tree every time we want to encode a specific character, but that’s tremendously inefficient. Instead, lets walk through every node in the tree once and use that to build a dictionary of the exact mapping of characters to bits!
However unlike previous assignments, how you do this is for you to figure out. As a hint, its best done as a recursive process using a string and a dictionary as additional arguments to keep track of your current position!