lab_huffman

Heroic Huffman Trees

Due: Apr 03, 23:59

Learning Objectives

  • Review fundamentals of binary trees
  • Experience using data structures for data compression
  • Practice more open-ended coding problems

Submission Instructions

Using the PrairieLearn workspace, test and save your solutions to the following exercises. (You are welcome to download and work on the files locally, but they must be re-uploaded or copied over to the workspace for submission.)

Assignment Description

In this lab, you will be exploring a different tree application (Huffman Trees), which allows for efficient lossless compression of files (also known as Huffman encoding). More generally, compressing data is a very powerful tool in data science, as it allows us to more efficiently store and transmit datasets. You often use compression without even realizing it: for example, PNG images are compressed losslessly, while JPG images use lossy compression!

The Huffman Encoding

As a fun trivia fact, Huffman encoding was developed as part of a class assignment: in 1951, while taking an Information Theory class as a student at MIT, David A. Huffman was given a choice of writing a term paper on finding the most efficient binary code or taking a final exam. Huffman's solution to the problem turned out to be provably the most efficient binary code possible, surpassing his own professor's previous work on the topic.

To understand Huffman coding, let's imagine the following scenario. We have a four-letter alphabet (A, B, C, D) and we want to store a large message written in this alphabet using the fewest possible bits. If every letter has an equal frequency, the best we can do is store every character with two bits (A=00, B=01, C=10, D=11). But what if 80% of our message is the letter A and only 1% of our message is the letter D? If we can somehow store A as a single bit (0), we can greatly reduce the storage cost of our message! As it turns out, we can't do this without increasing the bit cost of a different character, but if we have a very infrequent character (like D!) then we can just make sure that is the character with the highest cost.
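To put numbers on this intuition, here is a quick back-of-the-envelope calculation. The variable-width code lengths below (1 bit for A, up to 3 bits for the rare letters) are one typical Huffman outcome for such skewed frequencies, shown here as an illustration rather than the unique answer:

```python
# Hypothetical skewed frequencies for the four-letter alphabet (A, B, C, D).
freqs = {"A": 0.80, "B": 0.15, "C": 0.04, "D": 0.01}

# Fixed-width encoding: every letter costs 2 bits.
fixed_cost = sum(p * 2 for p in freqs.values())

# A Huffman-style variable-width code: A=0, B=10, C=110, D=111.
code_len = {"A": 1, "B": 2, "C": 3, "D": 3}
var_cost = sum(freqs[c] * code_len[c] for c in freqs)

print(fixed_cost, var_cost)  # 2.0 vs 1.25 bits per character on average
```

Even though C and D now cost 3 bits each, they are so rare that the message shrinks by over a third on average.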

Let's see some examples of this in action below:

Building the Huffman tree

Input: “feed me more food”

Step 1: Calculate frequency of every character in the text, and order by increasing frequency. In our case we will be storing these in a sorted list called single (acting as a queue once built).

r : 1 | d : 2 | f : 2 | m : 2 | o : 3 | 'SPACE' : 3 | e : 4
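The frequency table above can be reproduced in a few lines (a sketch; the lab's provided build_frequency() may store and order the counts differently, and characters with equal counts can be ordered either way):

```python
from collections import Counter

text = "feed me more food"
# Sort by increasing frequency, breaking ties alphabetically.
counts = sorted(Counter(text).items(), key=lambda kv: (kv[1], kv[0]))
print(counts)
# [('r', 1), ('d', 2), ('f', 2), ('m', 2), (' ', 3), ('o', 3), ('e', 4)]
```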

Step 2: Build the tree from the bottom up. Start by taking the two least frequent characters and merging them (create a parent node for them). Store the merged characters in a new queue (the merge queue):

SINGLE: f : 2 | m : 2 | o : 3 | 'SPACE' : 3 | e : 4

MERGED: rd : 3

Step 3: Repeat Step 2, this time considering the elements in both the single and merged queues. Now ‘f’ and ‘m’ are the two elements with the lowest frequency, so we merge them:

SINGLE: o : 3 | 'SPACE' : 3 | e : 4

MERGED: rd : 3 | fm : 4

Step 4: Repeat Step 3 until there are no more elements in the SINGLE queue, and only one element in the MERGED queue:

SINGLE: e : 4

MERGED: rd : 3 | fm : 4 | o+SPACE : 6
SINGLE:

MERGED: fm : 4 | o+SPACE : 6 | rde : 7
SINGLE:

MERGED: rde : 7 | fmo+SPACE : 10
SINGLE:

MERGED: rdefmo+SPACE : 17
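The whole merge loop above can be sketched as follows. This is a simplified model, not the lab's solution: nodes are (frequency, characters, left, right) tuples instead of bstNodes, and the getSmallest logic is inlined as a comparison of the two queue fronts. Ties may resolve in a different order than the walkthrough, but the root always ends up with the total frequency:

```python
from collections import Counter, deque

def build_huffman(text):
    # Leaves sorted by increasing frequency: (freq, chars, left, right).
    single = deque(sorted(
        (n, c, None, None) for c, n in Counter(text).items()))
    merged = deque()

    def pop_smallest():
        # Take from whichever queue has the smaller front frequency.
        if not merged or (single and single[0][0] <= merged[0][0]):
            return single.popleft()
        return merged.popleft()

    while len(single) + len(merged) > 1:
        a = pop_smallest()
        b = pop_smallest()
        # Parent node combines the frequencies; children keep their subtrees.
        merged.append((a[0] + b[0], a[1] + b[1], a, b))
    return merged[0] if merged else single[0]

root = build_huffman("feed me more food")
print(root[0], sorted(root[1]))  # total frequency 17, all 7 characters
```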

From Text to Binary

Now that we have built our Huffman tree, it's time to see how to encode our original message “feed me more food” into binary code.

Step 1: Label the branches of the Huffman tree with a ‘0’ or ‘1’. BE CONSISTENT: in this example we chose to label all left branches with ‘0’ and all right branches with ‘1’. In the case of our visualizations, some of the nodes may be swapped for convenience of placement. However, any path with label ‘1’ is intended to be on the right side, and any path with label ‘0’ is intended to be on the left side.

Step 2: Taking one character at a time from our message, traverse the Huffman tree to find the leaf node for that character. The binary code for the character is the string of 0’s and 1’s in the path from the root to the leaf node for that character. For example: ‘f’ has the binary code: 100

So our message “feed me more food” becomes 10001010011111111010111111110111000001111111100110110001
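Once every character's code is known, encoding a message is just a lookup and concatenation per character. The sketch below uses a tiny three-letter toy code for illustration, not the codes of the example tree above:

```python
def encode(message, codes):
    # Concatenate the bit-string for each character in order.
    return "".join(codes[ch] for ch in message)

# A toy prefix-free code (illustrative only, not the example tree's codes).
codes = {"a": "0", "b": "10", "c": "11"}
print(encode("abca", codes))  # -> "010110"
```

Because no code is a prefix of another, the concatenated bits can later be split back into characters unambiguously.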

From Binary Code to Text

We can also decode strings of 0’s and 1’s into text using our Huffman tree. What word does the code 00001001 translate to?
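Decoding works by walking from the root, going left on ‘0’ and right on ‘1’, and emitting a character each time a leaf is reached. Here is a sketch using nested tuples (left, right) for internal nodes and plain strings for leaves; the lab's bstNode class will look different:

```python
def decode(bits, root):
    out, node = [], root
    for bit in bits:
        # Internal nodes are (left, right) pairs; leaves are characters.
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):
            out.append(node)
            node = root  # restart at the root for the next character
    return "".join(out)

# Tiny toy tree: 'a' = 0, 'b' = 10, 'c' = 11.
tree = ("a", ("b", "c"))
print(decode("010110", tree))  # -> "abca"
```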

What About the Rest of the Alphabet?

Notice that in our example above, the Huffman tree that we built does not contain all of the alphabet's letters. So while we can encode our message and some other words like “door” or “deer”, it won't help us if we need to send a message containing a letter that's not in the tree. For our Huffman encoding to be applicable in the real world, we need to build a Huffman tree that contains all the letters of the alphabet, which means that instead of using “feed me more food” to build our tree, we should use a text document that contains all letters of the alphabet. As a fun example, here is the Huffman tree that results when we use the text of the Declaration of Independence to build it.

Lab Assignment

In this lab you are tasked with completing the remaining code blocks necessary to build a Huffman Tree and encode messages. Decoding a Huffman code is left as an optional exercise but is well worth considering if you want to challenge yourself.

getSmallest()

### getSmallest()
# Input:
# Two (potentially empty) lists of bstNodes (single, merge)
# The lists are sorted based on the values of the bstNodes (frequencies)
# Output:
# A bstNode containing the smallest frequency in either list
bstNode getSmallest(list[bstNode] single, list[bstNode] merge):

The actual build process for the Huffman Tree has been broken into several parts, with the first part, build_frequency(), implemented for you. The output of build_frequency() is a sorted list of bstNodes where every node holds a single character. However, as characters are merged together, they are placed in the second list, merge. Accordingly, write a function which takes in two sorted lists and returns the minimum-frequency node from either list.

Note: You must account for edge cases carefully: the very first call to getSmallest() is going to have an empty merge list, and eventually there will be only a single item in merge and zero items in single.
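One way to structure the comparison, hedged as a sketch: the Node class below is a minimal stand-in for the lab's bstNode, the freq attribute name is an assumption (check the provided class definition), and this version also removes the returned node from its list, which is what the build loop will ultimately need:

```python
class Node:
    # Minimal stand-in for the lab's bstNode (freq attribute assumed).
    def __init__(self, freq):
        self.freq = freq

def get_smallest(single, merge):
    # Both lists are sorted by increasing frequency, so the only
    # candidates are the two fronts. At least one list is non-empty.
    if not single:
        return merge.pop(0)
    if not merge:
        return single.pop(0)
    if single[0].freq <= merge[0].freq:
        return single.pop(0)
    return merge.pop(0)
```

Handling the two empty-list cases first keeps the final comparison safe to evaluate.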

buildHuffman()

### buildHuffman()
# Input:
# A string storing the input string being encoded (instring)
# Output:
# A bstNode storing the root of the full Huffman Tree
bstNode buildHuffman(string instring):

The Huffman Tree is the first tree you've seen that is built from the bottom up (starting from the leaves). The provided code starts this for you by setting the single list equal to a sorted collection of single-character nodes (the leaves). Use getSmallest() and your understanding of Huffman Trees to progressively build the Huffman Tree by merging these leaves into new internal nodes.

The output of this function should be the root of the full tree (the node containing every character and the total frequency of all characters).

Consider: What is the Big O of building the Huffman Tree?

buildEncoder()

### buildEncoder()
# Input:
# a bstNode storing the root of the Huffman tree (node)
# a string storing the current encoding for the current node (code)
# a dictionary storing all binary encodings for each character (outDict)
# Output:
# None
# Instead outDict gets modified every time you reach a leaf
# (So after completing the function the dictionary has a key for every character)
None buildEncoder(bstNode node, string code, map(string, string) outDict):

The Huffman Tree defines the encoding of every character based on the path from root to leaf (as a series of 0s (left) and 1s (right)). But it's a pain (and computationally inefficient) to walk through the tree every time you want to encode a character. So let's walk through every node in the tree once and use that to build a dictionary of the exact mapping of characters to bits!

However, unlike previous assignments, how you do this is for you to figure out. As a hint, it's best done as a recursive process, using a string and a dictionary as additional arguments to keep track of your current position!
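The recursive pattern the hint describes can be seen on a toy tuple-based tree rather than the lab's bstNode (a sketch: leaves are characters, internal nodes are (left, right) pairs; your solution will inspect bstNode fields instead):

```python
def build_codes(node, code, out):
    if isinstance(node, str):
        # Leaf: the accumulated path is this character's code.
        out[node] = code
        return
    # Internal node: extend the path with '0' (left) or '1' (right).
    build_codes(node[0], code + "0", out)
    build_codes(node[1], code + "1", out)

codes = {}
build_codes(("a", ("b", "c")), "", codes)
print(codes)  # {'a': '0', 'b': '10', 'c': '11'}
```

Each recursive call carries the path built so far, so the dictionary is filled in exactly once per leaf.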