mp_huffman

Heinous Huffman Trees

Due: Nov 03, 23:59

Learning Objectives

  • Review fundamentals of the tree ADT and introduce dictionaries
  • Explore a non-standard tree construction strategy
  • Experience the utility of data structures for data compression

Getting set up

From your CS 277 git directory, run the following:

git fetch release
git merge release/mp_huffman -m "Merging initial mp_huffman files"

You will find the relevant code and data as sub-folders within mp_huffman.

Assignment Description

In this MP, you will explore a different tree application (Huffman Trees), which allows for efficient lossless compression of files. Effective compression tools are valuable both for long-term storage of information and for transmission. When choosing a compression method, keep in mind that a lossless compression is reversible and can restore the full original dataset. A lossy compression will usually yield greater space savings but at the cost of losing information – a lossy compression cannot be restored to the original dataset, though good lossy compression strategies can produce a very close approximation.

The Huffman Encoding

In 1951, while taking an Information Theory class, David A. Huffman was given a choice between writing a term paper and taking a final exam. The paper had to be about the process of finding the most efficient binary encoding – and Huffman solved the problem (and published it) rather than take the exam. His solution, Huffman coding, takes an input text (or any alphabet-based dataset) and generates a binary code (a string of 0’s and 1’s) that represents that text.

By building a Huffman Tree on an input text, you get the optimal Huffman code for that dataset. However, it is often the case that you would like to ‘train’ (build) a Huffman Tree on one dataset and then use it to encode other messages. In this MP, you are tasked with implementing several key functions for a pre-written Huffman Tree implementation.

Step 1: Parse the input text file

# INPUT:
# inFile, a string storing the input name of a text file
# OUTPUT:
# A list of (key, value) tuples stored in order of increasing frequency (smallest first)
List(tuple<str, int>) parseFile(str inFile)

Building a Huffman code requires first calculating the frequency of every character in the text and storing the results as (character, frequency) pairs. To assist you with this process, you are encouraged to use the freqSorter class. In particular, the addChar() function increments an internal counter for the given character by one, and the getList() function returns a sorted list of tuples.

HINT: Your part of the code needs to handle identifying each individual character (with repetition) within the file. Remember that " " (‘SPACE’) is a character too!

Example:

Input: “feed me more food”

Output: A list of tuples (character, frequency)

(r, 1) | (d, 2) | (f, 2) | (m, 2) | (o, 3) | ('SPACE', 3) | (e, 4)
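The provided freqSorter class does the counting and sorting for you; the sketch below is only an illustration of the behavior you are aiming for, using the standard library instead (the function name parse_text is hypothetical, not part of the assignment code):

```python
from collections import Counter

def parse_text(text):
    """Count character frequencies and return (character, count) tuples,
    smallest count first -- mirroring freqSorter's getList()."""
    counts = Counter(text)  # every character counts, spaces included
    return sorted(counts.items(), key=lambda pair: pair[1])
```

Note that " " appears in the result like any other character; the example above merely displays it as 'SPACE' for readability.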

Step 2: Build the Huffman Tree

# INPUT:
# A list of (key, value) tuples stored in order of increasing frequency (smallest first)
# OUTPUT:
# The root node of a Huffman Tree built on the input character frequencies
treeNode HuffmanTree.buildHuffman(List(tuple<str, int>) char_freq_list)

Given the character frequencies as a sorted list, the next step is to construct a binary tree from the bottom up. Start by making a treeNode for each character (and its frequency) and storing them in a list sorted by frequency. (Hint: the input list is already in that order.) Then, while there is more than one treeNode in the list, take the two least frequent treeNodes and merge them.

Merge two nodes by creating a new third treeNode that has the combined character and frequency count of both. This internal treeNode should also have two children – the left child should be the smaller of the two nodes merged and the right child should be the other. Finally, you should add this new merged node to the list of possible nodes that still need to be merged. (The process stops when there is only one node left in the list – a root which stores the total set of characters and the total frequency count in the dataset).

In this way you gradually build a tree from the bottom up by repeatedly merging.

HINT: The provided nodeSorter class will be very helpful for this function. If used correctly, it will store a list of all available nodes in the merge process. You should familiarize yourself with the enqueue and dequeue functions in particular, which will add any new nodes you create to the list (and then sort for you!) and remove and return the smallest node from the list respectively.

Example: (Note that the tree visualization will sometimes ‘flip’ 0 and 1, but this is merely a visual bug – the ‘smaller’ character is always assigned 0.)

First Merge: Removes ‘r’ and ‘d’ and adds new node ‘rd’.

LIST: (f, 2) | (m, 2) | (o, 3) | ('SPACE', 3) | (rd, 3) | (e, 4)

Second Merge: Removes ‘f’ and ‘m’ and adds new node ‘fm’:

LIST: (o, 3) | ('SPACE', 3) | (rd, 3) | (e, 4) | (fm, 4)

Third Merge: Removes ‘o’ and ‘SPACE’ and adds new node ‘o+SPACE’

LIST: (rd, 3) | (e, 4) | (fm, 4) | (o+SPACE, 6)

Fourth Merge: Removes ‘rd’ and ‘e’ and adds new node ‘rde’

LIST: (fm, 4) | (o+SPACE, 6) | (rde, 7)

Fifth Merge: Removes ‘fm’ and ‘o+SPACE’ and adds new node ‘fmo+SPACE’

LIST: (rde, 7) | (fmo+SPACE, 10)

Sixth Merge: Removes ‘rde’ and ‘fmo+SPACE’ and adds new node ‘rdefmo+SPACE’

LIST: (rdefmo+SPACE, 17) # We are done! This is the root of our full Huffman Tree!
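The merge loop above can be sketched with Python’s heapq standing in for the provided nodeSorter. TreeNode here is a minimal stand-in for the assignment’s treeNode class, so treat this as an illustration of the idea, not the required implementation:

```python
import heapq

class TreeNode:
    """Minimal stand-in for the provided treeNode class."""
    def __init__(self, chars, freq, left=None, right=None):
        self.chars, self.freq = chars, freq
        self.left, self.right = left, right

def build_huffman(char_freq_list):
    # One leaf per (character, frequency) pair. The running counter breaks
    # frequency ties by insertion order, so heapq never compares TreeNodes.
    heap = [(freq, i, TreeNode(ch, freq))
            for i, (ch, freq) in enumerate(char_freq_list)]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        _, _, small = heapq.heappop(heap)   # least frequent -> left child
        _, _, big = heapq.heappop(heap)     # next least -> right child
        merged = TreeNode(small.chars + big.chars,
                          small.freq + big.freq, small, big)
        heapq.heappush(heap, (merged.freq, counter, merged))
        counter += 1
    return heap[0][2]                       # the lone survivor is the root
```

Run on the example list, this performs exactly the six merges shown above and returns a root with frequency 17.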

Step 3: Build an encoder dictionary!

# INPUT:
# None (self.root stores the Huffman Tree!)
# OUTPUT:
# Nothing
# However the function should produce a dictionary (variable 'self.encoder')
# that stores the 'binary' Huffman encoding of each character.
#
# The dictionary key is the character (str) and the value is a string of 1s and 0s
None HuffmanTree.buildEncoder()

Once the tree is built, you have your full Huffman code – but you need to traverse the tree in order to use it. Write a recursive traversal of the Huffman Tree that remembers the ‘path’ it took: any time you recurse to the left, append a "0" to the code; any time you recurse to the right, append a "1". When you reach a leaf, insert a key, value pair into the dictionary (self.encoder), where the key is the character stored in the leaf and the value is the code you have built up through the recursion.

For example, the encoder for the earlier example is gradually built by traversing and contains the following key, value pairs:

encoder["e"] = "01"
encoder["r"] = "000"
encoder["d"] = "001"
encoder["f"] = "100"
encoder["m"] = "101"
encoder["o"] = "110"
encoder["SPACE"] = "111"
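One way to see the traversal is the sketch below, which – as an illustrative simplification, not the assignment’s treeNode class – represents a leaf as a one-character string and an internal node as a (left, right) pair:

```python
def build_encoder(node, path="", encoder=None):
    """Recursive traversal: going left appends '0', going right appends '1',
    and each leaf records the accumulated path as its code."""
    if encoder is None:
        encoder = {}
    if isinstance(node, str):      # leaf: store the finished code
        encoder[node] = path
        return encoder
    left, right = node             # internal node: recurse both ways
    build_encoder(left, path + "0", encoder)
    build_encoder(right, path + "1", encoder)
    return encoder

# The example tree above, written as nested (left, right) pairs,
# with ' ' standing in for 'SPACE'
example_tree = ((("r", "d"), "e"), (("f", "m"), ("o", " ")))
```

Calling build_encoder(example_tree) reproduces the seven codes listed above.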

Step 4: Encode text!

# INPUT:
# An input string (the message being encoded)
# OUTPUT:
# An output string (the encoded message, consisting of just 1s and 0s)
str HuffmanTree.encode(str inString)

The final step of using the Huffman Tree is to use HuffmanTree.encoder (the dictionary) to encode any string (including the original dataset!). To keep things simple, you only need to implement a single function that takes in some arbitrary string and returns a much longer encoded string. Note that there are no gaps or spaces in the binary string – by definition it is just 1s and 0s.

For example:

INPUT: “feed me more food”

OUTPUT: 10001010011111010111110111000001111100110110001
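The lookup itself is short; here is a minimal sketch, assuming the codes from the example encoder above with ' ' standing in for 'SPACE' (the standalone encode function is illustrative, unlike the method signature above):

```python
def encode(encoder, message):
    """Look up each character's code and concatenate them -- no separators
    are needed because Huffman codes are prefix-free."""
    return "".join(encoder[ch] for ch in message)

# Codes from the example encoder above
example_codes = {'e': '01', 'r': '000', 'd': '001',
                 'f': '100', 'm': '101', 'o': '110', ' ': '111'}
```

For the example message, the encoded string is 47 bits long, while a plain 8-bit encoding of the 17-character message would take 136 bits.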

Step 5: Decode text! (Optional!)

# INPUT:
# inString, the 'binary' string (consisting of just 1s and 0s) being decoded
# OUTPUT:
# A string containing the original decoded text
str HuffmanTree.decode(str inString)

If you are interested in a challenge, consider how you might decode a Huffman code given the Huffman Tree. As a hint, the path in the Huffman Tree to the leaf ‘f’ is right-left-left, which just so happens to match its encoding, 100.

HINT: Understanding conceptually how decode works is much easier than actually coding it. There are multiple ways to implement this function, and most of the recursive ones are somewhat tricky. My own solution converted the string to a list (which I found easier to handle), but you can also use a recursive strategy similar to quickSort – or go your own way.
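As one iterative alternative to the recursive strategies mentioned, you can invert the encoder dictionary and scan the bits greedily – the prefix-free property guarantees that the first code the buffer matches is the right one. A sketch (here decode takes the code dictionary as an argument, unlike the method above):

```python
def decode(codes, bits):
    """Grow a buffer one bit at a time and emit a character whenever the
    buffer matches a code; prefix-freeness means this never misfires."""
    reverse = {code: ch for ch, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:
            out.append(reverse[buffer])
            buffer = ""
    return "".join(out)
```

Decoding the encoded example string with the example codes recovers "feed me more food" exactly.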

Additional trivia

Notice that in our example above, the Huffman tree that we built does not contain every letter of the alphabet. So while we can encode our message and some other words like “door” or “deer”, the tree won’t help us if we need to send a message containing a letter it lacks. For our Huffman encoding to be applicable in the real world, we need to build a Huffman tree that contains all the letters of the alphabet – which means that instead of using “feed me more food” to build our tree, we should use a text document that contains every letter. As a fun example, here is the Huffman tree that results when we use the text of the Declaration of Independence to build it.