lab_dict

Devious Dictionaries

Assignment Description

This lab is all about making you think with dictionaries. Dictionaries (aka maps or associative arrays) are an abstract data type which stores pairs of data and can support operations like insert, find, and remove. We generally call these (key, value) pairs. The idea is to associate a key with a certain value. Hence, dictionaries are useful for when you want to quickly “lookup” a value associated with a certain key. The name dictionary comes from the more familiar physical dictionary, where you lookup the definitions of words. If we were to model a physical dictionary with an ADT dictionary the keys would be the words, and the values would be their definitions.

Since dictionaries are abstract in nature, they have no prescription for what underlying structure should be used to implement them. Common choices are tree-based structures (e.g. balanced binary search trees, BTrees, &c.) or some sort of hash table. Different underlying implementations have different performance considerations, and which you use in the real world will depend on things like desired performance and what is readily available.

For this lab, we’re not going to concern ourselves with the underlying implementation of dictionaries (we’ve already done that in previous labs). Instead, we’ll be using dictionary types that have already been defined for us. In C++ there are two “built-in” dictionary types: std::map and std::unordered_map. Both exist in the wonderful land of the Standard Template Library (STL). The STL is a subset of the C++ Standard Library (the library a compiler must implement to be standard-compliant) which provides data containers. As the name implies, these containers are templated and thus can contain any type (which meets the restriction of the container). You’ve probably experienced at least one of the STL types so far: std::vector, which is pretty much just a super fancy array. In general, the STL types are very convenient as they provide a lot of functionality (e.g. sorting, automatic resizing) that you don’t have to bother implementing yourself.

Dictionaries are amongst some of the most important abstract data types that are used in the real world. Therefore, this lab is perhaps one of the most useful labs from a real world perspective. Mastering the concepts that you see in this lab will be a tremendous help for most technical interviews and programming problems in the future.

Now that we’ve talked about what dictionaries and underlined their importance, let’s move on the actual contents of this lab. This lab is composed of a bunch of different components. We suggest you do them in the order that we describe them over here.

Lab Insight

Dictionaries are important data structures that allow you to quickly search for a value given a key. One important use for dictionaries is memoization, which is used in dynamic programming. This basically means just storing the intermediate values of an algorithm to speed up later computations. This concept is discussed more below. To learn more about the practical uses of dictionaries, CS 374, CS 473 will delve deeper into the applications.

Checking Out the Code

All assignments will be distributed via our release repo on github this semester. You will need to have set up your git directory to have our release as a remote repo as described in our git set up.

You can merge the assignments as they are released into your personal repo with

git pull --no-edit --no-rebase release main --allow-unrelated-histories
git push

The first git command will fetch and merge changes from the main branch on your remote repository named release into your personal. The --no-edit flag automatically generates a commit message for you, and the--no-rebase flag will merge the upstream branch into the current branch. Generally, these two flags shouldn’t be used, but are included for ease of merging assignments into your repo.

The second command will push to origin (your personal), which will allow it to track the new changes from release.

You will need to run these commands for every assignment that is released.

All the files for this lab are in the lab_dict directory.

Preparing Your Code

This semester for MPs we are using CMake rather than just make. This allows for us to use libraries such as Catch2 that can be installed in your system rather than providing them with each assignment. This change does mean that for each assignment you need to use CMake to build your own custom makefiles. To do this you need to run the following in the base directory of the assignment. Which in this assignment is the lab_dict directory.

mkdir build
cd build

This first makes a new directory in your assignment directory called build. This is where you will actually build the assignment and then moves to that directory. This is not included in the provided code since we are following industry standard practices and you would normally exclude the build directory from any source control system.

Now you need to actually run CMake as follows.

cmake ..

This runs CMake to initialize the current directory which is the build directory you just made as the location to build the assignment. The one argument to CMake here is .. which referes to the parent of the current directory which in this case is top of the assignment. This directory has the files CMake needs to setup your assignment to be build.

At this point you can in the build directory run make as described to build the various programs for the MP.

The code for this activity resides in the lab_dict/ directory.

Memoization

Memoization is an aspect of Dynamic Programming, which you’ll learn much more about when you take CS 374. However, it’s a fairly simple idea, and a common implementation uses dictionaries at its core.

Factorial

Our first introduction to memoization will be through the use of factorials.

The factorial is a deterministic function: for a given input \(n\), the value (\(n!\)) is always the same. The formal (recursive) definition is:

\begin{align} 0! &= 1 \\
n! &= n \times (n-1)! \end{align}

For example, say we already know the value of \(5!\) and we want to compute \(6!\). The naïve way computing of \(6!\) would be to calculate \(6! = 6\times(5\times(4\times(3\times(2\times(1)))))\). Since we already know the value of \(5!\), however, we can save quite a bit of computation (and time) by directly computing \(6! = 6\times5!\), using the known value of \(5!\).

This idea is essentially what memoization is. According to Wikipedia:

[M]emoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again

Now let’s look at how we do this in code by looking at the memoized_fac function in fac.cpp. On line 41, we find:

static map<unsigned long, unsigned long> memo = { {0, 1} };

The left hand side is creating a variable named memo which is a std::map that associates keys of type unsigned long to values of type unsigned long. In our case, we’re going to map a key \(n\) to the value \(n!\).

The static keyword is a neat little thing that basically makes this local variable remain initialized for the lifetime of the program so that we can reuse it across multiple calls to the function. It’s like a global variable, but it’s only accessible by that function.

The right hand side initializes the map to contain a key-value pair where the key is 0 and the value is 1. Notice that this key-value pair corresponds to the base case of our mathematical definition of \(n!\) given above.

Alternatively, we could have used

static map<unsigned long, unsigned long> memo;
memo[0] = 1;

The disadvantage of this method is that now the key 0’s value will be changed to the value 1 during every function call, as opposed to just at initialization as in the first example.

Line 45 is

map<unsigned long, unsigned long>::iterator lookup = memo.find(n);

memo.find(n) uses the std::map::find function. As mentioned above, this method returns an iterator; the type of lookup is std::map<unsigned long, unsigned long>::iterator.

Lines 46–52 are:

if (lookup != memo.end()) {
    return lookup->second;
} else {
    unsigned long result = n * memoized_fac(n - 1);
    memo[n] = result;
    return result;
}

Going back to what we already know about iterators and std::map::find, a return value of memo.end() basically implies that the key n does not exist in the map. In this case, we need to compute n! and update our map to store that value, which is exactly what we do in the else branch.
A std::map::iterator is basically a pointer to a std::pair<const key_type, mapped_type> . A std::pair stores two objects (key and its associating data in the case of map) in a pair. To access the first object in a std::pair, we use the member name first; to access the second object, we use the member name second. For example, in the if branch above, lookup->second returns the value stored in the map for key lookup->first, which is going to be n in our case. If we wanted to be clever, we could rewrite this code as:

if (lookup == memo.end()) {
    memo[n] = n * memoized_fac(n - 1);
}
return memo[n];

Memoizing the factorial function doesn’t actually do much for us in terms of computation time. To see this for yourself you can run:

make factorial
time ./factorial 20

and compare it to the time it takes for the memoized version by passing the -m flag:

time ./factorial 20 -m

Fibonacci

Now that you’re familiar with how to memoize the factorial function, you’ll apply memoization to the Fibbonacci function. As we’ll soon find out, memoizing this function has real speed benefits.

The Fibonacci numbers are recursively defined as:

\begin{align} F(0) &= 0 \\
F(1) &= 1 \\
F(n) &= F(n-1) + F(n-2) \end{align}

If you draw out some trees for calculating these numbers (which I highly recommend) you’ll notice something interesting. For example here is such a tree for \(F(6)\); each node represents a function call:

There’s quite a bit of repetition in this tree. Specifically the entire \(F(3)\) subtree appears three times in the tree and \(F(4)\) twice. The values in those trees get calculated twice. For larger values of \(n\), the tree naturally gets larger and so too do the repeated subtrees. Remember that the nodes in this tree represent function calls, so each one takes some time. Suppose the \(F(4)\) call on the right happens before the \(F(4)\) call on the left. By the time we get to the \(F(4)\) call on the left, we already knew the value of \(F(4)\). What if we could store the result of \(F(4)\)? If we stored it, then when we call \(F(4)\) on the left we don’t need to bother with going through that entire subtree.

As we found out with factorial, we can use a dictionary for exactly this purpose. Imagine that inside our \(F\) function that we associate an input number \(n\) with a Fibonacci number. Every time we successfully calculate a value \(F(n)\) we store the pair \((n, F(n))\) in our dictionary. So now in our function we first check to see if \(n\) already exists in the dictionary (i.e. we’ve already computed it). If it is, we return that immediately. Otherwise, we make the normal recursive call, store the result in the dictionary, and then return the result. Using memoization the tree becomes:

The dashed nodes are ones which didn’t need to be calculated; one of their parent’s values was taken from the dictionary rather than being recalculated from scratch. The red nodes are the only calls where it’s calculated. As you’ll see in a while, the difference memoization makes is rather dramatic.

Your task is to implement both the normal and memoized version of the fib function in fib.cpp. After you do this you can race them with the fib_generator executable:

make fib_generator
time ./fib_generator 45

To use the memoized version, pass the -m flag:

time ./fib_generator 45 -m

Common Words

The next part of this assignment involves making an object that can find common words in different files. The end goal is: given a list of files and a number \(n\) find all words that appear in every file at least \(n\) times. This is a sort of conditional intersection of the words in the files. As you might expect, you’ll be using a dictionary to solve this problem, with the words as the keys, and their frequencies as the values. (This is a common application of associative structures). You will accomplish this by finishing the CommonWords class.

You can find the Doxygen for the CommonWords class here. You are responsible for writing the init_file_word_maps, init_common, and get_common_words functions. The former two will help you write the latter function. The header file (common_words.h) has descriptions of the variables you are initializing.

Once you’ve written your class, you can compile it with make find_common_words. You can then test it on some small text files or novels:

./find_common_words ../data/small1.txt ../data/small2.txt -n 3
dog
pig
./find_common_words ../data/PrideAndPrejudice.txt ../data/Beowulf.txt -n 500
and
in
of
the
to

The executable command below finds all the words that appear greater than or equal to \(n\) times in ALL the parameter text files. Here is the usage:

./find_common_words [TEXT FILES] -n [NUM] [-o FILE]

Pronunciation Puzzler

For this part of the lab you’ll complete an object which will help solve the following puzzle (taken from CarTalk):

This was sent in by a fellow named Dan O’Leary. He came upon a common one-syllable, five-letter word recently that has the following unique property. When you remove the first letter, the remaining letters form a homophone of the original word, that is a word that sounds exactly the same. Replace the first letter, that is, put it back and remove the second letter and the result is yet another homophone of the original word. And the question is, what’s the word?

Now I’m going to give you an example that doesn’t work. Let’s look at the five-letter word, ‘wrack.’ W-R-A-C-K, you know like to ‘wrack with pain.’ If I remove the first letter, I am left with a four-letter word, ‘R-A-C-K.’ As in, ‘Holy cow, did you see the rack on that buck! It must have been a nine-pointer!’ It’s a perfect homophone. If you put the ‘w’ back, and remove the ‘r,’ instead, you’re left with the word, ‘wack,’ which is a real word, it’s just not a homophone of the other two words.

But there is, however, at least one word that Dan and we know of, which will yield two homophones if you remove either of the first two letters to make two, new four-letter words. The question is, what’s the word?

Basically, we’re looking to find words such that

  • the word itself,
  • the word with its first character removed, and
  • the word with its second character removed

are all homophones (i.e., have the same pronunciation) of each other.

To accomplish this we are going to employ the help of CMU’s Pronouncing Dictionary. The data associated with that dictionary is stored in a file (data/cmudict-0.7b) and read on construction of a PronounceDict object (see pronounce_dict.h). You don’t have to worry about the constructors this time since they’re not very interesting. Instead, you’re responsible for writing the homophones function which determines whether two words are homophones are not. Check out pronounce_dict.h to see what kind of dictionary structure you’re working with, and remember that it is mapping a word to its pronunciation.

Here is the Doxygen for the PronounceDict class.

PronounceDict expects the words to be uppercase. Here’s how you can transform a std::string str to uppercase in C++:

std::transform(str.begin(), str.end(), str.begin(), ::toupper);

It uses std::transform (you know how to look this function up, right?) We’ll leave it up to you to try and figure out exactly how it works. If you need help, post on Discord (or Piazza)!

Next you will have to write the code which actually solves the puzzle. This function resides in cartalk_puzzle.cpp. It takes a word list file name (similar to before) and a PronounceDict.

You need to be able to read in a list of words from the word list. To do this, you’ll need to know some basic C++ file I/O. This following snippet will print out every word in a word list file:

ifstream wordsFile(filename);
string word;
if (wordsFile.is_open()) {
    /* Reads a line from `wordsFile` into `word` until the file ends. */
    while (getline(wordsFile, word)) {
        cout << word << endl;
    }
}

To solve the actual puzzle, methods like the std::string’s substr function will probably prove useful. If all goes well you should be able to make it and run the executable with:

make homophone_puzzle
./homophone_puzzle

If you want to debug with cout, I would recommend typing using std::cout; and using std::endl; instead of a blanket using namespace std; or else you may get some nasty errors.

Anagrams

The final part of this assignment involves making a dictionary for looking up anagrams of a given word. Two words are anagrams of one another if we can rearrange the letters of one to form the other. For example, the letters in the word “dog” can be rearranged to form the word “god”. We want an object (built from a word list) whom I can feed the string "dog" and it will spit out both "dog" and "god".

You’ll find the definition of the AnagramDict class in anagram_dict.h. You are responsible for implementing both constructors, as well as the functions get_anagrams and get_all_anagrams. AnagramDict has an instance of map named dict which maps a string to a vector of strings.

Remember that when using dictionaries we need to figure out a way to uniquely identify each of our values. In this case we want to uniquely identify a group of words which are anagrams of one another: we need some sort of function f(key) that we can apply to a key such that f("dog") == f("god") and f("silent") == f("listen") == f("tensil"), but f("dog") != f("silent") Phrased differently, you’ll need to come up with some sort of a function, such that it always returns the same value for all words that are anagrams of one another.

Once you figure out what a function f must do to satisfy that constraint (and there are many possible answers), you’ll find it useful to Google how to actually implement that function if it seems non-trivial to you. (Actually, you’ll find it useful to Google for pretty much anything, including what should f do.)

Here is the Doxygen for the AnagramDict class.

One of the AnagramDict constructors takes a word list file name, which is expected to be a newline-separated list of words (see words.txt). Read this file the in same way as you did for the pronunciation puzzle.

To test your dictionary you can use the anagram_finder executable, which you can compile by running make anagram_finder. Example usage and output:

./anagram_finder dog
Anagrams for dog:
dog
god

You can run it with the flag -a to print out a list of all the known anagrams for this word list. -o will redirect the output to a file, e.g.

./anagram_finder -a -o anagrams.txt

You can diff your output of the above command with ../data/anagrams-expected.txt. You may also specify your own word list file with -w if you’re feeling adventurous (it defaults to the words.txt in the data directory).

Testing Your Program

As usual, there are Catch tests for this lab, which can be run with:

make test && ./test

Submitting Your Work:

The following files are used in grading:

  • anagram_dict.cpp
  • anagram_dict.h
  • fib.cpp
  • pronounce_dict.cpp
  • pronounce_dict.h
  • cartalk_puzzle.cpp
  • common_words.cpp
  • common_words.h

All other files including any testing files you have added will not be used for grading.

Good Luck!