lab_btree

Belligerent BTrees

Assignment Description

In this lab you’ll learn about BTrees and how they can be used to implement the dictionary ADT. Specifically, you’ll learn about the algorithms involved in finding keys in and inserting keys into a BTree. In addition to the algorithms, you’ll see why BTrees are a useful structure and a potentially good alternative to self-balancing binary search trees.

Lab Insight

B-Trees are a powerful data structure for distributed data storage. Access can be optimized by splitting nodes across a distributed system, and data structures like the B-Tree are even used to build databases (applications specifically designed to store and look up massive amounts of data rapidly). To learn more about how B-Trees are used in databases, consider CS 411, a database course that covers this topic.

Checking Out the Code

All assignments will be distributed via our release repo on github this semester. You will need to have set up your git directory with our release repo as a remote, as described in our git setup instructions.

You can merge the assignments as they are released into your personal repo with

git pull --no-edit --no-rebase release main --allow-unrelated-histories
git push

The first git command will fetch and merge changes from the main branch of the remote repository named release into your personal repo. The --no-edit flag automatically generates a commit message for you, and the --no-rebase flag will merge the upstream branch into the current branch. Generally, these two flags shouldn’t be used, but they are included here for ease of merging assignments into your repo.

The second command will push to origin (your personal), which will allow it to track the new changes from release.

You will need to run these commands for every assignment that is released.

All the files for this lab are in the lab_btree directory.

Preparing Your Code

This semester for MPs we are using CMake rather than just make. This allows us to use libraries such as Catch2 that can be installed on your system rather than being provided with each assignment. This change does mean that for each assignment you need to use CMake to generate your own makefiles. To do this, run the following in the base directory of the assignment, which for this assignment is the lab_btree directory.

mkdir build
cd build

This first makes a new directory called build in your assignment directory (this is where you will actually build the assignment) and then moves into that directory. The build directory is not included in the provided code since we are following industry-standard practices, and you would normally exclude the build directory from any source control system.

Now you need to actually run CMake as follows.

cmake ..

This runs CMake to initialize the current directory, which is the build directory you just made, as the location to build the assignment. The one argument to CMake here is .., which refers to the parent of the current directory, in this case the top of the assignment. That directory has the files CMake needs to set up your assignment to be built.

At this point you can run make in the build directory, as described, to build the various programs for the MP.

The BTree Class and Friends

You’ll find the definition of the BTree class in btree.h. Just like AVLTree, BTree is generic in that it can have arbitrary types for its keys (assuming they can be compared) and values. There are two subclasses in BTree: DataPair and BTreeNode. A DataPair is essentially just a key-value pair, which is useful since our BTree implements dictionary functionality. The BTreeNode is a node in the BTree. It contains two vectors: elements and children. elements holds the data in the node, and children holds pointers to the node’s child nodes. Note that DataPairs can be compared with <, >, and == based entirely on their keys (the functions are defined in the DataPair class).

Since BTreeNode uses vector pretty heavily, it would probably be a good idea to keep a reference open while working with them (in particular, vector’s insert and assign functions will probably prove useful).
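
To make that description concrete, below is a rough sketch of how the pieces fit together. It is only an illustration of the layout described above: the real definitions live in btree.h and may differ in names and details (for example, whether a node tracks leafness with an is_leaf flag, as assumed here, or by checking whether children is empty).

#include <vector>

/* Illustrative sketch only; consult btree.h for the actual definitions. */
template <class K, class V>
class BTree
{
    /* A key/value pair; <, >, and == compare based only on the key. */
    class DataPair
    {
      public:
        K key;
        V value;
    };

    /* A node of the tree. elements holds the node's DataPairs in sorted
     * order; children holds pointers to its child nodes. An internal node
     * with k elements has k + 1 children; a leaf has none. */
    class BTreeNode
    {
      public:
        bool is_leaf;
        std::vector<DataPair> elements;
        std::vector<BTreeNode*> children;
    };

    unsigned int order;
    BTreeNode* root;
};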

The Doxygen for the BTree class is here.

The insertion_idx Function

Doxygen for this function is here.

The elements of a BTreeNode are meant to be kept in sorted order so that they can be searched with a binary search to quickly find the key in the node (if it exists) or the proper child to explore. There’s a utility function in btree.h called insertion_idx which serves this purpose. It is not specific to BTree; rather, it takes an arbitrary vector and an arbitrary value and finds the index at which the value can be inserted such that the vector remains in sorted order. You should be able to safely use operator<, operator>, and operator== inside this function when comparing the parameter value to the elements in the vector. Thus, I should be able to do something like this to insert 5 into the position that keeps sorted_vec sorted:

/* sorted_vec is a sorted vector of ints */
size_t insert_idx = insertion_idx(sorted_vec, 5);
sorted_vec.insert(sorted_vec.begin() + insert_idx, 5);

You can test your function with provided test cases, or you can write your own testing functions for it somewhere. Note that because our entire tree is stored in memory, using an online algorithm such as a linear search is unnecessary and the more efficient binary search can be used.
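
If you want a starting point, a binary-search insertion_idx could look roughly like the sketch below. The template parameter names are made up for illustration (the actual declaration is in btree.h), and it assumes, as stated above, that operator< works between the vector’s element type and the value being searched for.

#include <cstddef>
#include <vector>

/* Sketch of a binary-search insertion_idx. Returns the first index at which
 * val could be inserted while keeping elements sorted; if val is already
 * present, this is its index. */
template <class T, class C>
size_t insertion_idx(const std::vector<T>& elements, const C& val)
{
    size_t lo = 0;
    size_t hi = elements.size();
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (elements[mid] < val)
            lo = mid + 1;   /* val belongs strictly to the right of mid */
        else
            hi = mid;       /* elements[mid] >= val, so mid is still a candidate */
    }
    return lo;
}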

The find Function

The Doxygen for this function is here.

Searching a BTree is very much like searching a BST. The main difference is that instead of only having two possible children to explore, we have order-many potential children. As with a BST, it is natural to write find recursively. For a given level of the recursion, the steps are as follows. First you need to do a search on the elements using insertion_idx (linear search is acceptable for this assignment, but binary search is used in professional implementations; if you choose binary search, remember that elements should always be stored in sorted order). If the key at the returned index is the one we are looking for, we are done (since keys in the tree are unique) and return; otherwise, we didn’t find the key in the current node. If the current node is a leaf node, then we’ve run out of nodes to explore and we are also done. Otherwise, we need to call find recursively on the proper child of the current node, which can be chosen using the return value of insertion_idx.
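
Putting those steps together, a recursive find could look roughly like the sketch below. The signature, the is_leaf check, returning a default-constructed V when the key is missing, and passing a bare key to insertion_idx (which assumes a DataPair can be compared against a key, as the insertion_idx description suggests) are all assumptions made for illustration; match whatever btree.h actually declares.

/* Sketch of the recursive find described above; adjust to the real
 * declaration in btree.h. */
template <class K, class V>
V BTree<K, V>::find(const BTreeNode* subroot, const K& key) const
{
    /* Index of the first element >= key; this is also the index of the
     * child to explore if the key is not in this node. */
    size_t first_larger_idx = insertion_idx(subroot->elements, key);

    /* The key is in this node: we are done. */
    if (first_larger_idx < subroot->elements.size()
        && subroot->elements[first_larger_idx].key == key) {
        return subroot->elements[first_larger_idx].value;
    }

    /* A leaf with no match means the key is not in the tree. */
    if (subroot->is_leaf) {
        return V();
    }

    /* Otherwise only one subtree can contain the key; recurse into it. */
    return find(subroot->children[first_larger_idx], key);
}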

Below are a few examples of find on a small, order-3 BTree.

The initial tree:

We search for 39. We start at the root, call insertion_idx, which should return 2, since that’s the index in the root node that 39 could be inserted at to maintain sorted order. Since this isn’t a valid index in the current node, we know we need to explore a child. We also know that 2 is conveniently the index of the child pointer (the bar) we want to explore.

We call insertion_idx on this node and find the key we are looking for.

Now let’s try finding something that isn’t in the tree, -2. We use insertion_idx to find the place we would insert -2, which is index 0. This is also the index of the child we want to explore.

We call insertion_idx on this node and find that the key is not in it. Since this node is also a leaf node, the key isn’t in the tree.

The split_child Function

The Doxygen for this function is here.

split_child is used as part of the insertion process in a BTree to “fix” the tree. When a node becomes “too full”, i.e. when its number of elements reaches the tree’s order, we have to split the node into two nodes and move the median element up into the parent node. The process should become clearer with the examples below (combined with the insert examples). When actually implementing this function you will probably want to use vector’s assign function.
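
To make the splitting process concrete, a sketch is below. The signature, the BTreeNode constructor used to create the new sibling, and the particular choice of median index are all assumptions made for illustration, so check btree.h (and the provided tests) for what is actually expected, especially for trees of even order.

/* Sketch of split_child: splits parent->children[child_idx] around its
 * median element, pushes the median up into parent, and wires in the new
 * right sibling. Assumes a BTreeNode(is_leaf, order) constructor exists. */
template <class K, class V>
void BTree<K, V>::split_child(BTreeNode* parent, size_t child_idx)
{
    BTreeNode* child = parent->children[child_idx];

    /* Median index; this choice is an assumption, and for even orders there
     * are two middle elements to pick from. */
    size_t mid = child->elements.size() / 2;

    /* The new right sibling gets everything strictly to the right of the
     * median: the elements after it and the child pointers to its right. */
    BTreeNode* right = new BTreeNode(child->is_leaf, order);
    right->elements.assign(child->elements.begin() + mid + 1,
                           child->elements.end());
    if (!child->is_leaf) {
        right->children.assign(child->children.begin() + mid + 1,
                               child->children.end());
    }

    /* The median moves up into the parent, just after the pointer to child. */
    parent->elements.insert(parent->elements.begin() + child_idx,
                            child->elements[mid]);
    parent->children.insert(parent->children.begin() + child_idx + 1, right);

    /* The original child keeps everything to the left of the median. */
    child->elements.erase(child->elements.begin() + mid,
                          child->elements.end());
    if (!child->is_leaf) {
        child->children.erase(child->children.begin() + mid + 1,
                              child->children.end());
    }
}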

The insert Function

The Doxygen for this function is here.

You have to write the recursive helper function insert (the public one is already written for you). Insertion into a BTree always happens at the leaf nodes; inserting involves finding the proper leaf node (recursively) that the key belongs in, and then “fixing” the tree on the way back up the recursive calls.

As with find, first you should do a search on the current node being explored. If the key we are trying to insert exists in the current node, then there’s no more work to be done (don’t replace the value associated with the already existing key).
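
Putting this together with split_child, the recursive helper could look roughly like the sketch below, under the same assumptions as the earlier sketches (signature, is_leaf flag, and an order data member); the splitting on the way back up is walked through in the example that follows.

/* Sketch of the recursive insert helper; check btree.h for the actual
 * declaration. Splitting the root is handled by the public insert. */
template <class K, class V>
void BTree<K, V>::insert(BTreeNode* subroot, const DataPair& pair)
{
    size_t first_larger_idx = insertion_idx(subroot->elements, pair);

    /* The key already exists: leave the stored value alone. */
    if (first_larger_idx < subroot->elements.size()
        && subroot->elements[first_larger_idx].key == pair.key) {
        return;
    }

    if (subroot->is_leaf) {
        /* New data always lands in a leaf, at its sorted position. */
        subroot->elements.insert(subroot->elements.begin() + first_larger_idx,
                                 pair);
    } else {
        /* Recurse into the child that should hold the key, then fix that
         * child on the way back up if the insertion made it too full. */
        BTreeNode* child = subroot->children[first_larger_idx];
        insert(child, pair);
        if (child->elements.size() >= order) {
            split_child(subroot, first_larger_idx);
        }
    }
}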

Below is an example that illustrates the insertion/splitting process. Let’s start with 13.

We find the child we want to insert into in the same way as find. We have to verify that the key isn’t in the current node and then recursively call insert until we hit a leaf node to insert into.

We use insertion_idx at the leaf to find which spot the key belongs in.

Now let’s do an insert that requires a split; let’s insert 25.

The process is the same as with find: we use insertion_idx to find the child to recursively explore.

We insert 25 into the place it belongs in the leaf, and we are done at this level of recursion. After the return (i.e. when we are back at the node containing 8, 20) we check the size of the node we just inserted into. We see that its size is equal to order, so we have to split it (using split_child). First we find the median of the node we are splitting (the element 25) and put it at the proper place in the 8, 20 node. We have to create a new node containing everything to the right of 25 (including the child pointers to the right of 25). Similarly, the left node contains everything to the left of 25. We have to make sure to add the new right child pointer to the children of the 8, 20 node.

At this point we notice that the root has to be split.

The median of the current root (20) gets put into a new root node, and its two children are the 8 node and the 25 node.

Take note of the child indices that ended up in the two children nodes. 20 had index 1 in the original node. The child pointers that ended up in the left node had indices 0 and 1, and the child pointers that ended up in the right node had indices 2 and 3. Note that for trees of odd order this ends up working out very nicely, since there’s only one choice of median. You have to be very careful though, since this is not the case for even orders.

Testing Your Code

As always, there are provided tests for the lab. Run them like you normally would:

make test
./test

There is also an executable called test_btree, made by running:

make test_btree

If you run it with no parameters it will default to some basic tests with sequential and random integer data. It also can take parameters:

USAGE: test_btree ORDER N
Tests N inserts and N finds on a BTree<int, int> of order ORDER.

Additionally, note that BTreeNodes can be printed (see here). E.g. if I’m in find I can do something like:

cout << *subroot << endl;

Grading Information

The following files are used in grading:

  • btree.h
  • btree.hpp

All other files, including any testing files you have added, will not be used for grading. Remember that submissions must be done through Prairielearn!

Race Against std::map

So you made this BTree, now how good is it really? Well, you can find out by using the dict_racer executable. The usage is as follows:

USAGE: dict_racer ORDER N STEP RANDOM INSERTS FINDS
Runs a race between a BTree<int, int> of order ORDER against an
std::map<int, int> for N inserts / finds. Outputs CSVs into "results".
ORDER specifies the order of the BTree
N specifies the max number of insert / finds to do
STEP specifies the intervals to split N into. E.g. N = 10, STEP = 2 will make
points for 2 operations, 4 operations ... &c.
RANDOM specifies whether the data should be random or sequential.
INSERT specifies whether to benchmark the inserts.
FINDS specifies whether to benchmark the finds.

Results can be plotted with the simple python script generate_plot.py, e.g.

python3 ../scripts/generate_plot.py results/*.csv

For example, the image below was produced with these commands on my laptop:

make dict_racer
mkdir results
./dict_racer 64 200000 1000 true true true
python3 ../scripts/generate_plot.py results/*.csv

std::map in this case is implemented with a form of self-balancing binary search tree called a red-black tree. As you can see, the difference in performance widens as we increase \(n\). I would guess that this is mostly because the BTree keeps more of the data it’s processing in cache most of the time, which is faster than loading each key-value pair from memory during a traversal (which is what a binary search tree needs to do).

It’s worth experimenting with different orders, values for \(n\), and step values. I wouldn’t recommend testing on EWS remotely as they’re shared machines and performance numbers are mostly meaningless. Also note that the Python script is very simple, so only keep CSVs around that have different values for order and step size, otherwise the plot will look pretty weird. Alternatively you can just remove the results directory each time.