lab_bst

Brilliant Binary Search Trees

Due: Oct 28, 23:59 PM

Learning Objectives

  • Practice storing information in a binary search tree
  • Use a data structure (BST) to implement another data structure (dictionary)
  • Practice handling (key, value) data pairs

Reminder: You can fill out an optional survey on Moodle after each lab!

Assignment Description

One method of analyzing human genomes is to quantify the relative expression levels of all known genes. However, not every person expresses every human gene and many human genes are not expressed under most conditions. In this lab, we will take advantage of this style of analysis to build binary search trees on different ‘individuals’ that we can compare directly.

In CS terms, we will be parsing an input file that contains a large list of key-value pairs, where our keys are the IDs of a gene, and the values are it’s relative expression in the genomic sample. For each individual (each file) we will then build a binary search tree over the gene names, use a traversal to identify all genes above a expression threshold, and compare the shared expression between individuals as a similarity metric.

NOTE: There are over 20,000 known human genes and – including variations- – a curated genome dataset can contain close to 50,000 different gene names. The datasets used here (and even the data format) are synthetic datasets generated randomly but based on real gene labels HGNC, which can be represented by a single number. Likewise the output format is a simplification of a real gene quantification tool Salmon.

Beware! Unlike most labs, it is (practically) mandatory to complete part 1 to complete part 2 and the tests for part 3 will call buildTree. Start early so you can get help if you need it!

Part 1: Binary Search Tree Insert

# INPUT:
# A int 'key' containing the numeric ID of the gene being inserted
# A float 'val' containing the expression of the gene
# OUTPUT:
# Nothing
None bst.insert(str key, float val):

Given the bst implementation discussed in class, your first task is to implement the insert function. If you get stuck, take a look at the provided code for both find and remove and try to work out how the code for insert can be coded.

Part 2: Building a bst from a tab separated file

# INPUT:
# A string 'inFile' containing the name of the input gene file
# A float 'cutoff' containing the minimum cutoff for the gene to be inserted
# OUTPUT:
# A bst object whose root is the root node of a tree containing all expressed gene key-value pairs
bst buildTree(str inFile, float cutoff):

Implement a function which, given a tab separated file of gene expression, will return a bst object that contains all genes that have an expression estimate greater than or equal to cutoff. The order that the values are inserted should be the order that they are listed in the file. The file format will always have the first row of the file as a label of each column (these columns are [in order]: the gene ID, the length of the gene, and it’s estimated expression level). The ‘key-value’ pair of interest is the gene ID as the key and the expression level as the value (the length is not used). The variable types for the input file are as follows:

  • geneID : An integer
  • length : An integer
  • Expression : A float

HINT: "\t" is the text symbol for a tab character.

NOTE: You my assume the input is always ‘well-formed’ (there are no missing values).

NOTE: If the height of the tree gets too large, the print function will not render properly on a terminal. If this situation occurs and you want to still visualize the tree, you can ‘write output to file’ by using the ‘>’ symbol. Ex: python3 bst.py > output.txt will store the output in output.txt so you can scroll through the tree.

Part 3: Use your dictionary to identify people at risk of genetic diseases.

# INPUT:
# A binary tree 'person' (generated from buildTree)
# A float 'cutoff' containing the minimum cutoff for the gene to be considered 'expressed'
# A list of integers, 'expressed' containing keys that must be expressed for the check to be True
# A list of integers 'notExpressed' containing keys that must not be expressed for the check to be True
# OUTPUT:
# A bool indicating whether or not the person matches the two-part query
# NOTE: This 'cutoff' does not have to be the same cutoff as the one used to build the tree.
bool checkGenome(bst person, float `cutoff`, list[str] expressed, list[str] notExpressed):

The susceptibility of an individual to many diseases (and the likelihood of many cancers) can be predicted based on the expression level (or lack of expression) of multiple genes simultaneously. In this function, you are tasked with using a bst you’ve created to check for the expression of a list of genes and the lack of expression of other genes. To simplify the problem, treat the cutoff as a universal value – any gene expression which is greater than or equal to cutoff is expressed, any gene which is less than that is not expressed. The function should return True only if all expressed genes are expressed and all non-expressed genes are not. Your function should still function if one of the two input lists is empty (ex: it should work for only expressed or only non-expressed checks)

NOTE: If you are interested in underlying biology here, the tested examples will once again be a toy system based on real world problems. In this case, the basis here is a systems-level understanding of biology with the general idea that most cancers don’t have a single cause but rather a series of small failures that don’t have obvious effects alone but when they occur together can cause catastrophic problems by destabilizing a key system in a cell.