Learning Objectives
- Practice data parsing for authentic genomic data types
- Explore more data science applications for list indexing and 2D matrices
- Implement multiple forms of common data analyses using dictionaries and sets
- Build your own data processing pipeline using a collection of functions
- Visualize different labeling schemes for cancer vs normal genomic datasets
Getting set up
Using the PrairieLearn workspace, test and save your solutions to the following exercises. (You are welcome to download and work on the files locally, but they must be re-uploaded or copied over to the workspace for submission.) If you choose to work locally, you are strongly encouraged to keep an active Git repository of your work.
If you want to create a very large animation, you are encouraged to run your animation code locally.
In Part 1, you will be implementing a FASTA-file reader and cleaner which will be relevant to Part 2.
The Part 1 functions are:
reads_from_fasta()
kmer_from_string()
remove_bad_data()
kmers_from_file()
For Part 2, you will implement data analysis functions for sets of kmers (the output of Part 1) as well as a ground truth ‘true label’ taken from metadata.
The Part 2 functions are:
count_matches()
jaccard_sim()
build_similarity_matrix()
label_by_threshold()
get_threshold_labels()
get_true_labels()
For Part 3, you will use the functions from Parts 1 and 2 to perform a full data analysis / labeling of a randomly assigned dataset.
Assignment Description
Creating a data processing pipeline for a large data collection is a very common task in both research and industry. Here you are tasked with creating a computational genomics pipeline which will take as input two authentic (but simplified) genomic data formats and ultimately determine which of the provided samples are ‘cancer’ and which are ‘normal’ (or healthy).
Part 1: Data Parsing and Data Cleaning (FASTA Files)
In Part 1 of this mini-project, you are tasked with creating several functions designed to process and clean two types of fundamental genomics datasets. These functions build off each other, and you are strongly encouraged to use earlier functions to help complete later ones. You may find several of these functions useful in Part 2!
Although you do not need to know any biology to complete this mini-project, the input sequences you will be dealing with are nucleotides – a four letter alphabet consisting of the characters ‘T’, ‘C’, ‘G’, and ‘A’. Unfortunately, as is often the case in the real world, your input data is not perfect and may contain the character ‘N’. This common convention indicates that the sequencer could not precisely determine the actual character at the given position.
To assist you in debugging, there are several examples of each format in the directory data/. As this data is synthetic (but based on real-world data formats), when debugging you can use the provided file generator to recreate the specific files being randomly generated in the autograder.
FASTA File Format
The FASTA file format is among the most common genomics data formats and stores either nucleotide or protein sequences. FASTA files store a collection of sequences with a single ‘header’ line for each sequence. The header line starts with a greater-than symbol (>) and contains information like the sequence name, species, or any other relevant metadata. All other lines (until the next ‘>’) are a part of the same sequence and need to be stitched back together into one large string!
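For example, a small hypothetical FASTA file with two sequences might look like:

>sequence_1 example metadata here
TCGAGCAAGCTTACG
GGCATTA
>sequence_2 more metadata
AACCGTTAGN

Here the first sequence is the two lines after the first header stitched together ('TCGAGCAAGCTTACGGGCATTA'), and the second sequence is 'AACCGTTAGN'.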
There are two main classes of FASTA files – those that encode genes and those that encode sequencing studies.
- Gene files store a single large sequence string which is guaranteed to be correct / accurate [for the purposes of this mini-project]. The majority of the gene files provided use real gene sequences that have been truncated to drastically reduce their size! The actual gene sequences are 30-60x larger than the ones we are using.
- Study files store large collections of ‘short sequencing reads’. There are biological / engineering reasons why data is collected in this way, but it is sufficient to know that unlike gene files, these contain many headers and can contain sequencing errors! The sequencing studies presented here are entirely synthetic and contain only a mixture of the provided gene files. An actual sequencing study would contain tens to hundreds of millions of sequences, with each sequence being larger than the ones provided!
The following functions should work on both kinds of FASTA files (as the only difference is the number of header lines / size of each sequence):
- ‘data/genes/’ contains real human reference genes downloaded from https://www.ncbi.nlm.nih.gov/! Each gene is stored with a single header line and 70 nucleotides (TCGA) per line. These files will test your ability to stitch together one sequence from multiple lines.
- ‘data/studies/’ contains fake sequencing studies which are built to mimic either a cancer or normal dataset in a format matching what you would see from https://www.ncbi.nlm.nih.gov/sra. These files will test your ability to process multiple sequences.
NOTE: The autograder will primarily run on a dataset ‘tinydata’ which mirrors the structure of ‘data’. This is because even the significantly truncated dataset can time out the autograder if you are even moderately inefficient! You are encouraged to test on ‘tinydata’ and then ‘data’ in preparation for Part 3!
reads_from_fasta()
# INPUT:
# inFile, a string storing the filename of interest
# OUTPUT:
# A list of strings containing all sequences in the fasta (but none of the metadata)
# There should be one sequence per header line in the file.
# Remember each sequence may be stored over multiple lines! Your code should be able
# to 'stitch' together multiple lines of text into a single sequence and repeat this
# process for every header in the file.
def reads_from_fasta(inFile):
Given a FASTA-formatted file, produce a list of every sequence in the file. To do this, consider carefully how you will handle the following:
- The information in the header lines is not relevant, but you must be able to detect header lines.
- The actual sequences can be split across multiple rows of the input text file. Your solution must combine an unknown number of rows of text into a single sequence and repeat this process for every sequence (one per header line).
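Putting those two points together, here is a minimal sketch of one possible approach (the accumulator pattern and variable names are illustrative, not required):

def reads_from_fasta(inFile):
    sequences = []
    current = ''  # accumulates the lines of the sequence currently being stitched
    with open(inFile) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                # New header found: save the finished sequence (if any) and reset
                if current:
                    sequences.append(current)
                current = ''
            else:
                current += line  # stitch this line onto the current sequence
    if current:  # don't forget the final sequence in the file
        sequences.append(current)
    return sequences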
kmer_from_string()
# INPUT:
# inString, a string storing the sequence being broken into kmers
# k, an integer storing the length of each substring being parsed
# OUTPUT:
# A set of strings containing all kmers in the input string
# Each kmer is a substring with a fixed size k and there should be
# one kmer for every starting position in inString which can generate
# a full size kmer.
def kmer_from_string(inString, k):
Given a string storing a sequence of interest and an integer k, return a Python set containing all kmers in the sequence. A kmer is a k-length subsequence formed by shifting the start position across the full sequence until all k-length subsequences have been processed.
As an example, for the input sequence TCGAGCAAGC and k=3, the kmers would be the strings {'TCG', 'CGA', 'GAG', 'AGC', 'GCA', 'CAA', 'AAG'}. Visually it looks like:
TCGAGCAAGC
TCG
 CGA
  GAG
   AGC
    GCA
     CAA
      AAG
       AGC
Note: ‘AGC’ is not stored twice despite being present twice – because our output is a Python set!
Note: ‘GC’ and ‘C’ (the last two start positions) are not stored because they are not k-length! Only k-length sequences are stored!
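A minimal sketch using string slicing (one possible approach among several):

def kmer_from_string(inString, k):
    kmers = set()
    # The last valid start position is len(inString) - k, so range() stops there
    for i in range(len(inString) - k + 1):
        kmers.add(inString[i:i + k])
    return kmers

With the example above, kmer_from_string('TCGAGCAAGC', 3) produces the seven-element set shown (the duplicate 'AGC' collapses automatically because sets store unique items).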
remove_bad_data()
# INPUT:
# stringList, a list of strings containing gene sequences
# OUTPUT:
# A list of strings containing only gene sequences with no 'N' characters
# This function should NOT modify the original list
def remove_bad_data(stringList):
You cannot always assume that the input data is 100% correct – especially when running a bulk analysis over many files! Here you are tasked with implementing the simplest form of data cleaning: identifying bad sequences (those which contain the invalid character ‘N’) and removing each bad string from the input dataset!
More simply, given an input list of strings, return a new list of strings containing only the gene sequences with no ‘N’ characters. Be careful NOT to edit the input list; instead, make a new list containing only the good sequences.
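Since this is a pure filtering step, a list comprehension is one natural fit (a sketch, not the only valid approach):

def remove_bad_data(stringList):
    # Building a brand-new list leaves the caller's list untouched
    return [seq for seq in stringList if 'N' not in seq]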
kmers_from_file()
# INPUT:
# inFile, a string storing the filename of interest
# k, an integer storing the length of each substring being parsed
# OUTPUT:
# The set of unique kmers parsed from inFile after appropriate data cleaning
# steps are performed (removing all sequences with 'N' in them)
def kmers_from_file(inFile, k):
Using the above functions, write a combined data parsing pipeline that will take as input a FASTA-format file of interest and a kmer length k, and return the combined set of kmers found in this file. This is commonly referred to as treating an input file as a ‘bag of words’ (or the ‘bag of words model’) because we have thrown out all information about individual reads as well as any perceived order of our strings in favor of a simplified set representation.
To do this correctly, be sure to remove bad sequences BEFORE creating a set of kmers! We don’t want to ‘corrupt’ our dataset with any potentially bad sequences!
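A minimal sketch of the combined pipeline, reusing the three functions above:

def kmers_from_file(inFile, k):
    reads = reads_from_fasta(inFile)       # 1. parse every sequence in the file
    clean = remove_bad_data(reads)         # 2. drop sequences containing 'N' FIRST
    kmers = set()
    for seq in clean:
        kmers |= kmer_from_string(seq, k)  # 3. union the kmers of each clean read
    return kmers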
Part 2: Data Analysis
Given a collection of FASTA formatted files, your next task is to implement a number of different analysis methods for identifying the contents of each file, comparing different files to each other, and labeling sequencing studies as either ‘cancer’ (1) or ‘normal’ (0) using multiple approaches.
Bag of Words Analyses
These functions extend directly from Part 1 and treat each file as a ‘bag of words’ Python set of kmers. They will need to be completed before the final labeling functions as well as the graph outputs needed in Part 3.
count_matches()
# INPUT:
# dataSet, an input set of values storing the data of interest (the sequencing study)
# querySet, an input set of values storing the queries of interest
# OUTPUT:
# An integer counting the number of items in the querySet which are in the dataSet
def count_matches(dataSet, querySet):
Arguably the most fundamental data science question in genomics is whether or not a specific gene or expression pattern exists in a biological sample (study). There are many use cases for this, ranging from diagnosing diseases based on the presence or absence of key genes to determining gene function based on its relative expression levels across different cell lines or samples.
Here we will use the ‘bag of words’ model to accurately assess the contents of a sequencing study using the raw count of items in our dataset which match the query. More specifically, given an input set of kmers storing a sequencing study of interest and an input set of kmers storing our queries of interest, return the raw count of the querySet kmers which are found in the sequencing study.
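Because both inputs are Python sets, this can be a one-liner via set intersection (a sketch):

def count_matches(dataSet, querySet):
    # The intersection contains exactly the query items present in the data
    return len(dataSet & querySet)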
jaccard_sim()
# INPUT:
# s1, an input set of values
# s2, an input set of values
# OUTPUT:
# A float storing the Jaccard Coefficient between two sets
def jaccard_sim(s1, s2):
Given two input sets, return the Jaccard Similarity Coefficient between the sets. The Jaccard Coefficient is a float between 0 and 1 which represents the overall similarity between two sets. It is defined as the number of items in the intersection divided by the number of items in the union.
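A minimal sketch directly from that definition (the empty-set guard is my own assumption; the autograder may or may not test two empty sets):

def jaccard_sim(s1, s2):
    union = s1 | s2
    if not union:  # ASSUMPTION: treat two empty sets as 0.0 similarity
        return 0.0
    return len(s1 & s2) / len(union)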
build_similarity_matrix()
# INPUT:
# fileList, an input list of sequencing studies
# k, the kmer size each study should be processed with
# OUTPUT:
# A 2D Python matrix containing the full pairwise similarity matrix
# using the Jaccard Coefficient over kmer sets.
def build_similarity_matrix(fileList, k):
Given a list of files, each of which is a unique sequencing study, build a 2D Python matrix containing the Jaccard Coefficient for every possible pairwise comparison (including the self comparison). Both the row and column size should be equal to the number of files (a square matrix) and the order should match that of the input list.
Hint: If implemented correctly, the diagonal should always have a Jaccard similarity of 1.0 and the ‘upper diagonal’ (items in the triangle above and to the right of the diagonal) should be a mirror of the ‘lower diagonal’ (items in the triangle below and to the left of the diagonal).
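A minimal sketch that parses each file exactly once before the pairwise comparisons (re-parsing inside the double loop is a common source of timeouts):

def build_similarity_matrix(fileList, k):
    kmerSets = [kmers_from_file(f, k) for f in fileList]  # parse each study once
    n = len(kmerSets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            matrix[i][j] = jaccard_sim(kmerSets[i], kmerSets[j])
    return matrix

Since the matrix is symmetric, you could compute only one triangle and mirror it, but the straightforward double loop is easier to verify against the hint above.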
Classifier Analysis
Although Machine Learning is not a part of the curriculum for CS 277, it is a natural extension to many data analysis problems. Here we will explore a simple classification strategy and compare its accuracy to that of the actual ground truth!
label_by_threshold()
# INPUT:
# dataSet, a set of kmers present in an input file
# querySet, a set of kmers present in a query sequence
# threshold, a float with a value between 0 and 1
# OUTPUT:
# An integer denoting the label of the dataSet as either cancer (1) or normal (0)
# The input dataSet should be labeled as cancer based on the amount of query kmers
# which can be found in the dataset. If the fraction of query kmers present over
# total query kmers equals or exceeds the threshold, it is cancer!
def label_by_threshold(dataSet, querySet, threshold):
Logically, if a medical biopsy detects the presence of known cancer biomarkers (genes), it is a good indication that the patient in question has cancer. Based on this, one logical method for labeling a file as cancer is to determine whether at least X% of known cancer sequences are present in the sample. If so, the sample is very likely cancerous and should be labeled as such.
For those interested, this labeling strategy is the core concept behind a Sequence Bloom Tree!
Accordingly, given a dataSet (the kmers present in a sequencing study), a querySet (the kmers we are using as a proxy for ‘cancer’), and a threshold value, return an integer value which is ‘1’ for cancer or ‘0’ for healthy / normal.
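A minimal sketch built on count_matches() (assuming querySet is never empty):

def label_by_threshold(dataSet, querySet, threshold):
    fraction = count_matches(dataSet, querySet) / len(querySet)
    return 1 if fraction >= threshold else 0  # 'equals or exceeds' counts as cancer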
get_threshold_labels()
# INPUT:
# fileList, a list of strings storing the filenames / filepaths to sequencing studies being labeled
# querySet, a set of kmers present in a query sequence
# threshold, a float with a value between 0 and 1
# k, an integer storing the size of kmers to be processed for each study
# OUTPUT:
# A dictionary where keys are the study ID 'S{}{}{}' and the label is the value
# As a reminder for the keys:
# The input will likely contain both a folder path (e.g. 'data/studies/') as well as
# a file format extension (e.g. 'S{}{}{}.fasta') you will need to parse the string down
# As a reminder for the values:
# Each label is an integer either cancer (1) or normal (0) determined by a labeling scheme
# Each file should be labeled as cancer using label_by_threshold logic.
# In other words, if a sufficient amount of the query set is present in the file, it is cancer
def get_threshold_labels(fileList, querySet, threshold, k):
Given a collection of files by filename / filepath, a querySet, a threshold, and a kmer length, process and label each dataset, returning a dictionary containing the label for each sequencing study.
HINT: This function can be solved by heavily reusing previous functions you have implemented. If you are stuck, try to write out how the data processing pipeline should go.
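One possible sketch; note that os.path handles both the folder prefix and the ‘.fasta’ extension mentioned in the comment block above:

import os

def get_threshold_labels(fileList, querySet, threshold, k):
    labels = {}
    for path in fileList:
        # e.g. 'data/studies/S638.fasta' -> 'S638'
        studyID = os.path.splitext(os.path.basename(path))[0]
        kmers = kmers_from_file(path, k)
        labels[studyID] = label_by_threshold(kmers, querySet, threshold)
    return labels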
get_true_labels()
# INPUT:
# inFile, a string storing the full file path / file name for the metadata file
# cancerGenes, a list of strings storing the names of all cancer genes
# normalGenes, a list of strings storing the names of all normal / healthy genes
# OUTPUT:
# A dictionary where keys are the study ID 'S{}{}{}' and the label is the value
# Unlike the study-based labeling scheme, the metadata already has the correct study ID
# Each label is an integer either cancer (1) or normal (0) determined by a labeling scheme
# The ground truth label is the larger sum of read counts between cancerGenes and normalGenes
# On a tie, you should assign the label '1' for cancer.
def get_true_labels(inFile, cancerGenes, normalGenes):
Warning: This function requires you to work from a raw data input and parse and process it in a single function! You are strongly encouraged to break the problem down into steps, similar to how the FASTA format reader was broken down into a series of small steps with each piece tested separately!
As the datasets provided are synthetic – that is to say, generated through a secret Python script – there is a known ground truth answer for each sequencing study provided to you! However, rather than being told the answer outright, you will have to parse the ground truth for yourself from the provided metadata file!
The input file given in this function will always have the same format:
- Each row is a sequencing study, labeled by the study ID in the first column.
- Each column is a specific gene, labeled by the gene name in the first row.
- Each individual [row][col] pair is the exact number of reads which were pulled from that gene and added to the FASTA file defining that study.
How you parse and process this file is entirely up to you – including pulling from previous assignments seen in this class. To compute the ground truth classification, simply add up all the reads belonging to cancerGenes and all the reads belonging to normalGenes and pick whichever total count is larger!
An example metadata file (with two studies and 15 genes):
study_id MTP53 MCDH1 MKRAS MEGFR MBRCA2 MBRCA1 BRCA1 BRCA2 KRAS CDH1 TP53 EGFR NORM1 ANS42 NORM2
S638 343 249 279 479 462 244 42 84 71 73 74 38 184 138 124
S484 0 0 0 0 0 0 268 486 494 442 261 461 150 183 200
NOTE: Not every gene is required to be labeled as cancer vs normal! You should ignore (not count) any gene not provided as input. For example, if the above example file had ‘MTP53’ as cancer and ‘NORM2’ as normal, we would produce the following dictionary:
{'S638': 1, 'S484' : 0}
In detail: S638 has 343 ‘cancer’ reads and 124 ‘normal’ reads (labeled cancer). S484 has 0 ‘cancer’ reads and 200 normal reads (labeled normal).
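A minimal sketch, assuming the metadata is whitespace-delimited exactly as in the example above and that the gene names passed in match the header row (see the note at the end of Part 3 about the ‘.fasta’ extension):

def get_true_labels(inFile, cancerGenes, normalGenes):
    labels = {}
    with open(inFile) as f:
        geneNames = f.readline().split()[1:]  # header row, minus 'study_id'
        for line in f:
            fields = line.split()
            studyID = fields[0]
            counts = [int(x) for x in fields[1:]]
            cancer = sum(c for g, c in zip(geneNames, counts) if g in cancerGenes)
            normal = sum(c for g, c in zip(geneNames, counts) if g in normalGenes)
            labels[studyID] = 1 if cancer >= normal else 0  # ties count as cancer
    return labels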
TRIVIA: For those interested, this metadata file is a simplified version of another common genomics data format which is created using complex algorithms to estimate the expression levels for each gene. I specifically formatted this file to be similar to (but drastically smaller than) the gene expression files you can find on recount3! This massive data processing pipeline was created specifically to help people who are interested in doing data science analyses on large data collections but lack access to the computational hardware to complete large standardized analyses.
Part 3: Graph-Based Visualization of Cancer vs Normal
For the final part of the mini-project, use the various functions you have produced as well as the provided graph generation functions to analyze and label a randomly generated dataset, submitting visual representations of the ground truth as well as our threshold-based classifier. Be sure to label each graph with a clear title (an optional argument in the provided graph function)!
You will be randomly assigned a dataset from several distinct generated sets – your creative component is determining what querySet you will use as the input to the threshold classifier.
In line with this creative component, please include a brief description of what queries and what threshold you used and how you decided on both of them. While perfect accuracy is not required, part of your grade in this part of the mini-project is based on the logic behind your approach, and “I picked randomly” (or equivalent) will not receive credit. If you wrote a function to systematically explore the parameter space (possible inputs), you should describe the function at a high level, as in the sketch below.
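For instance, a hypothetical helper like the one sketched here sweeps a grid of thresholds and scores each against the ground truth (sweep_thresholds() is my own illustrative name, not a required function):

def sweep_thresholds(fileList, querySet, trueLabels, k):
    # Try evenly spaced thresholds and keep the most accurate one
    best = (0.0, 0.0)  # (accuracy, threshold)
    for t in [i / 20 for i in range(21)]:  # 0.0, 0.05, ..., 1.0
        predicted = get_threshold_labels(fileList, querySet, t, k)
        correct = sum(predicted[s] == trueLabels[s] for s in trueLabels)
        best = max(best, (correct / len(trueLabels), t))
    return best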
For the purposes of your analysis, note the following:
- The input dataset is a collection of fifteen sequencing studies found under ‘data/studies/’.
- The true label file is under ‘data/metadata/’.
- The sequences of the genes used to generate your studies are stored at ‘data/genes/’.
When relevant, use the following parameters:
- The kmer length should be 20 for all studies.
- ‘Cancer’ genes for the purpose of ground-truth metadata analyses are ['MTP53.fasta', 'MCDH1.fasta', 'MKRAS.fasta', 'MEGFR.fasta', 'MBRCA2.fasta', 'MBRCA1.fasta']
- ‘Normal’ genes for the purpose of ground-truth metadata analyses are ['BRCA1.fasta', 'BRCA2.fasta', 'KRAS.fasta', 'CDH1.fasta', 'TP53.fasta', 'EGFR.fasta']
Note: The lists here include the ‘.fasta’ extension, while the actual metadata uses bare gene names.