lab_parsing

Practical Parsing

Due: Sep 04, 23:59

Learning Objectives

  • Review fundamentals of Python I/O
  • Explore common string data formats (.txt and .csv)
  • Identify and account for errors in input datasets

Getting Set Up

Set up your EWS machine…

Every time you open a new terminal or SSH session to EWS, you will need to run the following:

module load python3/3.8.3

To avoid running this command every time, you can add it to your .bashrc file. All commands in .bashrc run automatically whenever you open a new terminal. The command below appends “module load python3/3.8.3” to the .bashrc file in your home directory.

echo 'module load python3/3.8.3' >> ~/.bashrc

… or set up your computer locally

Set Up Your Git Repository

From your CS 277 git directory, run the following:

git fetch release
git merge --allow-unrelated-histories release/lab_parsing -m "Merging initial lab_parsing files"

Upon a successful merge, your lab_parsing files are now in your lab_parsing directory.

Assignment Description

Each assignment in this class is built around one or more datasets. In this lab, you will be tasked with reading in input data from arguably the most common data type – the string. As the introductory lab, the focus here will be on the basics of reading text, parsing text into numbers, and storing or processing simple datasets with well-defined formats (but with the occasional ‘error’).

Part 1: String Data Basics

The string datatype may seem like the easiest to process, but it comes with difficult challenges in real-world settings. Are you interested in the total document as one large string? Do you need to break the document up into fixed-size substrings? Is there a ‘break’ character that separates out words? What is the alphabet being used and how much space is needed to encode a character? How can you tell if the input strings are correct – and if they aren’t, how can you correct them?

Many of these questions we will not be able to answer until much later in this course, but we can begin by covering the basics: how to read in a file line-by-line or with arbitrary breakpoints, and how to correct both when we have some pre-existing knowledge of what should be present in the file or what the format of the file should look like.

List[str] stringParseBreaks(string fileName, string bchar)
# fileName is the relative path of the string file being processed
# bchar is the break character which should be used to split the file into substrings
# The output list should contain the complete collection of substrings formed by "splitting" at the break characters.
# This includes potentially empty strings to the left or right of a break character in the file.
# You may assume the break character is always a single character
# You should NOT strip the whitespace for this function

The function should read the text from start to finish and return the collection of substrings formed by stopping before each break character and starting a new substring after it. This style of processing is more commonly referred to as splitting the string and is immensely important for parsing compressed or compact datasets, where separating based on a terminal character can be used to find values which are not fixed in size. But be aware that without further processing, you can often have empty strings in the resulting substring set. For example, when run on the provided data file parse1.txt with the break character "$", stringParseBreaks should return the following substrings:

> stringParseBreaks("parse1.txt","$")
["", "ABC", "CDE", " GGG", "AA\n", "1213"]

Note: You are allowed to use Python built-ins to do this assignment. Do not feel like you have to re-invent the wheel.
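As a minimal sketch of the splitting behavior described above (not a reference solution), the built-in str.split already keeps the empty strings on either side of a break character and leaves whitespace untouched. The helper names split_text and split_file are hypothetical, and the sample string is inferred from the expected output above rather than copied from the actual parse1.txt.

```python
from typing import List


def split_text(text: str, bchar: str) -> List[str]:
    """Split text at every occurrence of bchar, keeping empty pieces."""
    # str.split with an explicit separator never drops empty strings
    # and does not strip whitespace or newlines from the pieces.
    return text.split(bchar)


def split_file(file_name: str, bchar: str) -> List[str]:
    """Read the whole file as one string, then split it (hypothetical helper)."""
    with open(file_name, "r") as handle:
        return split_text(handle.read(), bchar)


# Assumed stand-in for the file contents, inferred from the example output.
sample = "$ABC$CDE$ GGG$AA\n$1213"
print(split_text(sample, "$"))
```

The leading "$" in the sample is what produces the empty string at the start of the list, and the newline stays attached to "AA\n" because nothing is stripped.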

List[str] stringParseLine(string fileName)
# fileName is the relative path of the file being processed
# The output list should contain each line (even empty lines) in the order they are read (top to bottom)
# To ensure full credit for this function, you should strip whitespace from both sides of each line.

The function should read the text from start to finish and return the collection of substrings formed based on line breaks in the text file. You are encouraged to either use Python built-ins to do this or use stringParseBreaks. However, unlike stringParseBreaks, each substring found should be stripped of whitespace. You can think of parsing by ‘line’ as a more complex version of parsing by break character, as we want just the text which defines a line and not any unnecessary formatting.
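To illustrate line parsing with stripping, here is a small sketch on an in-memory string. It compares two built-in options, and the function names and sample text are made up for illustration. Note that the two options disagree about a trailing newline: split("\n") yields a final empty string, while splitlines() does not. Which behavior the grader expects is not stated here, so check against the provided data.

```python
from typing import List


def strip_with_split(text: str) -> List[str]:
    """Split on '\n' and strip each piece; a trailing newline yields a final ''."""
    return [line.strip() for line in text.split("\n")]


def strip_with_splitlines(text: str) -> List[str]:
    """Use str.splitlines, which does not add a final '' for a trailing newline."""
    return [line.strip() for line in text.splitlines()]


# Hypothetical sample: padded text, an empty line, a tab, and a trailing newline.
sample_lines = "  first line \n\n\tsecond line\n"
print(strip_with_split(sample_lines))
print(strip_with_splitlines(sample_lines))
```

In both versions the empty line in the middle is preserved as an empty string, which matches the requirement to keep empty lines in order.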

Part 2: String Data Error Correction

Many common string datatypes such as written languages or genomic sequences are rife with errors and can cause significant problems if not handled appropriately. To make matters worse, it is often unclear if the data truly contains an error or if there is simply an unexpected or outlier result. Correctly modeling and correcting these sorts of ambiguous situations is one of the hardest problems (and in many instances is simply unsolvable without external data or simplifying assumptions). Here, we will explore a fundamental form of error detection and error correction on strings: a situation where we know (1) all possible input values and (2) the exact format of the input data and can safely remove those which are not in our curated list.

For example, let’s say you are teaching a large computer science class and are trying to determine final grades at the end of the semester. To accomplish this, you have a comma-separated values (CSV) file for each student’s grades throughout the course. Files of this type are named gradesX.csv. Unfortunately this file seems to contain far too many students and you realize that the software you used to track grades did not understand students who are auditors or students who have dropped the class. Accordingly, you now have to “clean up” the grade file to get the average grades for the class.

Given the limited time frame, you decide to try two approaches to correcting your data. The first uses your best estimate of the students in the class, which you’ve recorded as a line-separated list. Files of this type are named rosterX.txt. The second detects students who are missing at least three grades. In both cases, the average grade is calculated by summing all assignments for each student and then averaging these total scores across the class. To help you parse the data, the first value in each row of the grades CSV will always be the ID of an individual student, and you may assume the IDs are unique (no repeats in the file). All other values in the grades file are individual assignments, and your solution must work for an arbitrary number of assignments. Note that although the return type is a float, you should truncate all decimal places when returning the final grade. That is to say, if the average grade were 899.9, the final grade returned should be 899.0.

To provide some clarity on the averaging calculation, the easiest way to compute the average is to (somehow) sum together all grades from all valid students and then divide by the total number of valid students in the class. The number of assignments has no bearing on the final average as it is simply a sum of all scores for the course.
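The averaging and truncation steps can be sketched on a tiny made-up dataset. The student IDs and scores below are hypothetical, and math.trunc is just one of several ways to drop the decimal places while still returning a float.

```python
import math
from typing import Dict, List

# Hypothetical scores for three valid students (not from any provided file).
scores_by_student: Dict[str, List[float]] = {
    "a": [90, 85, 100],
    "b": [70, 95, 80],
    "c": [88, 92, 79],
}

# Sum every grade from every valid student, then divide by the student count.
grand_total = sum(sum(scores) for scores in scores_by_student.values())
average = grand_total / len(scores_by_student)

# Truncate (do not round) the decimal places, but keep the float type.
truncated = float(math.trunc(average))
print(average, truncated)
```

Here the totals are 275, 245, and 259, so the average is about 259.67 and the truncated result is 259.0 rather than the rounded 260.0.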

float gradesByRoster(string gradesFile, string rosterFile)
# gradesFile is the relative path of the exam grades file, stored as a CSV
# rosterFile is the relative path of the roster file, which contains a line-separated list of student IDs
# Here valid students are all (and only) students in the rosterFile.

When using the roster as a means of detecting “errors”, you simply need to exclude any students not on the roster from the grading calculations. Also be sure to use the number of registered students rather than the total students when computing your final average.
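One possible way to filter rows by roster membership is sketched below using the standard csv module on in-memory text. Everything in it is an assumption for illustration: the sample data is made up, the grades are assumed to have no header row, and blank cells are simply skipped when summing. It shows only the filtering step, not a full gradesByRoster.

```python
import csv
import io
from typing import Dict

# Hypothetical file contents; "bob" is in the grades but not on the roster.
grades_text = "alice,90,80\nbob,70,60\ncarol,100,95\n"
roster_text = "alice\ncarol\n"

# A set gives fast membership checks; blank roster lines are ignored.
roster = {line.strip() for line in roster_text.splitlines() if line.strip()}

totals: Dict[str, float] = {}
for row in csv.reader(io.StringIO(grades_text)):
    if not row:
        continue  # skip completely empty lines
    student_id, scores = row[0].strip(), row[1:]
    if student_id in roster:
        # Assumption: a blank cell contributes nothing to the total.
        totals[student_id] = sum(float(s) for s in scores if s.strip())

print(totals)
```

With real files, io.StringIO(grades_text) would be replaced by an open file handle passed to csv.reader.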

float gradesByAssignments(string gradesFile)
# gradesFile is the relative path of the exam grades file, stored as a CSV
# Here valid students are those who have two or fewer missing grades in the course

When using the number of missing assignments as a means of detecting “errors”, you should look for CSV entries where at least three of the assignment entries are blank. As these students have likely dropped the course, their missing grades are likely to be in sequence at the end of the grading run, but this is not guaranteed. As before, make sure you keep an accurate tally of the number of valid students rather than averaging over the total number of students.
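Counting blank entries per row can be sketched as follows. The sample data and the helper name count_missing are hypothetical, a "blank" entry is assumed to be an empty or whitespace-only cell, and no header row is assumed. Because blanks are counted anywhere in the row, the sketch does not rely on missing grades being consecutive or at the end.

```python
import csv
import io
from typing import List


def count_missing(scores: List[str]) -> int:
    """Count cells that are empty or contain only whitespace."""
    return sum(1 for cell in scores if cell.strip() == "")


# Hypothetical rows: bob is missing three grades, carol only two.
grades_text = "alice,90,80,70,60\nbob,70,,,\ncarol,,95,,88\n"

valid_ids: List[str] = []
for row in csv.reader(io.StringIO(grades_text)):
    if not row:
        continue  # skip completely empty lines
    # Two or fewer missing grades keeps the student; three or more drops them.
    if count_missing(row[1:]) <= 2:
        valid_ids.append(row[0].strip())

print(valid_ids)
```

Note that csv.reader returns an empty string for each blank field, including the trailing ones in bob's row, so trailing commas are counted correctly.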

Hint: Keep an eye out for edge cases in data parsing and analysis. The invalid data points are clearly defined here – either three (or more) missing grades or someone who is not on the roster. Be sure your solution can handle all the remaining data points.