lab_parsing

Practical Parsing

Due: Feb 06, 23:59

Learning Objectives

  • Review fundamentals of Python I/O and strings
  • Practice using multi-dimensional lists
  • Explore common string data formats (.txt and .csv)

Submission Instructions

Using the PrairieLearn workspace, test and save your solutions to the following exercises. (You are welcome to download and work on the files locally, but they must be re-uploaded or copied over to the workspace for submission.)

Assignment Description

In this lab, you are tasked with reading in input data by filename to produce both one-dimensional and two-dimensional lists. This lab also requires knowledge of Python string methods such as strip() and split(); you are strongly encouraged to familiarize yourself with them to make your life easier.

An input text file may seem easy to process compared to many other data formats, but it comes with its own difficulties. Are you interested in the whole document as one large string? Do you need to break the document up into fixed-size substrings? Is there a ‘break’ character that separates words?
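The three questions above correspond to three different ways of viewing the same text. As a rough illustration only (the sample string and the chunk size of 4 are made up for this sketch and are not part of the lab files):

```python
# Made-up sample text; not one of the lab's input files.
text = "red green blue"

# 1. The whole document as one large string.
whole = text

# 2. Fixed-size substrings (a chunk size of 4 is an arbitrary choice).
chunk_size = 4
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# 3. Substrings separated by a 'break' character (here a single space).
words = text.split(" ")

print(whole)
print(chunks)
print(words)
```

This lab focuses on the third view, applied one line at a time.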

stringParseLine()

List[string] stringParseLine(string fileName):
# INPUT:
# fileName is the relative path of the file being processed (string)
# OUTPUT:
# A list containing the complete collection of substrings formed by splitting on line breaks.
# NOTE:
# The output list should contain each line (even empty lines) in the order they are read (top to bottom)
# To ensure full credit, you should strip whitespace from both sides of each line.

Using whatever built-in approach makes the most sense to you, read the given file line by line and return a list of its lines with a small twist: whitespace (spaces, tabs, newline characters) should be removed from both ends of each line. If you are not sure why you need to do this, just try printing some of the files without stripping! For example, ‘newlines’ are characters (\n), and they can really mess with the formatting of your output if you aren't careful!
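To see what stripping changes, here is a small sketch that compares raw lines with stripped lines. It writes its own throwaway file, so the file contents and the variable names are assumptions made for illustration, and it is only one of several built-in ways to read a file:

```python
import os
import tempfile

# Write a small throwaway file so the sketch is self-contained.
# The contents are invented for illustration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("  first line \n\n\tthird line\n")
    demo_path = tmp.name

# Iterating over an open file yields one line at a time, newline included.
with open(demo_path) as handle:
    raw_lines = [line for line in handle]

# strip() with no arguments removes spaces, tabs, and newlines from both ends.
stripped_lines = [line.strip() for line in raw_lines]

os.remove(demo_path)  # clean up the throwaway file

print(raw_lines)
print(stripped_lines)
```

Note that the empty middle line is kept as an empty string rather than dropped, which matches the requirement to include empty lines.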

stringParseLineBreaks()

List[List[string]] stringParseLineBreaks(string fileName, string bchar)
# INPUT:
# fileName is the relative path of the string file being processed (string)
# bchar is the break character (string)
# OUTPUT:
# A list of lists where each line in the file is parsed as a
# separate list by "splitting" each line at the break characters.
# NOTE:
# The output list should contain each line (even empty lines) in the order they are read (top to bottom)
# To ensure full credit, you should strip whitespace from both sides of each line *before* splitting.

The function should read each line of text while stripping whitespace, the same as stringParseLine()! In addition, you now need to break each stripped line into a list of substrings (making your return value a list of lists, i.e. a matrix).

Your substring list should be all of the substrings formed by treating the break character as a boundary. This style of processing is more commonly referred to as splitting the string and is immensely important for parsing comma-separated values or space-separated values files. As an example:

x="1111 2222 3,4,5,"

When split by " ": ["1111", "2222", "3,4,5,"]
When split by ",": ["1111 2222 3","4","5", ""]

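Note that splitting can yield empty strings, and your output should include them. In fact, this is more likely to happen because you strip whitespace from lines before splitting: a line containing only whitespace becomes an empty string, which splits into a list holding a single empty string.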
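The example above can be checked directly with Python's built-in split(). The sketch below also shows what happens to a blank line and how a list of already-stripped lines becomes a list of lists; the `lines` values are made up for illustration:

```python
x = "1111 2222 3,4,5,"

by_space = x.split(" ")
by_comma = x.split(",")  # the trailing comma produces a final empty string

# A whitespace-only line strips down to "", which still splits into one empty field.
blank = "   \n".strip().split(",")

# Hypothetical already-stripped lines, split row by row into a matrix.
lines = ["a,b", "", "c,d,"]
matrix = [line.split(",") for line in lines]

print(by_space)
print(by_comma)
print(blank)
print(matrix)
```

Passing the break character explicitly matters here: calling split() with no argument splits on runs of whitespace and silently discards empty strings.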

Note: You are allowed (and encouraged) to use Python built-ins to do this assignment. Do not feel like you have to re-invent the wheel.

matchingLines()

int matchingLines(int i, string fileName)
# INPUT:
# i is the index of the line whose matches we are counting (integer)
# fileName is the path of an input file consisting of strings separated by lines (string)
# OUTPUT:
# An integer count of the lines that match the line found at index i
# NOTE: There is always at least one matching line (line i always matches itself)

Read the input file line by line, stripping whitespace from each line. Then compare the line at index i against every line in the file (including itself) and return the total number of exactly matching lines. For example:

myFile:

A
B
A
B
A
C

Then indices zero, two, and four will return 3, and indices one and three will return 2. Index five will return 1 (as line i=5 only matches itself).
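The counting step can be sketched on an in-memory list that mirrors myFile. The helper name `count_matches` is hypothetical, and the sketch assumes the lines have already been read and stripped:

```python
# The stripped lines of myFile from the example, held in memory.
lines = ["A", "B", "A", "B", "A", "C"]

def count_matches(i, lines):
    """Hypothetical helper: count the lines that exactly equal lines[i]."""
    target = lines[i]
    return sum(1 for line in lines if line == target)

# One count per index, in file order.
counts = [count_matches(i, lines) for i in range(len(lines))]
print(counts)
```

Because the line at index i is compared against every line, including itself, the count can never be less than 1.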

sumColumns()

List[int] sumColumns(List[int] sumIndices, string file)
# INPUT:
# sumIndices is a list of integers corresponding to the columns to be summed on each row
# file is the path of the input comma-separated values file to be parsed (string)
# OUTPUT:
# A list of integers corresponding to the sum of the input columns at each row.
# NOTE: The output list should have exactly one entry per row of the file
# NOTE: You may assume every column you are asked to sum contains integers but
# should NOT assume that all csv columns are integers.

Given an input csv file, read it line by line, first stripping whitespace from each line and then splitting on commas. Then sum only the columns listed in sumIndices to produce one integer per row. As you parse the data, do not expect every column to hold an integer, though you may assume that every column listed in sumIndices does. In the following example, column 0 would never be passed as input, but all other column numbers are valid.

myFile:

Bob, 1, 2, 3
Sally, 2, 2, 2
George, 3, 1, 9

sumIndices = [1, 2, 3] -> [6, 6, 13]
sumIndices = [2, 3] -> [5, 4, 10]
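The row-by-row summation can be sketched on the example rows held in memory. The helper name `sum_selected` is hypothetical, and reading the rows from a file is left out of this sketch:

```python
# The rows of myFile from the example, as raw strings.
rows = ["Bob, 1, 2, 3", "Sally, 2, 2, 2", "George, 3, 1, 9"]

def sum_selected(sum_indices, rows):
    """Hypothetical helper: for each row, sum only the requested columns."""
    totals = []
    for row in rows:
        fields = row.strip().split(",")
        # Only the requested columns are converted; column 0 (a name) is never touched.
        # int() ignores surrounding whitespace, so " 1" converts to 1.
        totals.append(sum(int(fields[i]) for i in sum_indices))
    return totals

print(sum_selected([1, 2, 3], rows))
print(sum_selected([2, 3], rows))
```

Converting only the requested fields is what lets non-integer columns such as the names coexist with the numeric ones.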