Learning Objectives
- Practice data parsing and multi-dimensional lists in Python
- Introduce data visualization using matplotlib
- Explore a time-sensitive computational model (Cellular Automata)
- Modify and use an existing code base to perform data science
Getting set up
While most of the mini-project is autograded (and should be submitted like any regular assignment), PrairieLearn does not have the memory or processing power to do meaningful data science. Accordingly, when you submit part 2 of this assignment, you will do so by saving your dataset, your modified code, and your output to a GitHub repository.
For part 1, you will only need to submit the following on Prairielearn:
CellAuto.ipynb
You will only need to edit the following functions:
emptyMatrix()
getSize()
importValues()
update()
For part 2, you will need to submit:
- A short proposal detailing your dataset and expected visualization
- A link to your git repo
Assignment Description
Many very complex problems in the world can be approximated by modeling a phenomenon using some manner of computational or mathematical rules. These rules do not exactly mirror reality but can offer fantastic insight into the underlying behavior of a system or the effects of changes to it. For example, a model of a cell may have complex rules defining how different proteins are created, and scientists may use this model to guess good drug targets for a disease such as cancer. However, they are merely predicting in silico while doing so, and their drug may not work as intended or may have unforeseen side effects in the real world.
While many of these models are very advanced or rely on algorithms and data structures we have not yet discussed, a fundamental computational model can be created using only lists and a few simple rules. More to the point, in this mini-project you will be tasked with coding up several key functions for a time-sensitive state machine that is built around a 2-dimensional matrix. In other words, you will be practicing your skills with lists and multi-dimensional lists while creating a tool you can (and will) apply to visualizing datasets of your choice!
Part 1: The Cellular Automata Model (50 Points)
Cellular Automata refers to a large range of computational models built around a two-dimensional plane organized into ‘cells’ and rules that govern how these cells change over time. One of the more popular rulesets was created by mathematician John Conway in 1970 – ‘Conway’s Game of Life’. For the first part of your mini-project, you are tasked with implementing a number of key functions needed to load and visualize various patterns.
emptyMatrix()
#INPUT:
# An integer rows storing the number of rows
# An integer cols storing the number of columns
#OUTPUT:
# A list of lists forming a 2D matrix.
# The output must be either a numpy.ndarray or a built-in python nested list
List[List[int]] emptyMatrix(int rows, int cols):
The first required function, emptyMatrix(), should take as input a row and column number and produce an empty matrix of the appropriate dimensions. You may assume that all rows and columns will have the same length. In this instance, ‘empty’ cells should be given the value of 0 and live cells should have the value of 1. So this function essentially makes an ‘empty’ gameboard for future use. Your output can either be a numpy n-dimensional array (numpy.ndarray) or a built-in nested list (List[List[int]]).
HINT: You can do this manually using the nested loop and nested list logic we’ve seen in class. However, you are also allowed to use numpy, whose array implementation has a number of useful methods, some of which we have also seen in class.
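To make the hint concrete, here is one possible sketch using plain nested lists, with the numpy alternative shown as a comment; your own implementation may differ:

```python
def emptyMatrix(rows, cols):
    # Build each row as its own list so the rows do not alias one another
    # (avoid [[0] * cols] * rows, which repeats the same row object).
    return [[0 for _ in range(cols)] for _ in range(rows)]

# Equivalent numpy version, if you prefer returning an ndarray:
# import numpy as np
# def emptyMatrix(rows, cols):
#     return np.zeros((rows, cols), dtype=int)
```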
getSize()
#INPUT:
# A list of lists forming a 2D matrix (using a built-in list)
#OUTPUT:
# A list of size 2 storing the number of rows at index 0 and the number of columns at index 1
List[int] getSize(List[List[int]] matrix):
Given a matrix input as a list of lists (not a numpy array!), return a list containing two elements. The number of rows should be at index 0 and the number of columns should be at index 1.
HINT: Although the autograder will always pass in a built-in list, you can still use numpy helper methods. How might you convert a list from one type to another? Alternatively, this is also straightforward to write using built-in methods.
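For example, one possible approach using only built-in functions (assuming, as stated above, that every row has the same length) might look like the following sketch:

```python
def getSize(matrix):
    # Rows: length of the outer list. Columns: length of the first row,
    # guarding against an empty matrix.
    rows = len(matrix)
    cols = len(matrix[0]) if rows > 0 else 0
    return [rows, cols]
```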
importValues()
#INPUT:
# A list of lists forming a 2D matrix (using a built-in list)
# A string fName storing the absolute path to a space-separated text file
#OUTPUT:
# A list of lists forming a 2D matrix.
# The output must be either a numpy.ndarray or a built-in python nested list
List[List[int]] importValues(List[List[int]] matrix, string fName):
The input data format for this project is a space-separated text file storing a row and column coordinate pair for a single live cell on each line. For example, the ‘file’:
0 3
2 1
implies that there are two live cells, one located at row 0, column 3 in the matrix and the other located at row 2, column 1 in the matrix. Your code should read in the input file by its filename, parse every pair, and add them to the output matrix.
Remember: Live cells have the value 1, dead cells have the value 0.
WARNING: To receive full credit, your solution must be able to identify when an input pair would go out of bounds for a given input matrix. When this happens, you should ignore that particular point and continue parsing the file. For example, if I have a 5x5 matrix:
-3 3
2 5
neither of these pairs is valid: -3 is not within the bounds of our rows and 5 is not within the bounds of our columns (both of which go from 0 to 4).
TIP: It is easy to mix up or confuse (row, column) with (x, y) from more conventional plots. Point (0, 0) as (row, column) is the top left corner. Try to match your output to vertLine.txt and horizLine.txt to make sure you aren’t inverting things.
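Putting the format and the bounds check together, a minimal sketch of importValues() might look like the following; it assumes the getSize() helper described above and simply skips any out-of-bounds pair:

```python
def importValues(matrix, fName):
    rows, cols = getSize(matrix)
    with open(fName) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue                      # skip blank or malformed lines
            r, c = int(parts[0]), int(parts[1])
            # Ignore any pair that falls outside the matrix bounds.
            if 0 <= r < rows and 0 <= c < cols:
                matrix[r][c] = 1
    return matrix
```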
update()
#INPUT:
# A list of lists forming a 2D matrix (using a built-in list)
#OUTPUT:
# A list of lists forming a 2D matrix.
# The output must be either a numpy.ndarray or a built-in python nested list
List[List[int]] update(List[List[int]] matrix):
The cellular automata system is a finite matrix of cells and four rules that define how the matrix changes with each time step:
- Any live cell with fewer than two live neighbors dies. (Underpopulation)
- Any live cell with two or three live neighbors lives. (Survival)
- Any live cell with more than three live neighbors dies. (Overpopulation)
- Any dead cell with exactly three live neighbors becomes a live cell. (Reproduction)
While most conventional models fake an infinite grid by connecting the top and bottom as well as the left and right edges, our implementation will instead always consider those edges to be empty. Despite this, we can still see a number of interesting behaviors, including infinite patterns and patterns which will gradually move off our finite grid.
TIP: The autograder will tell you when you have a specific cell value out of place along with the timestep at which it occurs. However, you may find it more helpful to visually debug against some pre-generated examples (which can be found in the output directory).
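As a rough sketch of one way to apply these rules (assuming the boundary note above means that anything outside the grid always counts as dead), the new state can be built into a fresh matrix so that every neighbor count uses the old state; emptyMatrix() and getSize() are the helpers from earlier in part 1:

```python
def update(matrix):
    rows, cols = getSize(matrix)
    new_matrix = emptyMatrix(rows, cols)
    for r in range(rows):
        for c in range(cols):
            # Count live neighbors, treating off-grid cells as dead.
            live = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        live += matrix[nr][nc]
            if matrix[r][c] == 1:
                # Survival: exactly two or three live neighbors.
                new_matrix[r][c] = 1 if live in (2, 3) else 0
            else:
                # Reproduction: exactly three live neighbors.
                new_matrix[r][c] = 1 if live == 3 else 0
    return new_matrix
```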
Part 2: 2D Matrix Data Exploration (25 points)
If you complete part 1, congratulations! You now have a Python script capable of reading in two-dimensional points and visualizing changes in that dataset over time. Ignoring the cellular-automata-specific update() function, your code should work for just about any 2D dataset.
Your next task is to find such a dataset and modify your existing code base to visualize it. Your dataset can be a single still image or a collection of datasets which you visualize as a changing gif. It can be a binary dataset (cells are either alive or dead and have only two colors) like the cellular automata, or you can modify your code to display multiple colors by reading in and storing weights at each position. The options are (functionally) limitless – your only restrictions are that it should be a real dataset (not a fake dataset created by hand) and that it must be visualizable in two dimensions (and make sense to do so)!
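If you do end up producing a gif, one possible (but by no means required) way to assemble frames with matplotlib is sketched below; animate_matrix, its parameters, and the output filename are hypothetical names for illustration, not part of the provided code base:

```python
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

def animate_matrix(frames, out_file="visualization.gif"):
    # frames: a list of 2D matrices, one per timestep.
    fig, ax = plt.subplots()
    image = ax.imshow(frames[0], cmap="viridis")
    ax.set_xticks([])
    ax.set_yticks([])

    def draw(i):
        image.set_data(frames[i])   # swap in the matrix for timestep i
        return [image]

    anim = FuncAnimation(fig, draw, frames=len(frames))
    anim.save(out_file, writer=PillowWriter(fps=5))
    plt.close(fig)
```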
Find and describe a dataset
To describe a dataset for this mini-project, you must provide the following information as directly and succinctly as possible:
Dataset Link: Where did you find your dataset? Provide a direct link to its source, but do not give the download link itself. (The website you link should ideally include some information about the dataset you have selected.)
Dataset Format: What is the format of your input file? At minimum, you must give the total number of rows in your dataset and what each column means, even if it is as trivial as an (x, y) coordinate pair. If you have multiple input files (such as different states of a chess game), you should describe this too.
Dataset Visualization: What will your output look like? At minimum, this should describe what a row / column means and what each color in your visualization means.
A quick example: (Note this was not an easy dataset to visualize as a 2D matrix and accordingly more work was needed – find a dataset that matches your interests.)
Link: I plan to visualize changes in air pollution over time using the following dataset: kaggle.
Format: The dataset is a CSV file with eight pollution measurements (concentrations of CO, NO, NO2, O3, SO2, PM2.5, PM10, and NH3) for 18,776 distinct timestamps, one per hour between November 25, 2020, and January 24, 2023. Given that there should be 18,960 hours between those timestamps, a few hours are likely missing. Accordingly, I will have a separate label for missing entries in my visualization.
Visualization: I will make a single 2D visualization per day and plot at least 100 days (my output gif will have at least 100 frames). I will have eight columns (corresponding to each of the eight pollution measurements) and 24 rows (corresponding to each hour in the day).
Each column will be reduced to three values (0, 1, and 2) corresponding to the colors blue, yellow, and red. To flatten my dataset, I will calculate the min and max value for each column and partition that range evenly. In other words, I will assign the bottom third of the range to blue, the middle third to yellow, and the top third to red. Missing entries will be assigned a fourth color (black).
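A hypothetical sketch of the flattening step described in this example (the bin_column helper and its use of None for missing entries are assumptions for illustration, not part of the provided code):

```python
def bin_column(values):
    # Map each value to 0 (bottom third), 1 (middle third), or 2 (top third)
    # of the column's range; missing entries (None) become 3.
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)           # assumes at least one value
    third = (hi - lo) / 3 or 1                    # avoid dividing by zero
    binned = []
    for v in values:
        if v is None:
            binned.append(3)
        else:
            binned.append(min(int((v - lo) / third), 2))
    return binned
```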
Submit your data visualization (and code)
As almost every dataset will require at least a little adjustment to the code you built in part 1, you will need to submit your modified code as well as your final visualization separately. WARNING: Be very careful not to modify your submission to part 1 when completing part 2! Even minor changes to the functions can break the autograder and cost you significant points.
Your final visualization should be either a pdf or a gif and should be linked directly from your GitHub repo.
Your final code submission should be directly runnable, so be sure your GitHub repo includes either your full dataset or a small subset of your dataset (if the full dataset is too large for GitHub).