Final Project Example Datasets

by Brad Solomon

Choosing a dataset

The goal of the final project is to give you a chance to apply the skills and knowledge you’ve developed throughout the class to answer a question that is of meaningful interest to you. Accordingly, you are strongly encouraged to identify an external dataset that is not listed here. That said, for those struggling to identify a data science project of interest, this page is a good place to start.

Graph Data

Graph algorithms require graph inputs, and while you can sometimes build a graph from some other form of input, it is often easier to start with a graph dataset. If you find a dataset here but are struggling to identify a graph algorithm (a reasonable problem given that graphs are covered late in the semester), take a look at the algorithms page.

OpenFlights

OpenFlights is an open-source dataset of flight routes and airports. The data is currently dated from 2014 and has no path for updates, but it is still interesting for historical study. As a project, you would load at least the routes data and treat it as a directed graph.

One key question in this case is what to use as the edge weights for the graph. There are several intuitive choices:

Count the number of routes on some edge and use (1/number of routes). This would give a measure of throughput on the route.
Use the airport data to compute the distance and use that as the weight.
Augment the graph with a separate node for each airline at the airport, in addition to a node for the airport itself. Then add edges between each airline node and its airport node. You can then have a standard weight for flights and for moving from airlines to airports.
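
The first two weighting choices can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed approach: the function names are hypothetical, the airport codes are made-up toy values rather than rows from the real files, and the distance uses the haversine formula with an assumed mean Earth radius of 6371 km.

```python
import math
from collections import Counter


def inverse_route_count_weights(routes):
    """Map each directed edge (src, dst) to 1 / (number of routes on that edge)."""
    counts = Counter(routes)
    return {edge: 1.0 / n for edge, n in counts.items()}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # assumed mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Toy (source, destination) pairs; the codes are placeholders, not real airports.
toy_routes = [("AAA", "BBB"), ("AAA", "BBB"), ("BBB", "AAA"), ("AAA", "CCC")]
weights = inverse_route_count_weights(toy_routes)
```

Because the graph is directed, ("AAA", "BBB") and ("BBB", "AAA") are counted as separate edges. For the distance option, you would look up each airport's latitude and longitude in the airport data and pass them to the distance function.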

You can find this dataset and descriptions here: OpenFlights

Stanford Large Network Dataset Collection

This is a large repository of graph datasets in a very general format. It contains both directed and undirected graphs, as well as both weighted and unweighted data.

Almost every graph algorithm can be used to answer a question within these datasets. The most important choice here is going to be matching an algorithm to an appropriate dataset. For example, many algorithms don’t work on directed data. Others require weighted data. That said, due to the very clean and simple format of these datasets, choosing one of them is probably the most straightforward option for the project.
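
As a rough sketch of how simple loading can be, the snippet below reads an edge list into an adjacency structure. It assumes one whitespace-separated "from to" pair per line with '#' marking comment lines; check the page of the specific dataset you pick, since that layout is an assumption here. The function name and the sample text are hypothetical.

```python
import io


def read_edge_list(lines, directed=True):
    """Build {node: set(neighbors)} from lines of 'u v' pairs, skipping '#' comments."""
    adjacency = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()
        u, v = parts[0], parts[1]
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set())  # make sure sink nodes still appear
        if not directed:
            adjacency[v].add(u)  # mirror the edge for undirected graphs
    return adjacency


# Made-up sample in the assumed layout.
sample = "# toy graph\n1\t2\n2\t3\n"
directed_graph = read_edge_list(io.StringIO(sample), directed=True)
undirected_graph = read_edge_list(io.StringIO(sample), directed=False)
```

The directed flag is where the match between algorithm and dataset shows up: an algorithm that expects undirected input needs the mirrored edges, while a directed one should see the data as given.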

These datasets can be found here: StanfordData

CSV Datasets

The most common and best-supported file type for arbitrary datasets is the CSV. Algorithms that deal with storing or searching large collections of data will likely begin with a CSV file that is parsed for relevant information. Many of these datasets can be very large, so be careful when trying to use them!
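
One way to stay careful with large files is to stream rows and keep only the columns you need, rather than loading everything at once. The sketch below uses Python's standard csv module; the function name, column names, and sample data are hypothetical.

```python
import csv
import io


def iter_columns(handle, wanted):
    """Yield one dict per row containing only the requested columns."""
    reader = csv.DictReader(handle)  # uses the first row as the header
    for row in reader:
        yield {name: row[name] for name in wanted}


# Made-up sample; with a real file you would pass open(path, newline="") instead.
toy_csv = io.StringIO("id,name,score\n1,alpha,3.5\n2,beta,4.0\n")
rows = list(iter_columns(toy_csv, ["id", "score"]))
```

Values come back as strings, so convert them to numbers yourself where needed. In a real project you would usually feed each row into your data structure as it arrives instead of collecting everything into a list.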

Kaggle

Kaggle is a great resource for reasonably sized but meaningful datasets in both CSV and JSON format. Even better, it has a semi-functional search tool to find data that may interest you! As a fun fact, many of the mp ‘toy’ datasets were influenced by a real-world Kaggle equivalent.

These datasets can be found here: Kaggle

US Data Collections

For those interested in large-scale public policy or related datasets, thanks to the M-13-13 Open Data Policy you can access a significant amount of federally collected information. The format is not entirely consistent – many of the datasets are automatically datamined from websites and the like – but you can find CSV and JSON formats with a little bit of searching.

These datasets can be found here: data.gov

The US Census Bureau also has a similar website which you can use to access population data for the United States. The website format here is more consistent (and you can choose between CSV and JSON), but extracting the raw data is harder to do and downloads will frequently fail. Use it at your own peril.

These datasets can be found here: US Census

Data Science Central

Data Science Central has entirely too many articles and curated lists of datasets and data science projects. Using these resources will be a more involved process, as they link to other repositories rather than to datasets directly. Two such resources (which were released in 2013 and 2015 respectively) are: