Final Project Example Algorithms

by Brad Solomon

Choosing an algorithm

A good final project answers a non-trivial question through the implementation of a non-trivial algorithm. A great final project does all of this using an algorithm that was not directly covered in class. However it is often difficult to both come up with a named algorithm not discussed in class and even more difficult to pick an algorithm which is of reasonable difficulty. This resource is intended to curate a number of reasonable algorithm choices for common data science questions.

Graph Algorithms

If using a graph dataset as input, there are a number of viable graph algorithms that can be used to answer many interesting questions. This section contains a non-exhaustive list of such algorithms.

Shortest Path Algorithms

Both Dijkstra’s and Floyd-Warshall Algorithms for shortest path are accessible implementations of the shortest path problem. An interesting extension of shortest path is landmark path, which uses these same algorithms to produce a single path through many landmarks in a particular order in the graph.

Minimum Spanning Tree

Both Kruskal’s Algorithm and Prim’s Algorithm for creating a minimum spanning tree are reasonable projects for graph data and can be used creatively to answer interesting questions on graph data.

A* Search

A* search is a graph traversal and path search algorithm and can be thought of as a more efficient and more complex Dijkstra. Implementing A* will be a challenge, especially because you have to create a valid heuristic. Failing to define one correctly will result in the complete failure of the algorithm, so be careful and ask for help early if you take this path!

Iterative Deepening DFS

Breadth-first search and depth-first search are too basic to be a valid final project (as we will cover them thoroughly in class). Iteartive deepening DFS is a reasonable extension from these lectures however and is a much more efficient solution then either strategy discussed in class.

Betweeness Centrality

Betweenness centrality is a measure of connectivity in a graph network based on shortest path algorithms and can be used to answer a number of questions about social networks or the overall centrality of a graph. At the same time, it can be implemented using repeated applications of Floyd-Warshall and is thus a fairly accessible algorithm for a final project.

Eulerian Path / Cycle Identification

A Eulerian path is one that visits every edge exactly once and identifying such a path can be useful to a number of leading questions. Like many of these algorithms, wikipedia lists several possible implementations. In this case, the simplest is Fleury’s which is somewhat novel relative to the course content but is still reasonably accessible.

Strongly Connected Components

A strongly connected component is one where every vertex is reachable from every other vertex. This is very useful to identify clusters of objects in a dataset which are related in some way and there are a number of implementations which build off of depth-first search to compute these components in linear time.

PageRank Algorithm

PageRank is a very interesting algorithm historically used by Google to rank web pages for their search engine. Implementing PageRank requires having external knowledge of probabilities and a probability distribution, but the actual specifics of the algorithm are relatively straightforward given that knowledge. Do not do PageRank unless you are comfortable with writing up mathematical expressions in code (and know probability).

CSV, JSON, or plaintext Algorithms

There is much less structure to a CSV, JSON or plaintext data source and therefore there is a much greater range of possibilities. This freedom can be quite dangerous – as the project must be sufficiently challenging to be a good capstone to the course but not so difficult as to be impossible. Given that, here are a few interesting algorithms for you to consider when selecting a non-graph dataset or non-graph algorithm.

Bloom Filters

A bloom filter is a space-efficient but lossy data structure that can answer a single question – ‘is an input value a member of the set of all objects stored in the bloom filter?’ (Even then it cannot do so with 100% accuracy). At the same time, it is a very accessible data structure given a modest understanding of hashing and can be a great project for storing very large datasets in a very small space while still allowing them to be searched directly.

Suffix Arrays

A suffix array is a data storage structure for strings that can be built using depth-first traversal of a suffix tree or constructed directly in linear time. This is a much simpler algorithm then most or all of the other data structures and would have to be combined with a truly interesting use-case or other addition to be a valid final project.

Burrows-Wheeler Transform

The Burrows-Wheeler Transform is a lossless and reversible data transform that can be used to improve compression of an input text while still being directly searchable for an arbitrary substring. Some creativity would be needed to identify how to store most datasets in a BWT but a text-heavy input could use a BWT algorithm alongside some other algorithms or approaches.