MP 9: Local MapReduce

Extra Credit: Completed and turned in via git before November 5, 2021 at 11:59pm
Due Date: Completed and turned in via git before November 7, 2021 at 11:59pm
Points: MP 9 is worth 50 points

Overview

In lecture, you have learned about MapReduce and, in Homework 9, read the original research paper that introduced MapReduce to the Computer Science community. In this MP, you will implement MapReduce mappers and reducers to solve many classical problems using MapReduce.

Specifically:

  • You will write several mapper functions to map files into key-value pairs that can be reduced,
  • You will write several reducer functions to reduce key-values sets,
  • You will solve different types of problems using MapReduce

MP9 Overview Session

An MP9 overview session was held on Monday, Nov. 1 at 6:00pm on Zoom.

Initial Files

In your CS 240 directory, merge the initial starting files with the following commands:

git fetch release
git merge release/mp9 -m "Merging initial files"

Machine Problem

You will complete the code to accomplish the following MapReduce tasks:

  • Count the number of letters in text, in count_letters.py
  • Count the number of words in text, in count_words.py
  • Count the length of words in text, in count_word_length.py
  • Count the number of bigrams in text, in count_bigrams.py
  • Find the set of mutual friends of pairs of friends, in find_mutual_friends.py.

Clarification on find_mutual_friends.py

  • You are expected to output mutual friends for pair of users who are already “friends”. If your output contains mutual friends for those pair of users who are not friends - it is still fine. But apart from pair of users as the key and their mutual friends as the value (in the format specified in the MP release) you shouldn’t output any other information.
  • You can assume that the friendship relationship is symmetric i.e. if A has B as a friend, B will have A as friend too.
  • Your implementation should run on both data/graph0/*.txt as well as data/graphs/graph-0.txt i.e. the input file doesn’t need to always contain just one user’s data. It can have multiple users’ data.
  • For the purpose of this MP, you can assume that a single line will represent the complete list of friends for that user. Note that the ideal solution doesn’t need to rely on this assumption and can support multiple lines of input for a single user.

Required Structure by Example: count_letters.py

Each of these tasks must be solved using a mapper and a reducer in MapReduce. The code for count_letters.py has been provided and all functions should use an identical structure:

if __name__ == '__main__':
  mr = MapReduce(map_letters.map, reduce_sum.reduce)
  mr( sys.argv[1:] )

Mapper Functions

In count_letters.py, the code imports mappers/map_letters.py. This, and all map functions, must implement a function named map that receives the name of the input file as its one and only parameter. The map function must completely read the file and return a dictionary (native Python dict) representation of the file as key-value pairs OR an array of dictionaries (if the one file should create multiple summaries).

Your code for mappers/map_letters.py might have the following structure:

# map function:
def map(filename):
  # Open the file and read the contents:
  with open(filename, "r", encoding="utf-8") as f:
    lines = f.read()

  # Accumulate the letter count in a dictionary, to create:
  #   {"a": 3, "b": 7, ...}
  # ...

  # Return the dictionary:
  return d

When writing the mappers:

  • You must place your mappers in the mappers directory. You will need to create new files.
  • For mappers that deal with text, transform the text to all lower case and treat all non-letter characters as delimiters. (Letters are only a-z.)
  • For mappers that deal with other data, do not transform them in any way.

Reducer Functions

In count_letters.py, the code imports reducers/reduce_sum.py. This, and all reduce functions, must implement a function named reduce that receives two dictionaries (native Python dict) and return one reduced dictionary (also as a native dict).

Your code for reducers/reduce_sum.py might have the following structure:

def reduce(left, right):
  # Merge `left` into `right`:
  # ...
    
  return right

When writing your reducers:

  • You must place your reducers in the reducers directory. You will need to create new files.

Testing Your Code

Each MapReduce program can be run by calling Python with arguments for the input files. A number of example input files have been provided for you in data. For example:

  • py count_letters.py data/books/*.txt will count the letters in the files contained in data/books/*.txt files. A subset of the expected output is provided at the top of count_letters.py. (The text from the books come from Gutenberg Project texts.)

  • py count_bigrams.py data/taylor/*.txt will count the letters in the files contained in data/taylor/*.txt files. A subset of the expected output is provided at the top of count_bigrams.py. (The lyrics are from Taylor Swift’s “Blank Space” song.)

  • …etc…

Submit

When you have completed your program, double-check that your program runs as expected by the specifications above. When you are ready, submit the code via the following git commands:

git add -u
git add mappers   # Add all the new mappers you created
git add reducers  # Add all the new reducers you created
git commit -m "MP submission"
git push origin master

You can verify your code was successfully submitted by viewing your git repo on your github account: https://github.com.