MP 9: Local MapReduce
Overview
In lecture, you have learned about MapReduce and, in Homework 9, read the original research paper that introduced MapReduce to the Computer Science community. In this MP, you will implement MapReduce mappers and reducers to solve many classical problems using MapReduce.
Specifically:
- You will write several mapper functions to map files into key-value pairs that can be reduced,
- You will write several reducer functions to reduce key-values sets,
- You will solve different types of problems using MapReduce
MP9 Overview Session
An MP9 overview session was held on Monday, Nov. 1 at 6:00pm on Zoom.
Initial Files
In your CS 240 directory, merge the initial starting files with the following commands:
git fetch release
git merge release/mp9 -m "Merging initial files"
Machine Problem
You will complete the code to accomplish the following MapReduce tasks:
- Count the number of letters in text, in
count_letters.py
- Count the number of words in text, in
count_words.py
- Count the length of words in text, in
count_word_length.py
- Count the number of bigrams in text, in
count_bigrams.py
- Find the set of mutual friends of pairs of friends, in
find_mutual_friends.py
.
Clarification on find_mutual_friends.py
- You are expected to output mutual friends for pair of users who are already “friends”. If your output contains mutual friends for those pair of users who are not friends - it is still fine. But apart from pair of users as the key and their mutual friends as the value (in the format specified in the MP release) you shouldn’t output any other information.
- You can assume that the friendship relationship is symmetric i.e. if A has B as a friend, B will have A as friend too.
- Your implementation should run on both
data/graph0/*.txt
as well asdata/graphs/graph-0.txt
i.e. the input file doesn’t need to always contain just one user’s data. It can have multiple users’ data. - For the purpose of this MP, you can assume that a single line will represent the complete list of friends for that user. Note that the ideal solution doesn’t need to rely on this assumption and can support multiple lines of input for a single user.
Required Structure by Example: count_letters.py
Each of these tasks must be solved using a mapper and a reducer in MapReduce. The code for count_letters.py
has been provided and all functions should use an identical structure:
if __name__ == '__main__':
mr = MapReduce(map_letters.map, reduce_sum.reduce)
mr( sys.argv[1:] )
Mapper Functions
In count_letters.py
, the code imports mappers/map_letters.py
. This, and all map functions, must implement a function named map
that receives the name of the input file as its one and only parameter. The map
function must completely read the file and return a dictionary (native Python dict
) representation of the file as key-value pairs OR an array of dictionaries (if the one file should create multiple summaries).
Your code for mappers/map_letters.py
might have the following structure:
# map function:
def map(filename):
# Open the file and read the contents:
with open(filename, "r", encoding="utf-8") as f:
lines = f.read()
# Accumulate the letter count in a dictionary, to create:
# {"a": 3, "b": 7, ...}
# ...
# Return the dictionary:
return d
When writing the mappers:
- You must place your mappers in the
mappers
directory. You will need to create new files. - For mappers that deal with text, transform the text to all lower case and treat all non-letter characters as delimiters. (Letters are only
a
-z
.) - For mappers that deal with other data, do not transform them in any way.
Reducer Functions
In count_letters.py
, the code imports reducers/reduce_sum.py
. This, and all reduce functions, must implement a function named reduce
that receives two dictionaries (native Python dict
) and return one reduced dictionary (also as a native dict
).
Your code for reducers/reduce_sum.py
might have the following structure:
def reduce(left, right):
# Merge `left` into `right`:
# ...
return right
When writing your reducers:
- You must place your reducers in the
reducers
directory. You will need to create new files.
Testing Your Code
Each MapReduce program can be run by calling Python with arguments for the input files. A number of example input files have been provided for you in data
. For example:
-
py count_letters.py data/books/*.txt
will count the letters in the files contained indata/books/*.txt
files. A subset of the expected output is provided at the top ofcount_letters.py
. (The text from the books come from Gutenberg Project texts.) -
py count_bigrams.py data/taylor/*.txt
will count the letters in the files contained indata/taylor/*.txt
files. A subset of the expected output is provided at the top ofcount_bigrams.py
. (The lyrics are from Taylor Swift’s “Blank Space” song.) -
…etc…
Submit
When you have completed your program, double-check that your program runs as expected by the specifications above. When you are ready, submit the code via the following git commands:
git add -u
git add mappers # Add all the new mappers you created
git add reducers # Add all the new reducers you created
git commit -m "MP submission"
git push origin master
You can verify your code was successfully submitted by viewing your git repo on your github account: https://github.com.