Machine Problem 4 (MP4): Creating a search engine using Hadoop

1. Objectives and Overview:

This MP is focused on helping you learn the basics of the MapReduce paradigm and on gaining hands-on experience coding in Hadoop.

Cheating and code copying are strictly prohibited. Copied code will, at the very least, result in the entire assignment being discarded from grading. This includes code copied from the internet.

2. Development Setup:

You will work on the EWS Lab Linux machines provided to you. These are hosted at remlnx.ews.illinois.edu. Please refer to http://it.engineering.illinois.edu/ews/lab-information/remote-connections for details of usage.

You are encouraged to discuss design ideas and bugs on Piazza. The forums are a great tool for collective learning. However, please refrain from posting large amounts of code. Two or three lines of code are fine. High-level pseudo-code is also fine.

3. Introduction:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. It is a framework for processing embarrassingly parallel problems across huge datasets using a large number of computers.

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the seminal paper MapReduce: Simplified Data Processing on Large Clusters.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce. Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at Yahoo at the time, named it after his son's toy elephant!

4. Getting started with Hadoop:

Copy the starter folder located in /class/cs423/MP4/starter/ to your home directory. The /class/cs423 folder may not be visible in your file browser, and will appear only after you have accessed it through the terminal. Using cp -r /class/cs423/MP4/starter/ ~/ should do the trick.

- We will be using the Hadoop 1.0.4 distribution that is provided in the hadoop-1.0.4 folder.
- An example program that counts the number of times a word appears in the given documents is provided in the word_count folder; a rough sketch of what such a job typically looks like follows these steps.
- Run make in the word_count folder and copy the resulting .jar file into the hadoop-1.0.4 folder.
- The files on which we want to run wordCount are provided in the input folder, and we want the output in a folder named output. Hadoop will report an error if the output folder already exists, so the script ./run.sh provided in the hadoop-1.0.4 folder removes such a folder if it exists. The input and output folder names are passed as command-line arguments, as can be seen in run.sh.
- Run chmod +x run.sh and then ./run.sh. The output should appear in the newly created output folder.
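
For orientation, the provided word_count program is presumably close to the classic Hadoop WordCount example. Here is a rough sketch of such a job; the class names and details are illustrative and may not match the starter code exactly.

// Sketch of a classic WordCount job for Hadoop 1.x. The starter's word_count
// code may differ in details; this is only meant to show the overall shape.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every whitespace-separated token in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted for each word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input folder, args[1] = output folder, as passed by run.sh.
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}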

Do read the Hadoop MapReduce tutorial for further information.

5. Problem Description and Implementation Overview:

In this MP, we will be creating a search engine that indexes the dumps of Wikipedia stored in /class/cs423/MP4/data on the EWS machines. There are two XML files that need to be indexed. Since you only need read access to these files, you do not have to copy them to your home directory. In case you want to use a different Hadoop setup or machine, you can download the data files from Box (compressed, 600 MB).

Parsing: In the first step, the XML files need to be parsed into individual articles. The DOM structure of each XML file is mediawiki -> [page -> title, text]+. It might be advisable to split the XML files into numbered documents for easy reference, each containing one Wikipedia article, defined by a title and the associated text. There are a number of libraries in different languages that can help you perform this task; one possible approach is sketched below.
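
One way to do this is to stream through the dump with Java's built-in StAX parser, pull out each page's title and text, and write every article to its own numbered file. The output layout, the file naming, and the choice of putting the title on the first line are assumptions of this sketch, not requirements.

// Sketch: split a mediawiki XML dump into numbered per-article files using StAX.
// Paths and the one-file-per-article layout are assumptions, not requirements.
import java.io.FileInputStream;
import java.io.PrintWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class SplitWikiDump {
    public static void main(String[] args) throws Exception {
        // args[0] = path to the wiki XML dump, args[1] = output directory
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        String title = null;
        StringBuilder buf = null;   // collects character data inside <title> or <text>
        int docId = 0;

        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("title") || name.equals("text")) {
                    buf = new StringBuilder();
                }
            } else if (event == XMLStreamConstants.CHARACTERS && buf != null) {
                buf.append(r.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("title")) {
                    title = buf.toString();
                    buf = null;
                } else if (name.equals("text")) {
                    // One numbered document per article: title on the first line, text after it.
                    PrintWriter out = new PrintWriter(args[1] + "/" + (++docId) + ".txt", "UTF-8");
                    out.println(title);
                    out.println(buf.toString());
                    out.close();
                    buf = null;
                }
            }
        }
        r.close();
    }
}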


Map/Reduce: During mapping, every single-word token from each document (title + text) is read and emitted as a (token, document id) pair. All tokens are converted to lower case before the pair is emitted.

The Java code for tokenizing would be String[] tokens = line.toLowerCase().split("[^a-z]");
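
For illustration, a Mapper along the following lines would emit the required pairs. It assumes the articles were pre-split into one file per document (as suggested in the Parsing step) and that the input file name, e.g. 42.txt, encodes the document id; how you actually carry the document id through the job is your design choice.

// Sketch of the indexing Mapper, assuming one pre-split article per input file
// and that the file name (e.g. "42.txt") encodes the document id.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text token = new Text();
    private final Text docId = new Text();

    @Override
    protected void setup(Context context) {
        // Derive the document id from the input file name, e.g. "42.txt" -> "42".
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        docId.set(fileName.replaceAll("\\.txt$", ""));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize exactly as the spec requires: lower-case, split on non-letters.
        String[] tokens = value.toString().toLowerCase().split("[^a-z]");
        for (String t : tokens) {
            if (t.isEmpty()) continue;   // split("[^a-z]") produces empty strings
            token.set(t);
            context.write(token, docId); // emit (token, document id)
        }
    }
}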


During reduction, we form the inverted index, i.e., we create a table with single-word tokens as keys and, for each token, the list of documents (with counts) in which it appears as the value.
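
A minimal Reducer sketch to pair with the Mapper above: it counts how often each document id occurs for a token and emits one postings line per token. The on-disk format of the postings list is up to you.

// Sketch of the indexing Reducer: counts how many times each document id
// appears for a token and emits one postings line per token.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text token, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        // Count occurrences of the token per document (its term frequency).
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (Text id : docIds) {
            String d = id.toString();
            Integer c = tf.get(d);
            tf.put(d, c == null ? 1 : c + 1);
        }
        // Build a postings list like: 1 [1], 2 [1]
        StringBuilder postings = new StringBuilder();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            if (postings.length() > 0) postings.append(", ");
            postings.append(e.getKey()).append(" [").append(e.getValue()).append("]");
        }
        context.write(token, new Text(postings.toString()));
    }
}

Such a Mapper and Reducer would be wired together by a driver much like the WordCount sketch in Section 4, with Text as both the output key and value class.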

As an example, suppose we have the following three documents:
1: "Hello World"
2: "HELLO again"
3: "Goodbye cruel world, goodbye"

The (token, document id) pairs after the map phase will be (hello, 1), (world, 1), (hello, 2), (again, 2), (goodbye, 3), (cruel, 3), (world, 3), (goodbye, 3).

The inverted document index would be as follows:
hello: (1 [1], 2 [1])
world: (1 [1], 3 [1])
again: (2 [1])
goodbye: (3 [2])
cruel: (3 [1])


Square brackets show the term frequency of the token in each document. Remember that both the title and the text of the document need to be indexed in the case of the wiki dump. An example text element from the wiki dump is:
 
{{for|album of the same name|Octavarium (album)}}
{{Multiple issues|refimprove=May 2007|onesource=March 2009|cleanup=August 2007}}

{{Infobox song
| Name = Octavarium
| Cover =
| Format = [[Compact disc|CD]]
| Artist = [[Dream Theater]]
| Album = [[Octavarium (album)|Octavarium]]
| track_no = 8
| Recorded = 2005
| Genre = [[Progressive metal]], [[symphonic metal]], [[progressive rock]], [[ambient music]]
| Length = 24:00
| Label = [[Atlantic Records]]
| Writer = [[James LaBrie]], [[John Petrucci]], [[Mike Portnoy]]
| Composer = [[Dream Theater]]
| prev           = "Sacrificed Sons"
| prev_no        = 7
| Misc = 
}}

'''Octavarium''' is a song by [[progressive metal]] band [[Dream Theater]], from the [[Octavarium (album)|album of the same name]]. 

The song starts with [[Jordan Rudess]] using his [[Continuum (instrument)|Haken Continuum]] and his [[lap steel guitar]], drawing references from [[Pink Floyd]]'s "[[Shine On You Crazy Diamond]]", [[Tangerine Dream]], [[Marty Friedman (guitarist)|Marty Friedman]]'s ''[[Scenes (album)|Scenes]]'', and [[Queen (band)|Queen]]'s "[[Innuendo (album)#Bijou|Bijou]]".
 
Just retrieve the text and apply the tokenizer without any further cleaning of the text.

 

Searching: Output this inverted index to disk and write a program that can load it and answer single-word and multi-word queries made by the user. For multi-word queries, you need to take the intersection of the document lists of the individual tokens. For each query, the list of the top ten documents, ranked according to the metric below, must be returned.

- Ranking: For any token, rank the list of documents it is found in by term frequency (tf): the document in which the term appears the most times is ranked first, the document in which it appears the second-most times is ranked second, and so on. The results for multi-word queries should be ranked according to the combined term frequency of the documents (after taking the intersection).

The search results should be the titles of the relevant pages.
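
A rough sketch of the query side follows. It assumes the inverted index was written as lines of the form token<TAB>docId [tf], docId [tf], ... (as in the Reducer sketch above) and that a separate tab-separated file maps document ids to titles; both file formats are assumptions of this sketch, not part of the assignment.

// Sketch of a query program: load the index, intersect postings for multi-word
// queries, rank by combined term frequency, and print the top-ten titles.
// Index format ("token\tdoc [tf], doc [tf]") and the titles file are assumptions.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Search {
    public static void main(String[] args) throws Exception {
        // args[0] = inverted index file, args[1] = docId -> title file, args[2..] = query terms
        Map<String, Map<String, Integer>> index = loadIndex(args[0]);
        Map<String, String> titles = loadTitles(args[1]);

        // Intersect the postings lists of all query terms, summing term frequencies.
        Map<String, Integer> combined = null;
        for (int i = 2; i < args.length; i++) {
            Map<String, Integer> postings = index.get(args[i].toLowerCase());
            if (postings == null) postings = new HashMap<String, Integer>();
            if (combined == null) {
                combined = new HashMap<String, Integer>(postings);
            } else {
                Map<String, Integer> next = new HashMap<String, Integer>();
                for (Map.Entry<String, Integer> e : combined.entrySet()) {
                    Integer tf = postings.get(e.getKey());
                    if (tf != null) next.put(e.getKey(), e.getValue() + tf);
                }
                combined = next;
            }
        }
        if (combined == null) combined = new HashMap<String, Integer>();

        // Rank by combined term frequency and print the top ten titles.
        List<Map.Entry<String, Integer>> ranked =
                new ArrayList<Map.Entry<String, Integer>>(combined.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        for (int i = 0; i < Math.min(10, ranked.size()); i++) {
            System.out.println((i + 1) + ") " + titles.get(ranked.get(i).getKey()));
        }
    }

    // Parse lines like "hello<TAB>1 [1], 2 [1]" into token -> (docId -> tf).
    private static Map<String, Map<String, Integer>> loadIndex(String path) throws Exception {
        Map<String, Map<String, Integer>> index = new HashMap<String, Map<String, Integer>>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) continue;
            Map<String, Integer> postings = new HashMap<String, Integer>();
            for (String entry : parts[1].split(", ")) {
                String[] p = entry.split(" \\[");
                postings.put(p[0], Integer.parseInt(p[1].replace("]", "")));
            }
            index.put(parts[0], postings);
        }
        in.close();
        return index;
    }

    // Parse lines like "42<TAB>Octavarium" into docId -> title.
    private static Map<String, String> loadTitles(String path) throws Exception {
        Map<String, String> titles = new HashMap<String, String>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) titles.put(parts[0], parts[1]);
        }
        in.close();
        return titles;
    }
}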

Example: 

Search term = "Tennis"

Results:

1) 2008 US Open (tennis)

2) List of table tennis players

3) World Table Tennis Championships

4) 2008 ATP Tour

6. Demo:

Have a program which, when given search terms, produces output in the format of the example above. The pre-processing and MapReduce computation should be performed beforehand, and the necessary inverted index should be available for your demo program.

7. References:

Links have been provided at the appropriate places in the writeup above.