Python MapReduce Hadoop Streaming Job that requires 3 input files?

Question

I have 3 small sample input files (the actual files are much larger):

# File Name: books.txt
# File Format: BookID|Title
1|The Hunger Games
2|To Kill a Mockingbird
3|Pride and Prejudice
4|Animal Farm

# File Name: ratings.txt
# File Format: ReaderID|BookID|Rating
101|1|1
102|2|2
103|3|3
104|4|4
105|1|5
106|2|1
107|3|2
108|4|3

# File Name: readers.txt
# File Format: ReaderID|Gender|PostCode|PreferComms
101|M|1000|email
102|F|1001|mobile
103|M|1002|email
104|F|1003|mobile
105|M|1004|email
106|F|1005|mobile
107|M|1006|email
108|F|1007|mobile

I want to create a Python MapReduce Hadoop Streaming job that produces the following output, which is the average rating by title and gender:

Animal Farm F   3.5
Pride and Prejudice M   2.5
The Hunger Games    M   3
To Kill a Mockingbird   F   1.5

I searched this forum and someone pointed to a solution, but it is for 2 input files instead of 3. I gave it a go but am stuck at the mapper: I cannot get its output sorted so that the reducer can recognise the first record for each Title and Gender and then start aggregating. My mapper code is below:

#!/usr/bin/env python
import sys
for line in sys.stdin:

    try:

        ReaderID = "-1"
        BookID = "-1"
        Title = "-1"
        Gender = "-1"
        Rating = "-1"

        line = line.strip()

        splits = line.split("|")

        if len(splits) == 2:
            BookID = splits[0]
            Title = splits[1]
        elif len(splits) == 3:
            ReaderID = splits[0]
            BookID = splits[1]
            Rating = splits[2]
        else:
            ReaderID = splits[0]
            Gender = splits[1]

        print('%s\t%s\t%s\t%s\t%s' % (BookID, Title, ReaderID, Rating, Gender))

    except:
        pass

PS: I need to use Python and Hadoop Streaming only. I am not allowed to install Python packages such as Dumbo or mrjob.

Appreciate your help in advance.

Thanks, Lobbie

Answer 1

I went through some core Java MapReduce examples, and they all suggest that the three files cannot be joined in a single map job. We first have to join the first two files, and then join the result with the third. Applying your logic to all three files did not give me a good result, so I tried pandas, which seems to give promising results. If using pandas is not a constraint for you, please try my code. Otherwise, we can try to join the three files with Python dictionaries and lists.

Here is my suggested code. I have simply concatenated all the inputs to test it. In your code, just comment out my for loop (line #36) and uncomment your for loop (line #35).

import pandas as pd
import sys

input_string_book = [
"1|The Hunger Games",
"2|To Kill a Mockingbird",
"3|Pride and Prejudice",
"4|Animal Farm"]
input_string_book_df = pd.DataFrame(columns=('BookID','Title'))


input_string_rating = [
"101|1|1",
"102|2|2",
"103|3|3",
"104|4|4",
"105|1|5",
"106|2|1",
"107|3|2",
"108|4|3"]
input_string_rating_df = pd.DataFrame(columns=('ReaderID','BookID','Rating'))


input_string_reader = [
"101|M|1000|email",
"102|F|1001|mobile",
"103|M|1002|email",
"104|F|1003|mobile",
"105|M|1004|email",
"106|F|1005|mobile",
"107|M|1006|email",
"108|F|1007|mobile"]
input_string_reader_df = pd.DataFrame(columns=('ReaderID','Gender','PostCode','PreferComms'))

#for line in sys.stdin:
for line in input_string_book + input_string_rating + input_string_reader:
    try:

        line = line.strip()

        splits = line.split("|")

        if len(splits) == 2:
            # DataFrame.append was removed in pandas 2.0, so rows are added with pd.concat
            input_string_book_df = pd.concat([input_string_book_df,
                pd.DataFrame([splits], columns=('BookID','Title'))], ignore_index=True)
        elif len(splits) == 3:
            input_string_rating_df = pd.concat([input_string_rating_df,
                pd.DataFrame([splits], columns=('ReaderID','BookID','Rating'))], ignore_index=True)
        else:
            input_string_reader_df = pd.concat([input_string_reader_df,
                pd.DataFrame([splits[:4]], columns=('ReaderID','Gender','PostCode','PreferComms'))], ignore_index=True)

    except:
        raise

l_concat_1 = input_string_book_df.merge(input_string_rating_df,on='BookID',how='inner')

l_concat_2 = l_concat_1.merge(input_string_reader_df,on='ReaderID',how='inner')

# itertuples yields one named tuple per joined row, so columns can be read by name
for row in l_concat_2[['BookID', 'Title', 'ReaderID', 'Rating', 'Gender']].itertuples(index=False):
    print('%s\t%s\t%s\t%s\t%s' % (row.BookID, row.Title, row.ReaderID, row.Rating, row.Gender))

Output

1       The Hunger Games        101     1       M
1       The Hunger Games        105     5       M
2       To Kill a Mockingbird   102     2       F
2       To Kill a Mockingbird   106     1       F
3       Pride and Prejudice     103     3       M
3       Pride and Prejudice     107     2       M
4       Animal Farm     104     4       F
4       Animal Farm     108     3       F
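
If pandas is not available, the dictionary-and-list join mentioned above could look roughly like the sketch below. It is only an illustration under some assumptions: the function name average_rating_by_title_and_gender and the SAMPLE_LINES list are made up for this example, record types are told apart by field count (2 = book, 3 = rating, 4 = reader) as in the code above, and everything is held in memory, so it does not by itself solve the problem of distributing the join across mappers and reducers for large files.

```python
from collections import defaultdict

# Hypothetical sample input: the three files from the question, concatenated.
SAMPLE_LINES = [
    "1|The Hunger Games", "2|To Kill a Mockingbird",
    "3|Pride and Prejudice", "4|Animal Farm",
    "101|1|1", "102|2|2", "103|3|3", "104|4|4",
    "105|1|5", "106|2|1", "107|3|2", "108|4|3",
    "101|M|1000|email", "102|F|1001|mobile",
    "103|M|1002|email", "104|F|1003|mobile",
    "105|M|1004|email", "106|F|1005|mobile",
    "107|M|1006|email", "108|F|1007|mobile",
]


def average_rating_by_title_and_gender(lines):
    """Join book, rating and reader records in memory and average the ratings."""
    titles = {}    # BookID -> Title
    genders = {}   # ReaderID -> Gender
    ratings = []   # (ReaderID, BookID, Rating) tuples

    for line in lines:
        fields = line.strip().split("|")
        if len(fields) == 2:
            titles[fields[0]] = fields[1]
        elif len(fields) == 3:
            ratings.append((fields[0], fields[1], float(fields[2])))
        elif len(fields) == 4:
            genders[fields[0]] = fields[1]
        # Lines with any other field count are ignored.

    # (Title, Gender) -> [sum of ratings, number of ratings]
    totals = defaultdict(lambda: [0.0, 0])
    for reader_id, book_id, rating in ratings:
        # Inner join: skip ratings whose book or reader is unknown.
        if book_id in titles and reader_id in genders:
            key = (titles[book_id], genders[reader_id])
            totals[key][0] += rating
            totals[key][1] += 1

    return {key: total / count for key, (total, count) in totals.items()}


if __name__ == "__main__":
    averages = average_rating_by_title_and_gender(SAMPLE_LINES)
    for (title, gender), average in sorted(averages.items()):
        print("%s\t%s\t%g" % (title, gender, average))
```

Run on the sample data, this prints one tab-separated line per Title and Gender pair, sorted by title, which matches the output asked for in the question. To read real input, SAMPLE_LINES would be replaced by sys.stdin.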
