I have 3 small sample input files (the actual files are much larger):
# File Name: books.txt
# File Format: BookID|Title
1|The Hunger Games
2|To Kill a Mockingbird
3|Pride and Prejudice
4|Animal Farm
# File Name: ratings.txt
# File Format: ReaderID|BookID|Rating
101|1|1
102|2|2
103|3|3
104|4|4
105|1|5
106|2|1
107|3|2
108|4|3
# File Name: readers.txt
# File Format: ReaderID|Gender|PostCode|PreferComms
101|M|1000|email
102|F|1001|mobile
103|M|1002|email
104|F|1003|mobile
105|M|1004|email
106|F|1005|mobile
107|M|1006|email
108|F|1007|mobile
I want to create a Python MapReduce Hadoop Streaming job that produces the following output, the average rating by Title and Gender:
Animal Farm F 3.5
Pride and Prejudice M 2.5
The Hunger Games M 3
To Kill a Mockingbird F 1.5
I searched this forum and found a suggested solution, but it is for 2 input files instead of 3. I gave it a go but am stuck at the mapper, because I cannot get its output sorted so that the reducer can recognise the first record for each Title and Gender and then start aggregating. My mapper code is below:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    try:
        # Default every field to "-1" so each output row has all five columns.
        ReaderID = "-1"
        BookID = "-1"
        Title = "-1"
        Gender = "-1"
        Rating = "-1"
        line = line.strip()
        splits = line.split("|")
        if len(splits) == 2:
            # books.txt: BookID|Title
            BookID = splits[0]
            Title = splits[1]
        elif len(splits) == 3:
            # ratings.txt: ReaderID|BookID|Rating
            ReaderID = splits[0]
            BookID = splits[1]
            Rating = splits[2]
        else:
            # readers.txt: ReaderID|Gender|PostCode|PreferComms
            ReaderID = splits[0]
            Gender = splits[1]
        print('%s\t%s\t%s\t%s\t%s' % (BookID, Title, ReaderID, Rating, Gender))
    except Exception:
        # Skip malformed lines.
        pass
PS: I need to use Python and Hadoop Streaming only. I am not allowed to install Python packages such as Dumbo or mrjob.
Appreciate your help in advance.
Thanks, Lobbie
I went through some core Java MapReduce examples, and all of them suggest that the three files cannot be joined in a single map job: the first two have to be joined first, and the result is then joined with the third. Applying your logic to all three files did not give me a good result, so I tried pandas, which seems to give a promising result. If using pandas is not a constraint for you, please try my code. Otherwise, we can try to join the three files with Python dictionaries and lists.
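As a rough sketch of that dictionary-and-list alternative, the two-step join could look like the code below. It uses only the standard library. The function name join_records and the sample list are made up for illustration, and the sketch assumes all three inputs fit in memory and arrive in one stream; in a real streaming job each mapper sees only part of the input, so the lookup tables would have to be made available to every task by some other means.

```python
def join_records(lines):
    """Join books, ratings and readers given as pipe-delimited lines.

    Hypothetical helper: records are told apart by their field count,
    as in the mapper from the question.
    """
    books = {}     # BookID -> Title
    readers = {}   # ReaderID -> Gender
    ratings = []   # (ReaderID, BookID, Rating)
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) == 2:
            books[fields[0]] = fields[1]
        elif len(fields) == 3:
            ratings.append(tuple(fields))
        elif len(fields) == 4:
            readers[fields[0]] = fields[1]

    # Step 1: inner join of ratings with books on BookID.
    step1 = [(book_id, books[book_id], reader_id, rating)
             for reader_id, book_id, rating in ratings
             if book_id in books]
    # Step 2: inner join of that result with readers on ReaderID.
    return [(book_id, title, reader_id, rating, readers[reader_id])
            for book_id, title, reader_id, rating in step1
            if reader_id in readers]


# Sample data copied from the question.
sample = [
    "1|The Hunger Games", "2|To Kill a Mockingbird",
    "3|Pride and Prejudice", "4|Animal Farm",
    "101|1|1", "102|2|2", "103|3|3", "104|4|4",
    "105|1|5", "106|2|1", "107|3|2", "108|4|3",
    "101|M|1000|email", "102|F|1001|mobile",
    "103|M|1002|email", "104|F|1003|mobile",
    "105|M|1004|email", "106|F|1005|mobile",
    "107|M|1006|email", "108|F|1007|mobile",
]

for row in join_records(sample):
    print("\t".join(row))
```

Each printed row has the same five tab-separated columns as the mapper in the question (BookID, Title, ReaderID, Rating, Gender), in the order the ratings were read.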
Here is my suggested pandas code. To test it, I have simply concatenated all the sample input. In your code, comment out my for loop (the one over the concatenated lists) and uncomment the "for line in sys.stdin:" loop.
import pandas as pd
import sys

input_string_book = [
    "1|The Hunger Games",
    "2|To Kill a Mockingbird",
    "3|Pride and Prejudice",
    "4|Animal Farm"]
input_string_rating = [
    "101|1|1",
    "102|2|2",
    "103|3|3",
    "104|4|4",
    "105|1|5",
    "106|2|1",
    "107|3|2",
    "108|4|3"]
input_string_reader = [
    "101|M|1000|email",
    "102|F|1001|mobile",
    "103|M|1002|email",
    "104|F|1003|mobile",
    "105|M|1004|email",
    "106|F|1005|mobile",
    "107|M|1006|email",
    "108|F|1007|mobile"]

# Collect the parsed rows in plain lists; each DataFrame is built once afterwards.
book_rows, rating_rows, reader_rows = [], [], []

#for line in sys.stdin:
for line in input_string_book + input_string_rating + input_string_reader:
    splits = line.strip().split("|")
    if len(splits) == 2:
        book_rows.append(splits)
    elif len(splits) == 3:
        rating_rows.append(splits)
    else:
        reader_rows.append(splits[:4])

input_string_book_df = pd.DataFrame(book_rows, columns=['BookID', 'Title'])
input_string_rating_df = pd.DataFrame(rating_rows, columns=['ReaderID', 'BookID', 'Rating'])
input_string_reader_df = pd.DataFrame(reader_rows, columns=['ReaderID', 'Gender', 'PostCode', 'PreferComms'])

# Join books with ratings on BookID, then join the result with readers on ReaderID.
l_concat_1 = input_string_book_df.merge(input_string_rating_df, on='BookID', how='inner')
l_concat_2 = l_concat_1.merge(input_string_reader_df, on='ReaderID', how='inner')

for row in l_concat_2[['BookID', 'Title', 'ReaderID', 'Rating', 'Gender']].itertuples(index=False):
    print('%s\t%s\t%s\t%s\t%s' % tuple(row))
Output
1 The Hunger Games 101 1 M
1 The Hunger Games 105 5 M
2 To Kill a Mockingbird 102 2 F
2 To Kill a Mockingbird 106 1 F
3 Pride and Prejudice 103 3 M
3 Pride and Prejudice 107 2 M
4 Animal Farm 104 4 F
4 Animal Farm 108 3 F
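The joined rows still have to be averaged by Title and Gender to reach the output asked for in the question. A minimal sketch of that last step is below, assuming the rows are tab-separated in the column order printed by the code (BookID, Title, ReaderID, Rating, Gender). The function name average_by_title_gender is made up for illustration, and the sketch keeps all totals in a dictionary rather than relying on the sorted key order a streaming reducer would normally receive.

```python
from collections import defaultdict


def average_by_title_gender(rows):
    """Return {(Title, Gender): average rating} for tab-separated joined rows."""
    totals = defaultdict(lambda: [0.0, 0])  # (Title, Gender) -> [sum, count]
    for row in rows:
        book_id, title, reader_id, rating, gender = row.rstrip("\n").split("\t")
        acc = totals[(title, gender)]
        acc[0] += float(rating)
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Joined rows as produced above, with tabs between the columns.
joined = [
    "1\tThe Hunger Games\t101\t1\tM",
    "1\tThe Hunger Games\t105\t5\tM",
    "2\tTo Kill a Mockingbird\t102\t2\tF",
    "2\tTo Kill a Mockingbird\t106\t1\tF",
    "3\tPride and Prejudice\t103\t3\tM",
    "3\tPride and Prejudice\t107\t2\tM",
    "4\tAnimal Farm\t104\t4\tF",
    "4\tAnimal Farm\t108\t3\tF",
]

for (title, gender), avg in sorted(average_by_title_gender(joined).items()):
    # %g prints 3.0 as "3" and 3.5 as "3.5".
    print("%s\t%s\t%g" % (title, gender, avg))
```

In an actual reducer script the list would be replaced by sys.stdin, and for very large inputs the per-key totals could be emitted as soon as the key changes instead of being held in memory.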