Python MapReduce Hadoop Streaming Job that requires 3 input files?

I have 3 small sample input files (the actual files are much larger):

# File Name: books.txt
# File Format: BookID|Title
1|The Hunger Games
2|To Kill a Mockingbird
3|Pride and Prejudice
4|Animal Farm

# File Name: ratings.txt
# File Format: ReaderID|BookID|Rating
101|1|1
102|2|2
103|3|3
104|4|4
105|1|5
106|2|1
107|3|2
108|4|3

# File Name: readers.txt
# File Format: ReaderID|Gender|PostCode|PreferComms
101|M|1000|email
102|F|1001|mobile
103|M|1002|email
104|F|1003|mobile
105|M|1004|email
106|F|1005|mobile
107|M|1006|email
108|F|1007|mobile

I want to create a Python MapReduce Hadoop Streaming job that produces the following output: the average rating by Title and Gender. For example, Animal Farm is rated 4 by reader 104 and 3 by reader 108, both female, so its average is (4 + 3) / 2 = 3.5.

Animal Farm F   3.5
Pride and Prejudice M   2.5
The Hunger Games    M   3
To Kill a Mockingbird   F   1.5

I searched this forum and someone pointed out a solution, but it is for 2 input files instead of 3. I gave it a go but am stuck at the mapper, because I cannot sort the output so that the reducer reliably sees the Title and Gender record first in each group before it starts aggregating. My mapper code is below:

#!/usr/bin/env python
import sys

# Distinguish the three input files by field count: books.txt has 2 fields,
# ratings.txt has 3 and readers.txt has 4. Fields a record lacks stay "-1".
for line in sys.stdin:
    try:
        ReaderID = "-1"
        BookID = "-1"
        Title = "-1"
        Gender = "-1"
        Rating = "-1"

        splits = line.strip().split("|")

        if len(splits) == 2:        # books.txt: BookID|Title
            BookID = splits[0]
            Title = splits[1]
        elif len(splits) == 3:      # ratings.txt: ReaderID|BookID|Rating
            ReaderID = splits[0]
            BookID = splits[1]
            Rating = splits[2]
        else:                       # readers.txt: ReaderID|Gender|PostCode|PreferComms
            ReaderID = splits[0]
            Gender = splits[1]

        # Every record comes out in the same flat layout, so the sort key
        # (BookID) says nothing about which file a record came from -- this
        # is where the reducer loses track of which record carries the Title.
        print('%s\t%s\t%s\t%s\t%s' % (BookID, Title, ReaderID, Rating, Gender))

    except:
        pass
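
For reference, the usual trick in the 2-file case is to emit the join key plus a small source tag as a composite key, so Hadoop's sort delivers the Title record ahead of the rating records within each group. A minimal sketch of that idea for books.txt and ratings.txt only (the tag values and job options here are my own assumptions, not part of the solution I found):

#!/usr/bin/env python
# Sketch of a tagged mapper for a 2-file reduce-side join of books.txt and
# ratings.txt. The second key field is a tag: "0" sorts before "1", so the
# reducer sees the Title record first within each BookID group.
import sys

for line in sys.stdin:
    splits = line.strip().split("|")
    if len(splits) == 2:                    # books.txt: BookID|Title
        print('%s\t0\t%s' % (splits[0], splits[1]))
    elif len(splits) == 3:                  # ratings.txt: ReaderID|BookID|Rating
        print('%s\t1\t%s\t%s' % (splits[1], splits[0], splits[2]))

This would be run with something like -D stream.num.map.output.key.fields=2 and -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner with -D mapreduce.partition.keypartitioner.options=-k1,1, so records are partitioned on BookID alone but sorted on BookID plus the tag. The catch is that readers.txt joins on ReaderID, not BookID, which is why a single job of this shape cannot absorb the third file.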

PS: I need to use Python and Hadoop Streaming only. I am not allowed to install Python packages like Dumbo, mrjob, etc.

Appreciate your help in advance.

Thanks, Lobbie

I went through some core Java MR material, and all of it suggests that the three files cannot be merged in a single map job: you have to join the first two, and then join the result with the third. Applying your logic to all three at once did not give me a good result, so I tried Pandas, which seems to give a promising result. If using pandas is not a constraint for you, please try my code. Otherwise, we can try to join these three files with Python dictionaries and lists (a rough sketch of that idea follows below).
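
For reference, a plain-Python version of that dictionary join might look like the sketch below. It assumes books.txt and readers.txt are small enough to hold in memory and are shipped to every mapper (for example with Hadoop Streaming's -files option), while the large ratings.txt is streamed through stdin; the file paths and this setup are assumptions on my part.

#!/usr/bin/env python
# Sketch of a map-side (replicated) join: the two small lookup files are
# loaded into dictionaries and only ratings.txt flows through stdin.
import sys

books = {}                                   # BookID -> Title
with open('books.txt') as f:
    for line in f:
        book_id, title = line.strip().split('|')
        books[book_id] = title

readers = {}                                 # ReaderID -> Gender
with open('readers.txt') as f:
    for line in f:
        fields = line.strip().split('|')     # ReaderID|Gender|PostCode|PreferComms
        readers[fields[0]] = fields[1]

for line in sys.stdin:                       # ratings.txt: ReaderID|BookID|Rating
    reader_id, book_id, rating = line.strip().split('|')
    if book_id in books and reader_id in readers:
        # Title and Gender form the key; the reducer only has to average.
        print('%s\t%s\t%s' % (books[book_id], readers[reader_id], rating))

With -D stream.num.map.output.key.fields=2 the first two tab-separated fields (Title and Gender) act as the key, so each reducer receives all ratings for one Title/Gender pair together.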

Here is my suggested code. I have simply concatenated all of the sample inputs to test it. In your code, just comment out my for loop over the concatenated lists and un-comment the for loop over sys.stdin.

import pandas as pd
import sys

input_string_book = [
"1|The Hunger Games",
"2|To Kill a Mockingbird",
"3|Pride and Prejudice",
"4|Animal Farm"]

input_string_rating = [
"101|1|1",
"102|2|2",
"103|3|3",
"104|4|4",
"105|1|5",
"106|2|1",
"107|3|2",
"108|4|3"]

input_string_reader = [
"101|M|1000|email",
"102|F|1001|mobile",
"103|M|1002|email",
"104|F|1003|mobile",
"105|M|1004|email",
"106|F|1005|mobile",
"107|M|1006|email",
"108|F|1007|mobile"]

# Collect the rows in plain lists and build each DataFrame once at the end;
# appending to a DataFrame row by row is slow, and DataFrame.append was
# removed in pandas 2.0.
book_rows, rating_rows, reader_rows = [], [], []

#for line in sys.stdin:
for line in input_string_book + input_string_rating + input_string_reader:
    splits = line.strip().split("|")
    if len(splits) == 2:          # books.txt: BookID|Title
        book_rows.append(splits)
    elif len(splits) == 3:        # ratings.txt: ReaderID|BookID|Rating
        rating_rows.append(splits)
    else:                         # readers.txt: ReaderID|Gender|PostCode|PreferComms
        reader_rows.append(splits)

input_string_book_df = pd.DataFrame(book_rows, columns=('BookID', 'Title'))
input_string_rating_df = pd.DataFrame(rating_rows, columns=('ReaderID', 'BookID', 'Rating'))
input_string_reader_df = pd.DataFrame(reader_rows, columns=('ReaderID', 'Gender', 'PostCode', 'PreferComms'))

# Join books to ratings on BookID, then the result to readers on ReaderID.
l_concat_1 = input_string_book_df.merge(input_string_rating_df, on='BookID', how='inner')
l_concat_2 = l_concat_1.merge(input_string_reader_df, on='ReaderID', how='inner')

for _, row in l_concat_2[['BookID', 'Title', 'ReaderID', 'Rating', 'Gender']].iterrows():
    print('%s\t%s\t%s\t%s\t%s' % (row['BookID'], row['Title'], row['ReaderID'], row['Rating'], row['Gender']))

Output

1       The Hunger Games        101     1       M
1       The Hunger Games        105     5       M
2       To Kill a Mockingbird   102     2       F
2       To Kill a Mockingbird   106     1       F
3       Pride and Prejudice     103     3       M
3       Pride and Prejudice     107     2       M
4       Animal Farm     104     4       F
4       Animal Farm     108     3       F
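
A reducer to go with the mapper output above could then compute the averages. This is only a sketch I am adding for completeness: it assumes the default Hadoop Streaming key (the first tab-separated field, BookID), so all records for one book reach the same reducer contiguously and the per-book totals can be flushed whenever the BookID changes.

#!/usr/bin/env python
# Sketch of a reducer for the BookID\tTitle\tReaderID\tRating\tGender records
# above: accumulate (Title, Gender) totals within each BookID group and
# print the average rating when the group ends.
import sys

def flush(totals):
    for (title, gender), (total, count) in sorted(totals.items()):
        print('%s\t%s\t%g' % (title, gender, total / count))

totals = {}            # (Title, Gender) -> (sum of ratings, count)
current_book = None

for line in sys.stdin:
    book_id, title, reader_id, rating, gender = line.strip().split('\t')
    if current_book is not None and book_id != current_book:
        flush(totals)  # BookID changed: this book's group is complete
        totals = {}
    current_book = book_id
    s, c = totals.get((title, gender), (0.0, 0))
    totals[(title, gender)] = (s + float(rating), c + 1)

if totals:
    flush(totals)      # do not forget the final group

On the sample data this prints exactly the Title/Gender averages asked for in the question, though the groups come out in BookID order, so a final sort by Title may still be needed.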
