
How to use a file in a hadoop streaming job using python?

I want to read a list from a file in my Hadoop Streaming job. Here is my simple mapper.py:

#!/usr/bin/env python

import sys
import json

def read_file():
    id_list = []
    # read ids from a file
    with open('../user_ids', 'r') as f:
        for line in f:
            id_list.append(line.strip())
    return id_list

if __name__ == '__main__':
    id_list = set(read_file())
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        line = json.loads(line)
        user_id = line['user']['id']
        if str(user_id) in id_list:
            print '%s\t%s' % (user_id, line)

and here is my reducer.py:

#!/usr/bin/env python

from operator import itemgetter
import sys

current_id = None
current_list = []
id = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    id, line = line.split('\t', 1)

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: the user id) before it is passed to the reducer
    if current_id == id:
        current_list.append(line)
    else:
        if current_id:
            # write result to STDOUT
            print '%s\t%s' % (current_id, current_list)
        current_id = id
        current_list = [line]

# do not forget to output the last key if needed!
if current_id == id:
    print '%s\t%s' % (current_id, current_list)

Now to run it I use:

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
    -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
    -input test/input.txt  -output test/output -file '../user_ids' 

The job starts to run:

13/11/07 05:04:52 INFO streaming.StreamJob:  map 0%  reduce 0%
13/11/07 05:05:21 INFO streaming.StreamJob:  map 100%  reduce 100%
13/11/07 05:05:21 INFO streaming.StreamJob: To kill this job, run:

I get the error:

job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.         LastFailedTask: task_201309172143_1390_m_000001
13/11/07 05:05:21 INFO streaming.StreamJob: killJob...

When I do not read the ids from the file ../user_ids it does not give me any errors, so I think the problem is that it cannot find my ../user_ids file. I have also tried the location in HDFS and it still did not work. Thanks for your help.

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
  -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
  -input test/input.txt  -output test/output -file '../user_ids'

Does ../user_ids exist on your local file path when you execute the job? If it does, then you need to amend your mapper code to account for the fact that this file will be available in the local working directory of the mapper at runtime:

f = open('user_ids','r')
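
A minimal sketch of the amended read_file(), assuming the rest of mapper.py stays unchanged (per the point above, -file ships user_ids into each task's local working directory):

def read_file():
    id_list = []
    # user_ids was shipped via -file and lands in the task's
    # working directory, so open it by its base name
    with open('user_ids', 'r') as f:
        for line in f:
            id_list.append(line.strip())
    return id_list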

Try giving the full path to the file, or make sure you are in the same directory as the user_ids file when you execute the hadoop command.
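
For example, a sketch of the full-path variant (the path below is a placeholder for wherever user_ids actually lives; the mapper should still open the file by its base name, user_ids):

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar \
    -file ./mapper.py -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -input test/input.txt -output test/output \
    -file /full/path/to/user_ids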
