
Using files in Hadoop Streaming with Python

I am completely new to Hadoop and MapReduce and am trying to work my way through it. I am trying to develop a MapReduce application in Python that uses data from two .csv files. The mapper simply reads the two files and prints the key-value pairs from them to sys.stdout.

The program runs fine when I use it on a single machine, but with Hadoop Streaming I get an error. I think I am making a mistake in how I read the files in the mapper on Hadoop. Please help me out with the code, and tell me how to handle files in Hadoop Streaming. The mapper.py code is below (you can understand it from the comments):

#!/usr/bin/env python
import sys
from numpy import genfromtxt

def read_input(inVal):
    for line in inVal:
        # strip leading/trailing whitespace from each input line
        yield line.strip()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    labels=[]
    data=[]    
    incoming = read_input(sys.stdin)
    for vals in incoming:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited;
        if len(vals) > 10:
            data.append(vals)
        else:
            labels.append(vals)

    for i in range(0, len(labels)):
        # print already appends a newline, so no extra "\n" is needed
        print "%s%s%s" % (labels[i], separator, data[i])


if __name__ == "__main__":
    main()

There are 60000 records fed into this mapper from two .csv files, as follows (on a single machine, not a Hadoop cluster):

cat mnist_train_labels.csv mnist_train_data.csv | ./mapper.py

I was able to resolve the issue after searching for a solution for about 3 days.

The problem is with the newer version of Hadoop (2.2.0 in my case). The mapper code, when reading values from the files, was returning a non-zero exit code at some point (maybe because it was reading a huge list of values (784) at a time). There is a setting in Hadoop 2.2.0 that tells the Hadoop system to treat this as a general error (subprocess failed with code 1). This setting is true by default. I just had to set this property to false, and my code ran without any errors.

The setting is stream.non.zero.exit.is.failure. Just set it to false when streaming, so the streaming command would look something like:

hadoop jar ... -D stream.non.zero.exit.is.failure=false ...
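
For illustration only, a more complete invocation might look roughly like the sketch below. The streaming jar path, HDFS input/output paths, and everything other than the -D flag are placeholder assumptions for the example, not values from the original post:

hadoop jar /path/to/hadoop-streaming-2.2.0.jar \
    -D stream.non.zero.exit.is.failure=false \
    -input /user/me/mnist_input \
    -output /user/me/mnist_output \
    -mapper mapper.py \
    -file mapper.py

Note that the -D generic option has to come before the streaming-specific options such as -input and -mapper.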

Hope it helps someone, and saves 3 days... ;)

You didn't post your error. In streaming you need to pass the -file argument or a -input, so that the file is either uploaded with your streaming job or Hadoop knows where to find it on HDFS.
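
As a rough sketch of the difference (all paths and file names below are made up for the example, not taken from the question):

# -input points at data already uploaded to HDFS;
# -file ships a local file (here, the mapper script) along with the job.
hadoop jar hadoop-streaming.jar \
    -input /user/me/mnist_train_labels.csv \
    -input /user/me/mnist_train_data.csv \
    -output /user/me/mapper_output \
    -mapper mapper.py \
    -file mapper.py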
