
Unable to read Hadoop Sequence files through stdin with a streaming python map-reduce on AWS

I am trying to run a simple word-counting map-reduce job on Amazon's Elastic MapReduce, but the output is gibberish. The input file is part of the Common Crawl corpus, which consists of Hadoop sequence files. The file is supposed to contain the extracted text (stripped of HTML) from the crawled web pages.

My AWS Elastic MapReduce step looks like this:

Mapper: s3://com.gpanterov.scripts/mapper.py
Reducer: s3://com.gpanterov.scripts/reducer.py
Input S3 location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Output S3 location: s3://com.gpanterov.output/job3/

The job runs successfully, however the output is gibberish: there are only weird symbols and no words at all. I am guessing this is because Hadoop sequence files cannot be read as text through standard input. But then how do you run a map-reduce job on such a file? Do we have to convert the sequence files to text files first?

The first couple of lines from part-00000 look like this:

'\x00\x00\x87\xa0 was found 1 times\t\n'
'\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x05\xc1=K\x02\x01\x00\x80a\xf0\xbc\xf3N\xbd\x0f\xaf\x145\xcdJ!#T\x94\x88ZD\x89\x027i\x08\x8a\x86\x16\x97lp0\x02\x87 was found 1 times\t\n'
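
Those bytes are raw SequenceFile data rather than words. As a quick sanity check (assuming the file is reachable via hadoop fs, as in the -text command further down, and that xxd is available), the first bytes of the input show the SequenceFile magic header instead of readable page text:

# A SequenceFile begins with the magic 'SEQ' plus a version byte,
# and that binary stream is what a line-oriented mapper ends up
# reading as "text".
hadoop fs -cat textData-00112 | head -c 4 | xxd
# expected output, roughly:
# 00000000: 5345 5106                                SEQ.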

Here is my mapper:

#!/usr/bin/env python

import sys

# Read raw lines from stdin and emit a (word, 1) pair per token
for line in sys.stdin:
    words = line.split()
    for word in words:
        print word + "\t" + str(1)

And my reducer:

#!/usr/bin/env python

import sys

def output(previous_key, total):
    # Emit the final count for a key once all of its values are seen
    if previous_key is not None:
        print previous_key + " was found " + str(total) + " times"

previous_key = None
total = 0

for line in sys.stdin:
    # Input arrives sorted by key, so a change in key means the
    # previous word's count is complete
    key, value = line.split("\t", 1)
    if key != previous_key:
        output(previous_key, total)
        previous_key = key
        total = 0
    total += int(value)

output(previous_key, total)
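
Both scripts can be smoke-tested locally on plain text, using sort to stand in for Hadoop's shuffle phase:

echo "foo bar foo" | python mapper.py | sort | python reducer.py
# bar was found 1 times
# foo was found 2 times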

There is nothing wrong with the input file. On a local machine I ran hadoop fs -text textData-00112 | less and this returns pure text from the web pages. Any input on how to run a python streaming map-reduce job on these types of input files (Common Crawl Hadoop sequence files) is much appreciated.

You need to provide SequenceFileAsTextInputFormat as the input format to the Hadoop streaming jar.

I have never used Amazon AWS MapReduce, but on a normal Hadoop installation it would be done like this:

HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input <input_directory> \
  -output <output_directory> \
  -mapper "mapper.py" \
  -reducer "reducer.py" \
  -inputformat SequenceFileAsTextInputFormat

The suggestion by Sunny Nanda fixed the issue. Adding -inputformat SequenceFileAsTextInputFormat to the extra arguments box in the AWS Elastic MapReduce API worked, and the output from the job is as expected.
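
For reference, the equivalent single streaming invocation with the paths from the question would look something like this (a sketch based on the command above; EMR resolves the S3 script locations from the step fields itself, while a plain Hadoop install would ship the scripts with -files and pass bare script names instead):

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112 \
  -output s3://com.gpanterov.output/job3/ \
  -mapper s3://com.gpanterov.scripts/mapper.py \
  -reducer s3://com.gpanterov.scripts/reducer.py \
  -inputformat SequenceFileAsTextInputFormat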
