Unable to read Hadoop sequence files via stdin with a streaming Python map-reduce job on AWS

Question

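I am trying to run a simple word count map-reduce job on Amazon's Elastic MapReduce, but the output is gibberish. The input file is part of the Common Crawl files, which are Hadoop sequence files. The file is supposed to contain the text extracted (stripped of HTML) from the crawled web pages.

My AWS Elastic MapReduce step looks like this: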

Mapper: s3://com.gpanterov.scripts/mapper.py
Reducer: s3://com.gpanterov.scripts/reducer.py
Input S3 location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Output S3 location: s3://com.gpanterov.output/job3/

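The job runs successfully, but the output is gibberish: only weird symbols and no words at all. I am guessing this is because Hadoop sequence files cannot be read through standard input? If so, how do you run an MR job on such a file? Do we have to convert the sequence files to text files first?

The first few lines of part-00000 look like this: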

'\x00\x00\x87\xa0 was found 1 times\t\n'
'\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x05\xc1=K\x02\x01\x00\x80a\xf0\xbc\xf3N\xbd\x0f\xaf\x145\xcdJ!#T\x94\x88ZD\x89\x027i\x08\x8a\x86\x16\x97lp0\x02\x87 was found 1 times\t\n'
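One detail in the second line is worth noting: the bytes \x1f\x8b\x08 are the signature that opens a gzip stream, which suggests (but does not prove) that the mapper was handed raw, compressed record bytes rather than decoded text. The following is a minimal sketch of that check; the helper name describe_bytes is made up for illustration, and the sample is a shortened copy of the key shown above.

```python
# Byte signatures: a Hadoop SequenceFile starts with "SEQ",
# and a gzip stream starts with 1f 8b followed by 08 (deflate).
SEQ_MAGIC = b"SEQ"
GZIP_MAGIC = b"\x1f\x8b\x08"

def describe_bytes(data):
    """Return a list of hints about what a chunk of raw bytes contains."""
    findings = []
    if data.startswith(SEQ_MAGIC):
        findings.append("SequenceFile header")
    if GZIP_MAGIC in data:
        findings.append("gzip magic at offset %d" % data.index(GZIP_MAGIC))
    return findings

# Shortened copy of the garbled key from part-00000.
sample = b"\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x05\xc1=K"
print(describe_bytes(sample))
```

Running the same helper on the first few bytes of the input file itself would be a quick way to confirm that it is a sequence file before submitting a job.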

Here is my mapper:

#!/usr/bin/env python

import sys

# Emit "<word>\t1" for every whitespace-separated token read from stdin.
for line in sys.stdin:
    words = line.split()
    for word in words:
        print(word + "\t" + str(1))

And my reducer:

#!/usr/bin/env python

import sys

def output(previous_key, total):
    # Print the total for a finished key; nothing is printed before the first key.
    if previous_key is not None:
        print(previous_key + " was found " + str(total) + " times")

previous_key = None
total = 0

# Streaming delivers mapper output sorted by key, so equal keys are adjacent.
for line in sys.stdin:
    key, value = line.split("\t", 1)
    if key != previous_key:
        output(previous_key, total)
        previous_key = key
        total = 0
    total += int(value)

output(previous_key, total)
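To rule out the mapper and reducer logic as the cause of the garbage, the same word count can be checked locally on plain text without Hadoop. This is only a sketch: map_words and reduce_counts are illustrative names, sorted() stands in for the sort that Hadoop performs between the map and reduce phases, and the sample lines are made up.

```python
from itertools import groupby

def map_words(lines):
    """Yield (word, 1) for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_counts(pairs):
    """Sum the counts per word; sorted() mimics Hadoop's sort-by-key step."""
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

sample_lines = ["the quick brown fox", "the lazy dog and the fox"]
for word, total in reduce_counts(map_words(sample_lines)):
    print("%s was found %d times" % (word, total))
```

If this prints sensible counts, the scripts are fine and the problem lies in how the input reaches the mapper.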

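There is nothing wrong with the input file. On my local machine I ran hadoop fs -text textData-00112 | less, and it returned plain text from web pages. Any input on how to run a Python streaming MapReduce job on these kinds of input files (Common Crawl Hadoop sequence files) is much appreciated.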

Answer 1

You need to pass SequenceFileAsTextInputFormat as the inputformat to the Hadoop streaming jar.

I have never used Amazon AWS MapReduce, but on a normal Hadoop installation it would be done like this:

HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input <input_directory> \
  -output <output_directory> \
  -mapper "mapper.py" \
  -reducer "reducer.py" \
  -inputformat SequenceFileAsTextInputFormat
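With this input format, each record reaches the mapper on stdin as the key and the value converted to strings and separated by a tab. For a word count it may therefore help to drop the key before splitting. The sketch below assumes the key is the page URL and the value is the extracted text, which is not confirmed in this thread; map_record and the example.com URL are made up for illustration. A value that spans several lines may arrive as continuation lines without a key, so a line with no tab is treated as text.

```python
def map_record(line):
    """Return (word, 1) pairs for one stdin line of a streaming mapper."""
    # partition() splits on the first tab only; sep is "" if there is no tab.
    key, sep, value = line.rstrip("\n").partition("\t")
    # Assumption: the part before the tab is the record key (e.g. a URL).
    text = value if sep else key
    return [(word, 1) for word in text.split()]

def run(lines, write):
    """Feed lines through map_record and write "<word>\t<count>" rows."""
    for line in lines:
        for word, count in map_record(line):
            write("%s\t%d\n" % (word, count))

# Hypothetical sample record: key, tab, value.
emitted = []
run(["http://example.com/\thello world hello\n"], emitted.append)
print("".join(emitted))
```

In an actual streaming job the mapper script would call run(sys.stdin, sys.stdout.write) after importing sys.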

Answer 2

Sunny Nanda's suggestion solved the problem. Adding -inputformat SequenceFileAsTextInputFormat to the extra arguments box of the AWS Elastic MapReduce step worked, and the output from the job is as expected.


Answer 1: score 1, accepted, 2014-01-19 11:06:39

Answer 2: score 0, 2014-01-19 22:21:14
