Broken Pipe error causes streaming Elastic MapReduce job on AWS to fail

Question

Everything works fine locally when I run the following:

cat input | python mapper.py | sort | python reducer.py

However, when I run the streaming MapReduce job on AWS Elastic MapReduce, the job does not complete successfully. mapper.py runs partway through (I know this because it writes to stderr along the way). The mapper is interrupted by a "Broken pipe" error, which I can retrieve from the syslog of the task attempt after it fails:

java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:282)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:109)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


2012-03-26 07:19:05,400 WARN org.apache.hadoop.streaming.PipeMapRed (main): java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:282)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:124)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

2012-03-26 07:19:05,400 INFO org.apache.hadoop.streaming.PipeMapRed (main): mapRedFinished
2012-03-26 07:19:05,400 WARN org.apache.hadoop.streaming.PipeMapRed (main): java.io.IOException: Bad file descriptor
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:282)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:579)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

2012-03-26 07:19:05,400 INFO org.apache.hadoop.streaming.PipeMapRed (main): mapRedFinished
2012-03-26 07:19:05,405 INFO org.apache.hadoop.streaming.PipeMapRed (Thread-13): MRErrorThread done
2012-03-26 07:19:05,408 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-03-26 07:19:05,519 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2012-03-26 07:19:05,520 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Got UserName hadoop for UID 106 from the native implementation
2012-03-26 07:19:05,522 WARN org.apache.hadoop.mapred.Child (main): Error running child
java.io.IOException: log:null
R/W/S=7018/3/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |text/html    1|
Date: Mon Mar 26 07:19:05 UTC 2012
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:282)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:109)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:125)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
2012-03-26 07:19:05,525 INFO org.apache.hadoop.mapred.Task (main): Runnning cleanup for the task
2012-03-26 07:19:05,526 INFO org.apache.hadoop.mapred.DirectFileOutputCommitter (main): Nothing to clean up on abort since there are no temporary files written

Here is mapper.py. Note that I write to stderr to give myself debugging info:

#!/usr/bin/env python

import sys
from warc import ARCFile

def main():
    warc_file = ARCFile(fileobj=sys.stdin)
    for web_page in warc_file:
        print >> sys.stderr, '%s\t%s' % (web_page.header.content_type, 1) #For debugging
        print '%s\t%s' % (web_page.header.content_type, 1)
    print >> sys.stderr, 'done' #For debugging
if __name__ == "__main__":
    main()

Here is what I get in stderr for the task attempt when mapper.py runs:

text/html   1
text/html   1
text/html   1

Basically, the loop runs 3 times and then stops abruptly without Python throwing any error. (Note: it should be outputting thousands of lines.) Even an uncaught exception should appear in stderr.

Because the MapReduce job runs completely fine on my local computer, my guess is that this is a problem with how Hadoop handles the output I'm printing from mapper.py. But I'm clueless as to what the problem could be.

Answer 1

Your streaming process (your Python script) is terminating prematurely. This may be due to it thinking the input is complete (e.g. interpreting an EOF) or to a swallowed exception. Either way, Hadoop is trying to feed input to your script via STDIN, but since the application has terminated (and thus STDIN is no longer a valid file descriptor), you're getting a Broken pipe error. I would suggest adding stderr traces in your script to see which line of input is causing the problem. Happy coding,

-Geoff
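
A minimal sketch of the stderr tracing suggested above, assuming Python 3 (the question's script is Python 2). The name run_with_tracing and its parameters are hypothetical, not part of Hadoop or warc. It counts records and, if anything raises (including the iterator that parses the input), writes the count and the traceback to stderr before re-raising, so the task attempt's stderr log shows where the script stopped:

```python
import sys
import traceback


def run_with_tracing(records, handle, log=sys.stderr):
    """Call handle(record) for each record and report progress to log.

    Returns the number of records handled successfully.
    """
    count = 0
    try:
        for record in records:
            handle(record)
            count += 1
    except Exception:
        # Report how far we got plus the full traceback, then re-raise so
        # the task fails visibly instead of exiting silently.
        log.write('failed after %d records\n' % count)
        traceback.print_exc(file=log)
        log.flush()
        raise
    # Reaching this line means the iterator itself decided the input was over.
    log.write('input exhausted after %d records\n' % count)
    log.flush()
    return count
```

In the question's mapper, records would be the ARCFile iterator and handle a function that prints the content type. A count of 3 followed by "input exhausted" would point at the parser seeing an early end of input rather than at an exception.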

Answer 2

This is said in the accepted answer, but let me attempt to clarify: you must block on stdin, even if you don't need it! This is not the same as Linux pipes, so don't let that fool you. What happens, intuitively, is that Streaming stands up your executable, then says, "Wait here while I go get input for you". If your executable stops for any reason before Streaming has sent you 100% of the input, Streaming says, "Hey, where did that executable go that I stood up? Hmmm... the pipe is broken, let me raise that exception!" So here is some Python code. All it does is what cat does, but note that it won't exit until all input is processed, and that is the key point:

#!/usr/bin/python
import sys

while True:
    s = sys.stdin.readline()
    if not s:
        break
    sys.stdout.write(s)
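
The same idea can be applied to a mapper that does real work. The following is only a sketch, assuming Python 3; drain and run_mapper are hypothetical helper names. Whatever happens inside the processing function (early return or exception), the finally block keeps reading stdin until EOF, so Streaming never writes into a closed pipe and a genuine failure surfaces as the script's own error instead:

```python
import sys


def drain(stream, chunk_size=65536):
    """Read and discard everything left on stream; return the amount discarded."""
    # Prefer the raw byte stream when there is one (sys.stdin.buffer in
    # Python 3) so undecodable input cannot raise while we throw it away.
    raw = getattr(stream, 'buffer', stream)
    discarded = 0
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        discarded += len(chunk)
    return discarded


def run_mapper(process, stdin=sys.stdin, stderr=sys.stderr):
    """Run process(stdin), then consume the rest of the input no matter what."""
    try:
        process(stdin)
    finally:
        leftover = drain(stdin)
        if leftover:
            # A non-zero count means process() stopped before end of input.
            stderr.write('discarded %d unread bytes of input\n' % leftover)
```

If the message appears in the task's stderr log, the mapper logic gave up early, and that is the place to look.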

Answer 3

I have no experience with Hadoop on AWS, but I had the same error on a regular Hadoop cluster. In my case the problem was how I started Python: -mapper ./mapper.py -reducer ./reducer.py worked, but -mapper python mapper.py didn't.

You also seem to use a non-standard Python package, warc. Do you submit the necessary files to the streaming job? -cacheFile or -cacheArchive could be helpful.
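
Launching the script as ./mapper.py only works if the file starts with a shebang line and has its executable bit set. A small pre-flight check along these lines can be run before submitting the job; it is a sketch, assuming a Unix-like machine, and preflight is a hypothetical helper name:

```python
import os


def preflight(path):
    """Return a list of reasons why launching path as ./script might fail."""
    problems = []
    with open(path, 'rb') as f:
        first_line = f.readline()
    if not first_line.startswith(b'#!'):
        problems.append('no shebang line')
    if b'\r' in first_line:
        # A carriage return becomes part of the interpreter name.
        problems.append('Windows line ending on the shebang line')
    if not os.access(path, os.X_OK):
        problems.append('not executable (try chmod +x)')
    return problems
```

An empty list means the script passes these basic checks. It says nothing about whether warc is installed on the task nodes, which still has to be shipped with the job or installed there.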

3 answers

Answer 1: 11, accepted, 2012-03-29 06:32:54

Answer 2: 8, 2014-08-14 16:57:26

Answer 3: 1, 2012-03-29 16:59:50
