
Chaining multiple mapreduce tasks in Hadoop streaming

I am in a scenario where I have two mapreduce jobs. I am more comfortable with Python and plan to use it to write the mapreduce scripts, running them with Hadoop streaming. Is there a convenient way to chain both jobs in the following form when Hadoop streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard of many ways to accomplish this in Java, but I need something for Hadoop streaming.

Here is a great blog post on how to use Cascading with Streaming: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note that the Cascade object in Cascading allows you to chain multiple Flows (per the blog post above, your streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading.

You can try out Yelp's MRJob to get your job done. It's an open-source MapReduce library that lets you write chained jobs that run atop Hadoop Streaming on your Hadoop cluster or on EC2. It's pretty elegant and easy to use, and it has a method called steps which you can override to specify the exact chain of mappers and reducers that you want your data to go through.

Check out the source at https://github.com/Yelp/mrjob
and the documentation at http://packages.python.org/mrjob/
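
For concreteness, here is a minimal sketch of a two-step chain written against mrjob's steps/MRStep API; the class name and per-step method names are my own placeholders, and older mrjob releases used self.mr(...) in place of MRStep:

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class ChainedJob(MRJob):
        """Map1 -> Reduce1 -> Map2 -> Reduce2 as two streaming steps."""

        def steps(self):
            # Each MRStep becomes one Hadoop Streaming job; mrjob feeds
            # the output of the first step into the second automatically.
            return [
                MRStep(mapper=self.mapper_count, reducer=self.reducer_count),
                MRStep(mapper=self.mapper_invert, reducer=self.reducer_group),
            ]

        def mapper_count(self, _, line):
            for word in line.split():
                yield word, 1

        def reducer_count(self, word, counts):
            yield word, sum(counts)

        def mapper_invert(self, word, count):
            # Flip (word, count) so the second reduce groups by count.
            yield count, word

        def reducer_group(self, count, words):
            for word in words:
                yield count, word

    if __name__ == "__main__":
        ChainedJob.run()

You can run this locally with python chainedjob.py input.txt, or submit it to your cluster with the -r hadoop runner flag.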

Typically, the way I do this with Hadoop streaming and Python is from within the bash script that I create to run the jobs in the first place. I always run from a bash script; this way I can get emails on errors and emails on success, and I can make things more flexible by passing in parameters from another Ruby or Python script wrapping it, which can work in a larger event-processing system.

So, the output of the first command (job) is the input to the next command (job), and these can be variables in your bash script passed in as arguments from the command line (simple and quick).
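
Since the question is about Python, here is the same driver idea sketched in Python instead of bash; the streaming jar path, script names, and HDFS paths are placeholders for your own setup:

    import subprocess

    # Placeholder path; point this at your distribution's streaming jar.
    STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

    def run_streaming(mapper, reducer, input_path, output_path):
        cmd = [
            "hadoop", "jar", STREAMING_JAR,
            "-input", input_path,
            "-output", output_path,
            "-mapper", mapper,
            "-reducer", reducer,
            "-file", mapper,
            "-file", reducer,
        ]
        # Raise (and stop the chain) if a job fails, like `set -e` in bash.
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        # The output of job 1 is the input of job 2.
        run_streaming("map1.py", "reduce1.py", "data/input", "data/stage1")
        run_streaming("map2.py", "reduce2.py", "data/stage1", "data/final")

From here it is easy to wrap each call with the success/error notifications mentioned above.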

You might want to check out Oozie ( http://yahoo.github.com/oozie/design.html ), a workflow engine for Hadoop that will help to do this as well (it supports streaming, not a problem). I did not have this when I started, so I ended up having to build my own thing, but this is a cool system and useful!

If you are already writing your mapper and reducer in Python, I would consider using Dumbo, where such an operation is straightforward. The sequence of your map/reduce jobs, your mappers, reducers, etc. are all in one Python script that can be run from the command line, as sketched below.
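
A rough sketch of what that chaining looks like, based on Dumbo's multi-iteration API as I understand it; treat the additer/main usage here as an assumption to verify against the Dumbo docs:

    def mapper1(key, value):
        for word in value.split():
            yield word, 1

    def reducer1(word, counts):
        yield word, sum(counts)

    def mapper2(word, count):
        # Second pass: regroup the first pass's output by count.
        yield count, word

    def reducer2(count, words):
        for word in words:
            yield count, word

    def runner(job):
        # Each additer() call appends one map/reduce iteration to the chain.
        job.additer(mapper1, reducer1)
        job.additer(mapper2, reducer2)

    if __name__ == "__main__":
        import dumbo
        dumbo.main(runner)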
