
Chaining multiple MapReduce tasks in Hadoop streaming

I am in a scenario where I have two MapReduce jobs. I am more comfortable with Python and plan to use it to write the MapReduce scripts, running them with Hadoop streaming. Is there a convenient way to chain both jobs in the following form when Hadoop streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard of a lot of methods to accomplish this in Java, but I need something for Hadoop streaming.

Here is a great blog post on how to use Cascading with Streaming: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note that the Cascade object in Cascading allows you to chain multiple Flows (via the above blog post, your streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading

You can try out Yelp's mrjob to get your job done. It's an open-source MapReduce library that allows you to write chained jobs that can be run atop Hadoop streaming on your Hadoop cluster or on EC2. It's pretty elegant and easy to use, and it has a method called steps which you can override to specify the exact chain of mappers and reducers that you want your data to go through.
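
For example, here is a minimal sketch of a two-step job (the class, method, and step names are illustrative, and this assumes a version of mrjob that provides MRStep; older releases used a self.mr(...) helper in steps() instead):

from mrjob.job import MRJob
from mrjob.step import MRStep


class MRChainedJob(MRJob):
    """Hypothetical two-step job: count words, then find the most frequent one."""

    def steps(self):
        # Override steps() to declare the Map1 -> Reduce1 -> Map2 -> Reduce2 chain.
        return [
            MRStep(mapper=self.mapper_count, reducer=self.reducer_count),
            MRStep(mapper=self.mapper_invert, reducer=self.reducer_max),
        ]

    def mapper_count(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_count(self, word, counts):
        yield word, sum(counts)

    def mapper_invert(self, word, total):
        # Route every (count, word) pair to a single key so one reducer sees them all.
        yield None, (total, word)

    def reducer_max(self, _, count_word_pairs):
        total, word = max(count_word_pairs)
        yield word, total


if __name__ == "__main__":
    MRChainedJob.run()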

Check out the source at https://github.com/Yelp/mrjob and the documentation at http://packages.python.org/mrjob/

Typically, the way I do this with Hadoop streaming and Python is from within the bash script that I create to run the jobs in the first place. I always run from a bash script; that way I can get emails on errors and on success, and I can make the jobs more flexible by passing in parameters from another Ruby or Python script wrapping it, which can work in a larger event-processing system.

So the output of the first command (job) is the input to the next command (job), which can be variables in your bash script passed in as arguments from the command line (simple and quick).
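
The same pattern also works from a thin Python wrapper around the streaming jar. Here is a minimal sketch, assuming a typical streaming-jar location and placeholder directory and script names (map1.py, reduce1.py, and so on are hypothetical):

import subprocess

# Hypothetical paths -- adjust to your cluster's layout.
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"
INPUT_DIR = "/data/input"
INTERMEDIATE_DIR = "/data/intermediate"
OUTPUT_DIR = "/data/output"


def run_streaming_job(input_dir, output_dir, mapper, reducer):
    """Launch one Hadoop streaming job and fail loudly if it errors."""
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_dir,
        "-output", output_dir,
        "-mapper", mapper,
        "-reducer", reducer,
        "-file", mapper,
        "-file", reducer,
    ]
    subprocess.check_call(cmd)


if __name__ == "__main__":
    # Job 1's output directory becomes job 2's input directory.
    run_streaming_job(INPUT_DIR, INTERMEDIATE_DIR, "map1.py", "reduce1.py")
    run_streaming_job(INTERMEDIATE_DIR, OUTPUT_DIR, "map2.py", "reduce2.py")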

You might want to check out Oozie (http://yahoo.github.com/oozie/design.html), a workflow engine for Hadoop that will help to do this as well (it supports streaming, so that's not a problem). I did not have it when I started, so I ended up having to build my own thing, but it is a cool and useful system!

If you are already writing your mapper and reducer in Python, I would consider using Dumbo, where such an operation is straightforward. The sequence of your MapReduce jobs, your mappers, reducers, etc. are all in one Python script that can be run from the command line.
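
A minimal sketch of what that can look like, assuming Dumbo's multi-iteration API (job.additer and dumbo.main, per its tutorial; the mapper and reducer bodies here are illustrative):

import dumbo


def mapper_count(key, value):
    # First pass: emit each word in the line with a count of 1.
    for word in value.split():
        yield word, 1


def reducer_count(word, counts):
    yield word, sum(counts)


def mapper_invert(word, total):
    # Second pass: funnel all (count, word) pairs to one key.
    yield None, (total, word)


def reducer_max(key, pairs):
    total, word = max(pairs)
    yield word, total


def runner(job):
    # Each additer() call adds one map/reduce iteration; Dumbo chains them
    # so the output of the first iteration feeds the second.
    job.additer(mapper_count, reducer_count)
    job.additer(mapper_invert, reducer_max)


if __name__ == "__main__":
    dumbo.main(runner)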
