
Iterative MapReduce

I've written a simple k-means clustering code for Hadoop (two separate programs: a mapper and a reducer). The code works on a small dataset of 2D points on my local machine. It's written in Python and I plan to use the Streaming API.
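For concreteness, a minimal sketch of a k-means mapper and reducer of this kind (assuming the current centres are shipped to every mapper as a side file named centres.txt and the points arrive as "x,y" lines; both are assumptions, not necessarily how my actual code works):

```python
#!/usr/bin/env python
# mapper.py -- minimal sketch. Assumes centres.txt (shipped with -file) holds one
# "x,y" centre per line and stdin holds one "x,y" point per line.
import sys

def load_centres(path):
    with open(path) as f:
        return [tuple(map(float, line.split(','))) for line in f if line.strip()]

centres = load_centres('centres.txt')

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    x, y = map(float, line.split(','))
    # Assign the point to its nearest centre (squared Euclidean distance).
    idx = min(range(len(centres)),
              key=lambda i: (x - centres[i][0]) ** 2 + (y - centres[i][1]) ** 2)
    print('%d\t%f,%f' % (idx, x, y))
```

```python
#!/usr/bin/env python
# reducer.py -- minimal sketch. Averages all points assigned to each centre and emits
# the new centre as a plain "x,y" line, so the output can be reused as centres.txt.
import sys

current_key, sx, sy, n = None, 0.0, 0.0, 0

def emit():
    if n:
        print('%f,%f' % (sx / n, sy / n))

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    x, y = map(float, value.split(','))
    if key != current_key:
        emit()
        current_key, sx, sy, n = key, 0.0, 0.0, 0
    sx += x
    sy += y
    n += 1

emit()
```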

I would like suggestions on how best to run this program on Hadoop.

After each run of the mapper and reducer, new centres are generated. These centres are the input for the next iteration.

From what I can see, each MapReduce iteration will have to be a separate MapReduce job. And it looks like I'll have to write another script (Python/bash) to extract the new centres from HDFS after each reduce phase and feed them back to the mapper.
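Something like the following sketch is what I have in mind; it assumes the mapper/reducer above, a hypothetical Streaming-jar path and HDFS input directory, an initial centres.txt already present locally, and a fixed iteration count instead of a convergence check:

```python
#!/usr/bin/env python
# driver.py -- rough sketch of the "extra script", not a tested tool. It runs one
# Streaming job per iteration, then pulls the new centres out of HDFS with
# "hadoop fs -getmerge" and ships them back to the mappers via -file.
import os
import subprocess

STREAMING_JAR = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar'  # path varies by install
POINTS = '/user/me/points'   # hypothetical HDFS directory of "x,y" point lines
ITERATIONS = 10              # fixed count; a real driver would also test for convergence

for i in range(ITERATIONS):
    out_dir = '/user/me/centres_%d' % i
    subprocess.check_call([
        'hadoop', 'jar', STREAMING_JAR,
        '-input', POINTS,
        '-output', out_dir,
        '-mapper', 'mapper.py',
        '-reducer', 'reducer.py',
        '-file', 'mapper.py',
        '-file', 'reducer.py',
        '-file', 'centres.txt',   # current centres, shipped to every mapper
    ])
    # Overwrite the local centres file with the reducer output for the next iteration.
    if os.path.exists('centres.txt'):
        os.remove('centres.txt')
    subprocess.check_call(['hadoop', 'fs', '-getmerge', out_dir, 'centres.txt'])
```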

Is there any easier, less messy way to do this? And if the cluster happens to use a fair scheduler, will it take very long for this computation to complete?

You needn't write another job. You can put the same job in a loop (a while loop) and keep changing its parameters, so that when the mapper and reducer finish processing, control returns to the loop, a new configuration is created, and the input for the next run is automatically the output of the previous phase.

The Java interface of Hadoop has the concept of chaining several jobs: http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining

However, since you're using Hadoop Streaming, you don't have any built-in support for chaining jobs and managing workflows.

You should check out Oozie, which should do the job for you: http://yahoo.github.com/oozie/

Here are a few ways to do it: github.com/bwhite/hadoop_vision/tree/master/kmeans

Also check this out (it has Oozie support): http://bwhite.github.com/hadoopy/

It feels funny to be answering my own question. I used Pig 0.9 (not released yet, but available in the trunk). It supports modularity and flow control by allowing Pig statements to be embedded inside scripting languages such as Python.

So, I wrote a main Python script that contained a loop, and inside the loop it called my Pig scripts. The Pig scripts in turn called the UDFs. So I had to write three different programs, but it worked out fine.
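A minimal sketch of how such an outer loop can be written with Pig 0.9's embedding API (the script is run with `pig driver.py` and executed under Jython); the Pig Latin, file names and parameters below are hypothetical placeholders rather than my actual scripts:

```python
#!/usr/bin/env python
# Runs under "pig kmeans_driver.py" (Jython); a sketch only.
from org.apache.pig.scripting import Pig

# Hypothetical Pig Latin: assign points to centres and average them via Python UDFs.
P = Pig.compile("""
    register 'kmeans_udfs.py' using jython as udfs;
    points  = load '$input' as (x:double, y:double);
    grouped = group points by udfs.nearest_centre(x, y, '$centres');
    centres = foreach grouped generate udfs.mean(points);
    store centres into '$output';
""")

centres = '0,0;5,5'   # hypothetical serialized initial centres
for i in range(10):
    output = 'centres_%d' % i
    stats = P.bind({'input': 'points', 'centres': centres, 'output': output}).runSingle()
    if not stats.isSuccessful():
        raise RuntimeError('iteration %d failed' % i)
    # Read the new centres back from the job output before the next pass.
    # centres = ...   # omitted here; see the mail-archive example linked below
```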

You can check the example here: http://www.mail-archive.com/user@pig.apache.org/msg00672.html

For the record, my UDFs were also written in Python, using the new feature that allows UDFs to be written in scripting languages.
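For illustration, a hypothetical sketch of the kind of Jython UDF file the embedded script above registers; the function names, schemas and bodies are placeholders, and the @outputSchema decorator is the one Pig provides for Jython UDFs:

```python
# kmeans_udfs.py -- hypothetical Jython UDFs, registered from Pig with
# "register 'kmeans_udfs.py' using jython as udfs;"

@outputSchema('centre:int')
def nearest_centre(x, y, centres):
    # centres arrives as a "x,y;x,y;..." string parameter; return the index of the closest one.
    pts = [tuple(map(float, c.split(','))) for c in centres.split(';')]
    return min(range(len(pts)),
               key=lambda i: (x - pts[i][0]) ** 2 + (y - pts[i][1]) ** 2)

@outputSchema('centre:tuple(x:double, y:double)')
def mean(bag):
    # bag holds the (x, y) tuples assigned to one centre; average them.
    sx = sy = 0.0
    n = 0
    for x, y in bag:
        sx += x
        sy += y
        n += 1
    return (sx / n, sy / n)
```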
