On-demand slave generation in a Hadoop cluster on EC2

Original post, 2010-08-15 10:09:15. Tags: amazon-ec2, hadoop, mapreduce

I am planning to use Hadoop on EC2. Since we pay for each instance we use, a fixed number of instances is a poor fit: we would rather run only as many as the job actually requires.

In our application, many jobs run concurrently, and we do not always know in advance how many slaves we need. Is it possible to start the Hadoop cluster with a minimum number of slaves and later adjust the number available based on demand?

i.e. create/destroy slaves on demand.

Sub-question: can a Hadoop cluster manage multiple jobs concurrently?

Thanks

3 answers

The default scheduler in Hadoop is a simple FIFO one. You can look into the FairScheduler, which assigns a share of the cluster to each running job and has extensive configuration options to control those shares.
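To give a rough feel for what "a share of the cluster for each running job" means, here is a minimal Python sketch of dividing task slots evenly among jobs without giving any job more than it asks for. This is only an illustration of the idea, not the FairScheduler's actual algorithm; the function name `fair_shares` and the job names are made up for the example.

```python
def fair_shares(total_slots, demands):
    """Split total_slots among jobs as evenly as possible.

    demands maps a job name to the number of slots it could use.
    No job receives more than it demands; leftover slots are
    redistributed among the jobs that still want more.
    """
    shares = {job: 0 for job in demands}
    remaining = total_slots
    active = {job for job, wanted in demands.items() if wanted > 0}

    while remaining > 0 and active:
        # Equal portion for every job that still wants slots (at least 1).
        per_job = max(remaining // len(active), 1)
        for job in sorted(active):
            if remaining == 0:
                break
            grant = min(per_job, demands[job] - shares[job], remaining)
            shares[job] += grant
            remaining -= grant
        # Jobs whose demand is satisfied drop out of the next round.
        active = {job for job in active if shares[job] < demands[job]}

    return shares


if __name__ == "__main__":
    # A small job and two large jobs competing for 10 slots.
    print(fair_shares(10, {"small": 2, "big1": 20, "big2": 20}))
```

With a FIFO scheduler, by contrast, the first large job submitted could hold every slot until it finishes.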

As far as EC2 is concerned, you can start off with some number of nodes and, once you see that too many tasks are queued and all the slots in the cluster are occupied, add more. You simply start up an instance and launch a task tracker on it, which will register with the jobtracker.

However, you will need your own system to manage the startup and shutdown of these nodes.
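The decision part of such a system could look roughly like the Python sketch below. It only works out how many slaves to add when tasks are queued and every slot is busy. All names and parameters (`slaves_to_add`, `slots_per_slave`, `max_slaves`) are assumptions for illustration. Reading the queue length from the jobtracker, launching the EC2 instances, and starting the task tracker on them are deliberately left out.

```python
def slaves_to_add(pending_tasks, free_slots, slots_per_slave,
                  current_slaves, max_slaves):
    """Return how many extra slaves to start, or 0 if none are needed.

    Scale up only when tasks are waiting AND no slot is free, which is
    the condition described above. The result is capped so the cluster
    never grows beyond max_slaves.
    """
    if pending_tasks <= 0 or free_slots > 0:
        return 0
    # Ceiling division: slaves needed to hold all pending tasks at once.
    needed = -(-pending_tasks // slots_per_slave)
    headroom = max_slaves - current_slaves
    return max(0, min(needed, headroom))


if __name__ == "__main__":
    # 10 queued tasks, no free slots, 4 slots per slave, 3 of 10 slaves running.
    print(slaves_to_add(10, 0, 4, 3, 10))
```

A real controller would call something like this periodically and would also need a matching rule for shutting down idle slaves.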

This seems promising: http://hadoop.apache.org/common/docs/r0.17.1/hod.html

Just want to let you know that we are doing some work on this in Apache Whirr. We are tracking progress in WHIRR-214. Vote or join development. :)