
On-demand slave generation in a Hadoop cluster on EC2

I am planning to use Hadoop on EC2. Since we pay per instance used, it is not ideal to keep more instances running than a job actually requires.

In our application, many jobs are executed concurrently and we do not always know the slave requirement in advance. Is it possible to start the Hadoop cluster with a minimum number of slaves and later manage their availability based on demand?

i.e., create/destroy slaves on demand.

Sub-question: can a Hadoop cluster manage multiple jobs concurrently?

Thanks

The default scheduler in Hadoop is a simple FIFO one. You can look into using the FairScheduler, which assigns each running job a share of the cluster and has extensive configuration options to control those shares.
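As a minimal sketch of how a job ends up in a share-controlled pool: assuming the FairScheduler has been enabled on the JobTracker (mapred.jobtracker.taskScheduler set to org.apache.hadoop.mapred.FairScheduler in mapred-site.xml) and an "adhoc" pool has been defined in the allocation file, a client could tag its job like this. The class name, pool name, and paths are illustrative assumptions, not something from the original answer:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PoolSubmitExample {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(PoolSubmitExample.class);
        job.setJobName("adhoc-report");

        // Place this job in the (hypothetical) "adhoc" pool so the
        // FairScheduler gives it the share of task slots configured for
        // that pool in the allocation file.
        job.set("mapred.fairscheduler.pool", "adhoc");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Default identity map/reduce, just to keep the sketch runnable;
        // a real job would set its own mapper and reducer classes here.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        JobClient.runJob(job);
    }
}
```

Jobs submitted without a pool setting fall into the scheduler's default pool, so concurrent jobs still share the cluster rather than queueing strictly FIFO.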

As far as EC2 is concerned, you can easily start off with some number of nodes and then, once you see that there are too many tasks in the queue and all the slots in the cluster are occupied, add more of them. You simply have to start up an instance and launch a TaskTracker on it, which will register with the JobTracker.

However, you will have to have your own system that manages the startup and shutdown of these nodes.
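Here is a minimal sketch of what such a system might look like, using the AWS SDK for Java. It assumes a pre-built slave AMI that already has Hadoop installed and configured to point at the master; the AMI ID, instance type, and Hadoop install path are placeholders:

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.Base64;

public class SlaveScaler {
    private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

    // User-data script run at boot: starts a TaskTracker that registers
    // with the JobTracker. Assumes Hadoop is installed at this path on
    // the AMI and mapred-site.xml already names the master.
    private static final String USER_DATA =
        "#!/bin/bash\n" +
        "/usr/local/hadoop/bin/hadoop-daemon.sh start tasktracker\n";

    /** Launch one more slave when the task queue is backed up. */
    public String addSlave(String amiId, String instanceType) {
        RunInstancesRequest req = new RunInstancesRequest()
            .withImageId(amiId)            // hypothetical Hadoop slave AMI
            .withInstanceType(instanceType)
            .withMinCount(1)
            .withMaxCount(1)
            .withUserData(Base64.getEncoder().encodeToString(USER_DATA.getBytes()));
        RunInstancesResult res = ec2.runInstances(req);
        return res.getReservation().getInstances().get(0).getInstanceId();
    }

    /** Terminate an idle slave once its tasks have drained. */
    public void removeSlave(String instanceId) {
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceId));
    }
}
```

A monitoring loop would call addSlave when the JobTracker reports no free map/reduce slots and removeSlave when a node has been idle for a while; how aggressively to scale down is a policy decision left to you.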

Just want to let you know that we are doing some work on this in Apache Whirr. We are tracking progress in WHIRR-214. Vote or join the development. :)
