简体   繁体   中英

How to autoscale EMR task instances

I am using EMR with task instance groups as spot instances. I want to maintain minimum number of task instances always. Means, whenever EMR terminates task instances because of bid price goes higher than what we set, my application should launch another task instance with little higher bid price.

My research-

  1. Use Cloudwatch to inform when it breaches threshhold, and auto-scale task instances. But as per study, there is no concept of auto-scaling in EMR.
  2. Use Cloudwatch, and notify SQS when threshhold breahes, and there is one service who is always consuming and expand task instances.

Questions

  1. Is there any auto-scaling present in EMR ? If that is available, then my efforts will reduce to just set threshhold, and corresponding expansion task instances action.
  2. If you have any other approach to solve this problem, please suggest.

How Spot Prices Work

When an Amazon EC2 instance is launched with a spot price (including when launched from Amazon EMR), the instance will start if the current spot price is below the provided bid price . If the spot price rises above the bid price, the instance is terminated. Instances are only charged the current spot price .

Therefore, the logic of launching a new spot instance with a "little higher bid price" is not necessary. The instance will always be charged the current spot price , so simply bid as high as you are willing to pay for a spot instance. You will either pay less than the spot price (great!) or your instance will be terminated because the price has gone higher than you are willing to pay (in which case you don't want to pay a "little higher" for the instance).

If you wish to "maintain minimum number of task instances" at all times, then either pay the normal EMR charge (which means the instances won't be terminated) or bid a particularly large price for the spot instances, such as 2 x the normal price . Yes, you might occasionally pay more for instances, but on average your price will be quite low.

If you wish to be particularly sneaky, you could bid up to the normal price for the EC2 instances then, if instances are terminated, launch more task nodes without using spot pricing. That way, your instances won't be terminated and you won't pay more than the normal EC2 price. However, you would have to terminate and replace those instances when the spot price drops , otherwise you are paying too much. That's why it might be better just to provide a high bid price on your spot instances.

Bottom line: Use spot pricing, but bid a high price. You'll get a good price most of the time.

AWS EMR does not have a autoscaling option available. But you can use a work around and integrate Autoscaling using AWS SQS. This is a rough picture what you can integrate.

  1. Launch you EMR cluster using spot instance.
  2. Set up a SQS Queue and create 3 triggers one for CPU threshold , second for EC2 spot instance termination notice and third for changing the spot instance bid prices.
  3. So if the CPU usage increases SQS will trigger an event to launch a new instance to cluster, if there is spot instance termination notice SQS will trigger to launch another instance to balance the load and send a event to change the bid price to launch another spot instance. (This is just rough sketch but I guess you will understand the logic.

This is guide to AWS SQS Autoscaling.

https://docs.aws.amazon.com/autoscaling/latest/userguide/as-using-sqs-queue.html

As has been correctly pointed, the EMR API provides all necessary ingredients to 1) collect monitoring data, and 2) programmatically scale the cluster up and down.

Basically, there are two main options to implement autoscaling for EMR clusters:

  1. Autoscaling Loop: A process that is running on a server and continuously monitors the cluster for its current load. Performance metrics (memory, CPU, I/O, etc) can be collected in regular intervals and stored in a database. Autoscaling rules are evaluated against the performance metrics, and the cluster's task nodes are scaled up or down if required.
  2. Event-Based Autoscaling: Using CloudWatch metrics (eg, metrics for EMR or EC2 ), you can programmatically define triggers that are fired under certain conditions (for instance, add nodes if average CPUUtilization of all nodes exceeds 80%).

Both options have their pros and cons. The main advantage of option 2 is that it is a server-less approach (does not require to run your own server). Option 1, on the other hand, does require a server, but therefore comes with more control to customize the logic of your scaling rules. Also, it allows to keep searchable records of the history of the scaling decisions.

You could take a look at Themis , an EMR autoscaling framework developed at Atlassian. Themis implements the autoscaling loop as discussed in option 1 above. Current features include proactive as well as reactive autoscaling, support for spot/on-demand task nodes, it comes with a Web UI, and the tool is very easy to configure.

I have had a similar problem, and I wanted to share one possible alternative. I have written a Java tool to dynamically resize an EMR cluster during the processing. It might help you. Check it out at:

http://www.lopakalogic.com/articles/hadoop-articles/dynamically-resize-emr/

The source code is available on Github

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM