
How to use AWS SQS with an auto-scaling group of EC2 spot-instance workers and long-running jobs?

Let me explain...

I have 2 SQS queues that receive requests to run light and heavy report-generating jobs. (The two queues were separated so that the light jobs are not held up by the heavy ones.)

The jobs are consumed from the queues by an auto-scaling group that contains 3 workers.

The workers are on-demand EC2 instances. I would like to change the launch configuration and use spot instances.

The thing is that some heavy report-generating jobs may run for up to 4 hours. So if such a job runs on a spot instance worker that gets terminated, additional delays and/or complications will arise.

I would like to use spot instances as workers but also to have the assurance that the worker will not be terminated if there is a job running on it.

The approaches I came up with are the following:

1. Bid for the spot instances at the on-demand price of the instance [this still does not protect from termination, but it minimises the likelihood]

2. Use spot instances with a defined duration [e.g. 6 hours], but then I am still confined to 6 hours, after which the instance terminates. Plus, I don't know whether this can be set from the launch configuration

"I would like to use spot instances as workers but also to have the assurance that the worker will not be terminated if there is a job running on it."

You seem to understand that this is not the way spot instances work.

They are yours until the spot price rises above your bid.

The 6-hour option ("defined duration") might help in some cases, I suppose.

Two ideas spring to mind:

  • Try to estimate the length of each job in the "long" queue before it starts, then pick the cheapest option to run it.

  • Implement a transactional system for your jobs. For example, when a job is pulled off the SQS queue, record the time, instance ID and job ID in another persistent system, i.e. a database table. Then have something poll the table every few minutes and check that the instance is still there. When the job finally completes successfully, have the job runner remove its row from the table. If the poller notices that the instance has gone away, resubmit the job to the SQS queue.
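The second idea can be sketched roughly as follows. This is a minimal illustration, not a production implementation: `JobTracker`, `claim`, `complete` and `reap` are hypothetical names, and a plain dict stands in for the database table. The list of live instances and the resubmit function are passed in as arguments, so the sketch makes no assumptions about how you query EC2 or send to SQS.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class JobTracker:
    """In-memory stand-in (assumption) for the persistent table of in-flight jobs."""

    # job_id -> (instance_id, message_body, started_at)
    in_flight: Dict[str, Tuple[str, str, float]] = field(default_factory=dict)

    def claim(self, job_id: str, instance_id: str, body: str,
              now: Optional[float] = None) -> None:
        """Record that a worker instance pulled this job off the queue."""
        started_at = time.time() if now is None else now
        self.in_flight[job_id] = (instance_id, body, started_at)

    def complete(self, job_id: str) -> None:
        """Called by the job runner once the job has finished successfully."""
        self.in_flight.pop(job_id, None)

    def reap(self, live_instance_ids: Iterable[str],
             resubmit: Callable[[str], None]) -> List[str]:
        """Resubmit every tracked job whose worker instance no longer exists.

        Intended to be called by a poller every few minutes.
        Returns the ids of the jobs that were resubmitted.
        """
        live = set(live_instance_ids)
        orphaned = [job_id for job_id, (instance_id, _, _) in self.in_flight.items()
                    if instance_id not in live]
        for job_id in orphaned:
            _, body, _ = self.in_flight.pop(job_id)
            resubmit(body)  # put the original message back on the queue
        return orphaned


# Example: job-2 was running on an instance that has since disappeared.
tracker = JobTracker()
tracker.claim("job-1", "i-aaa", "report A")
tracker.claim("job-2", "i-bbb", "report B")
tracker.complete("job-1")

resent: List[str] = []
lost = tracker.reap(live_instance_ids=["i-aaa"], resubmit=resent.append)
```

In a real setup, `live_instance_ids` would presumably come from an EC2 instance listing and `resubmit` would wrap a send to the appropriate SQS queue. Note that a resubmitted job starts from scratch, so the report job needs to be safe to run more than once.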
