I'm running a Java job that starts an AWS EMR cluster and runs steps on it. After I add a step to the cluster, I call the listSteps function to get the status of the steps and wait until they have all finished or failed.
I noticed that the listSteps function sometimes doesn't include the last step I added if I call it right after adding it. That makes me think all the steps are done when the latest step hasn't even started.
Is there a way to work around this behavior of listSteps? I'm using the "AmazonElasticMapReduceClient" class from the Amazon SDK.
I don't think there is a magic workaround for this kind of problem. Many AWS calls are asynchronous. For example, the call that launches an EC2 instance returns right away, and you then have to poll to see whether the instance is up yet. I think with a bit of design, it won't be much of an issue. I see several options:
When you create the cluster and add the job steps, you know how many job steps, and which job steps, you're adding to the cluster, so you can start a new thread and monitor the cluster until all of the steps show up (in pseudocode):
function createCluster(steps, callback):
    aws.runJobFlow(...)
    on new thread:
        while(steps != aws.listSteps(...)):
            sleep()
        callback()
Then all you have to do in your status check (to see whether the job has finished) is call listSteps() and check the status. That's probably the simplest solution to the problem.
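Here is a minimal Java sketch of that polling loop, under a few assumptions. It waits for a known set of step IDs (as far as I know, the addJobFlowSteps call returns the IDs of the steps it added) and treats any step that is missing from the listing as "not finished yet", so a lagging listSteps result can't be mistaken for completion. The class name EmrStepWaiter, the StepStatus holder, and the Supplier standing in for the real listSteps call are all hypothetical; in real code the supplier would wrap AmazonElasticMapReduceClient.listSteps(...) and copy each step's ID and state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical helper: polls until every expected step is listed and finished. */
final class EmrStepWaiter {

    /** Minimal view of one step: its ID and its state string. */
    static final class StepStatus {
        final String id;
        final String state;

        StepStatus(String id, String state) {
            this.id = id;
            this.state = state;
        }
    }

    // Step states after which EMR will not run the step any further.
    private static final Set<String> TERMINAL = new HashSet<>(
            Arrays.asList("COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"));

    /**
     * Returns true once every ID in expectedIds appears in the listing with a
     * terminal state, or false if maxPolls is exhausted or the thread is interrupted.
     * listSteps is an assumed stand-in for the real SDK call.
     */
    static boolean awaitSteps(Set<String> expectedIds,
                              Supplier<List<StepStatus>> listSteps,
                              long pollMillis,
                              int maxPolls) {
        for (int attempt = 0; attempt < maxPolls; attempt++) {
            Set<String> finished = new HashSet<>();
            for (StepStatus step : listSteps.get()) {
                if (expectedIds.contains(step.id) && TERMINAL.contains(step.state)) {
                    finished.add(step.id);
                }
            }
            // A step that is not listed yet never lands in "finished",
            // so an incomplete listing keeps the loop going.
            if (finished.containsAll(expectedIds)) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return false;
    }
}
```

A terminal state is not the same as success, so the caller would still inspect the final states (for example FAILED) after awaitSteps returns true. The poll count is capped so that a step that never appears doesn't block forever.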
The other design option is to have a job step that notifies your software of progress or successful completion of the job. This design would be asynchronous and wouldn't require polling. For example, create a job step called notify, then schedule it after the steps whose progress you want reported.
Each notify step can call listSteps() on the job flow to see the result of the previous steps and update a database, send a message to a service, or update a cache with the progress of the job.
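As a rough illustration of what a notify step might compute before it reports anywhere, the sketch below turns the states returned by listSteps() into a short progress message. Everything here is hypothetical (the NotifyStepSketch class, the message format, the example cluster ID), and it assumes the state strings have already been extracted from the listing; how the message is delivered (database, queue, cache) is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical logic for a "notify" step: summarize step states. */
final class NotifyStepSketch {

    /** Counts how many steps are in each state, in first-seen order. */
    static Map<String, Integer> countByState(List<String> stepStates) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String state : stepStates) {
            counts.merge(state, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Builds the progress message a notify step could send.
     * The notify step itself would normally show up in the listing as RUNNING,
     * so it is counted in the total but not as completed.
     */
    static String progressMessage(String clusterId, List<String> stepStates) {
        Map<String, Integer> counts = countByState(stepStates);
        int completed = counts.getOrDefault("COMPLETED", 0);
        int failed = counts.getOrDefault("FAILED", 0);
        return clusterId + ": " + completed + "/" + stepStates.size()
                + " steps completed, " + failed + " failed";
    }
}
```

For example, progressMessage("j-EXAMPLE", Arrays.asList("COMPLETED", "COMPLETED", "FAILED", "PENDING")) yields "j-EXAMPLE: 2/4 steps completed, 1 failed".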