
How to reduce the time a Glue ETL job (Spark) takes to actually start executing?

I want to start a Glue ETL job. The execution time itself is acceptable, but the time Glue takes to actually start executing the job is too long.

I looked through various documentation and answers, but none of them offered a solution. Some explained the behavior as a cold start, but gave no fix.

I expect the job to be up as soon as possible, but it sometimes takes around 10 minutes to start a job that then executes in 2 minutes.

Unfortunately, this is not possible at the moment. Glue uses EMR under the hood, and it needs some time to spin up a new cluster with the desired number of executors. As far as I know, they keep a pool of spare EMR clusters with some of the most common DPU configurations, so if you are lucky your job gets one and starts immediately; otherwise it has to wait.
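If it helps to quantify how much of a run is startup wait rather than actual work, here is a rough sketch using boto3. It assumes that the `JobRun` record returned by `get_job_run` carries `StartedOn`, `CompletedOn`, and `ExecutionTime` (in seconds), and that the difference between wall-clock duration and `ExecutionTime` is a fair proxy for the cold-start delay. That interpretation is my assumption, not something Glue documents as a "startup time" metric. The helper names and the sample numbers are made up for illustration.

```python
from datetime import datetime


def startup_overhead_seconds(job_run: dict) -> float:
    """Estimate the seconds a finished run spent waiting before doing work.

    Assumption: wall-clock time (CompletedOn - StartedOn) minus the reported
    ExecutionTime approximates the cold-start wait.
    """
    wall = (job_run["CompletedOn"] - job_run["StartedOn"]).total_seconds()
    return max(wall - job_run["ExecutionTime"], 0.0)


def fetch_job_run(job_name: str, run_id: str) -> dict:
    """Fetch one run's metadata from Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without it

    glue = boto3.client("glue")
    return glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]


# Made-up record mirroring the question: 12 minutes end to end, 2 minutes of work.
example = {
    "StartedOn": datetime(2020, 1, 1, 12, 0, 0),
    "CompletedOn": datetime(2020, 1, 1, 12, 12, 0),
    "ExecutionTime": 120,
}
print(startup_overhead_seconds(example))  # 600.0
```

For a real run you would pass the result of `fetch_job_run("my-job", run_id)` instead of the hand-written dictionary, where `"my-job"` and `run_id` are placeholders for your own job name and run ID.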

