
Submit jobs to EMR cluster using MRJob

MRJob waits until each job completes before giving back control to the user. I broke down a large EMR step into smaller ones and would like to submit them all in one shot.

The docs talk about programmatically submitting tasks, but the sample code also waits for job completion (since it calls runner.run(), which blocks until the job is complete).

Also, EMR has a limit of 256 active jobs. How do we go about filling up those 256 slots, rather than looping and getting each job's output on the attached console?

After days of trying, the following is the best I could come up with.

My initial attempt, once I realised that a submitted job doesn't get culled when the terminal is detached, was to submit and kill jobs from a bash script. However, that didn't work very well because AWS throttles calls to EMR, so some of the jobs were killed before they had actually been submitted.

Current Best Solution

from jobs import MyMRJob
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)-15s %(levelname)-8s %(message)s',
)
log = logging.getLogger('submitjobs')

def main():
    cluster_id = "x-MXMXMX"
    log.info('Cluster: %s', cluster_id)
    for i in range(10):
        n = '%04d' % i
        log.info('Adding job: %s', n)
        mr_job = MyMRJob(args=[
            '-r', 'emr',
            '--conf-path', 'mrjob.conf',
            '--no-output',
            '--output-dir', 's3://mybucket/mrjob/%s' % n,
            '--cluster-id', cluster_id,
            'input/file.%s' % n
        ])
        # build the runner and launch inside the loop, so every job
        # is submitted rather than only the last one
        runner = mr_job.make_runner()
        # the following is the secret sauce, submits the job and returns
        # it is a private method though, so may be changed without notice
        runner._launch()

if __name__ == '__main__':
    main()
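
Since AWS throttles calls to EMR, a tight submission loop may see some calls fail. One possible way to cope is to retry each submission with exponential backoff. The following is only a minimal sketch: `submit_with_backoff` and its parameters are hypothetical names, not part of mrjob, and catching a bare `Exception` is an assumption made here because the exact exception raised on throttling is not stated above.

```python
import random
import time


def submit_with_backoff(submit, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit() and retry with exponential backoff if it raises.

    submit is any zero-argument callable (hypothetical helper, not an
    mrjob API). The sleep function is injectable so the helper can be
    exercised without actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except Exception:  # assumption: throttling surfaces as an exception
            if attempt == max_attempts - 1:
                # out of attempts, let the caller see the error
                raise
            # wait base_delay * 2^attempt plus some random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

In the loop above, this could presumably be used as `submit_with_backoff(runner._launch)` in place of the direct `runner._launch()` call, with the same caveat that `_launch` is a private method.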
