
What actions does job.commit perform in AWS Glue?

Every job script should end with job.commit(), but what exactly does this function do?

  1. Is it just a job-end marker or not?
  2. Can it be called twice during one job (and if so, in what cases)?
  3. Is it safe to execute any Python statement after job.commit() is called?

P.S. I have not found any description of it in PyGlue.zip, which contains the AWS Python source code :(

As of today, the only case where the Job object is useful is when using Job Bookmarks. When you read files from Amazon S3 (the only source supported for bookmarks so far) and call job.commit, the time and the paths read so far are stored internally, so that if for some reason you attempt to read that path again, you will only get back unread (new) files.

In this code sample, I try to read and process two different paths separately, and commit after each path is processed. If for some reason I stop my job, the same files won't be processed again.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on it and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)
        # Commit the file read to the Job Bookmark
        job.commit()
    except Exception:
        # Something failed; the bookmark for this path is not committed
        pass

Calling the commit method on a Job object only works if you have Job Bookmarks enabled, and the stored references are kept from JobRun to JobRun until you reset or pause your Job Bookmark. It is completely safe to execute more Python statements after a Job.commit, and as shown in the previous code sample, committing multiple times is also valid.
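As a side note, if you ever need to reset the bookmark so a path is re-read from scratch, that can be done outside the script, for example with boto3. A minimal sketch, where 'my-glue-job' is a placeholder job name:

import boto3

glue = boto3.client('glue')
# Remove the stored bookmark state; the next run will re-read all files
glue.reset_job_bookmark(JobName='my-glue-job')

Pausing, by contrast, is done through the --job-bookmark-option job argument (value job-bookmark-pause) rather than an API call.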

Hope this helps.

According to the AWS support team, commit should not be called more than once. Here is the exact response I got from them:

The method job.commit() can be called multiple times and it would not throw any error 
as well. However, if job.commit() would be called multiple times in a Glue script 
then job bookmark will be updated only once in a single job run that would be after 
the first time when job.commit() gets called and the other calls for job.commit() 
would be ignored by the bookmark. Hence, job bookmark may get stuck in a loop and 
would not able to work well with multiple job.commit(). Thus, I would recommend you 
to use job.commit() once in the Glue script.

To expand on @yspotts' answer: it is possible to execute more than one job.commit() in an AWS Glue job script, although the bookmark will be updated only once, as they mentioned. However, it is also safe to call job.init() more than once. In that case, the bookmarks will be updated correctly with the S3 files processed since the previous commit.

In the init() function, there is an "initialised" marker that gets updated and set to true. Then, in the commit() function, this marker is checked: if true, it performs the steps to commit the bookmark and resets the "initialised" marker; if false, it does nothing.
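A minimal sketch of that behaviour (an illustration of the logic described above, not the actual PyGlue source; the elided internals are marked with comments):

class Job:
    def __init__(self, glue_context):
        self.glue_context = glue_context
        self.initialised = False

    def init(self, job_name, args):
        # ... set up bookmark state for this run ...
        self.initialised = True

    def commit(self):
        if self.initialised:
            # ... persist the bookmark state ...
            self.initialised = False  # reset the marker
        # if not initialised, do nothing

This is why alternating init() and commit() lets each commit take effect, while repeated commits after a single init are ignored.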

So, the only thing to change from @hoaxz's answer would be to call job.init() in every iteration of the for loop:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Create my job
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on it and commit
for s3_path in paths:
    # Re-init the job so the next commit updates the bookmark again
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='json',
        transformation_ctx="path={}".format(s3_path))
    do_something(dynamic_frame)
    # Commit the file read to the Job Bookmark
    job.commit()
