What actions does job.commit perform in AWS Glue?

Every job script should end with job.commit(), but what exactly does this function do? Is it safe to execute any Python statements after job.commit() is called?

P.S. I have not found any description of it in PyGlue.zip, the archive with the AWS Python source code :(
As of today, the only case where the Job object is useful is when using Job Bookmarks. When you read files from Amazon S3 (the only supported source for bookmarks so far) and call job.commit, the timestamp and the paths read so far are stored internally, so that if for some reason you attempt to read that path again, you will only get back unread (new) files.

In this code sample, I try to read and process two different paths separately, and commit after each path is processed. If for some reason I stop my job, the same files won't be processed again.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)

# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']

# Read each path individually, operate on it and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)
        # Commit file read to Job Bookmark
        job.commit()
    except Exception:
        pass  # Something failed; this path is not committed and will be retried
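Conceptually, a job bookmark behaves like persisted per-transformation_ctx state recording what has already been read. The sketch below is a hypothetical pure-Python model of that filtering behaviour, not Glue's actual implementation (the BookmarkStore class and its method names are invented for illustration):

import sys  # placeholder import to keep the sketch self-contained

# Hypothetical model of job-bookmark filtering. AWS Glue keeps this state
# server-side, keyed by transformation_ctx, and persists it across JobRuns.
class BookmarkStore:
    def __init__(self):
        self._seen = {}  # transformation_ctx -> set of already-committed files

    def filter_new(self, ctx, files):
        """Return only files not yet committed for this context."""
        seen = self._seen.setdefault(ctx, set())
        return [f for f in files if f not in seen]

    def commit(self, ctx, files):
        """Record files as processed, like job.commit() does for reads so far."""
        self._seen.setdefault(ctx, set()).update(files)


store = BookmarkStore()
first_run = store.filter_new('apples', ['a.json', 'b.json'])
store.commit('apples', first_run)
# On a later "run", only files added since the commit come back.
second_run = store.filter_new('apples', ['a.json', 'b.json', 'c.json'])

Here first_run returns both files, while second_run returns only 'c.json', mirroring how a committed bookmark makes re-reads skip already-processed objects.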
Calling the commit method on a Job object only works if you have Job Bookmarks enabled, and the stored references are kept from JobRun to JobRun until you reset or pause your Job Bookmark. It is completely safe to execute more Python statements after a job.commit, and as shown in the previous code sample, committing multiple times is also valid.

Hope this helps.
According to the AWS support team, commit should not be called more than once. Here is the exact response I got from them:
The method job.commit() can be called multiple times and it would not throw any error
as well. However, if job.commit() would be called multiple times in a Glue script
then job bookmark will be updated only once in a single job run that would be after
the first time when job.commit() gets called and the other calls for job.commit()
would be ignored by the bookmark. Hence, job bookmark may get stuck in a loop and
would not able to work well with multiple job.commit(). Thus, I would recommend you
to use job.commit() once in the Glue script.
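Following that recommendation, the loop from the earlier answer can be restructured so each path still gets its own transformation_ctx (which is what keys the bookmark state) while job.commit() runs exactly once, after all paths have been processed. This is only a sketch under the same assumptions as the original sample (do_something is a placeholder for your own processing); it cannot run outside a Glue job environment:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']

for s3_path in paths:
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='json',
        transformation_ctx="path={}".format(s3_path))  # distinct ctx per path
    do_something(dynamic_frame)  # placeholder for your processing

# Single commit: records bookmark state for every read above at once.
job.commit()

The trade-off versus committing per path: if the job fails midway, nothing is bookmarked and all paths are re-read on the next run.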
To expand on @yspotts's answer: it is possible to execute more than one job.commit() in an AWS Glue job script, although the bookmark will be updated only once, as they mentioned. However, it is also safe to call job.init() more than once. In this case, the bookmarks will be updated correctly with the S3 files processed since the previous commit.

In the init() function, there is an "initialised" marker that gets updated and set to true. Then, in the commit() function this marker is checked: if true, it performs the steps to commit the bookmark and resets the "initialised" marker; if false, it does nothing.
So, the only thing to change from @hoaxz's answer would be to call job.init() in every iteration of the for loop:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)

# Init my job
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']

# Read each path individually, operate on it and commit
for s3_path in paths:
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='json',
        transformation_ctx="path={}".format(s3_path))
    do_something(dynamic_frame)
    # Commit file read to Job Bookmark
    job.commit()