
Is it possible to create an AWS Glue workflow that runs the same job with multiple parameters?

It looks like I cannot add the same job to a workflow more than once, but I need to run the same Glue job multiple times with different parameters. Is this not possible?

Yes. You can set run properties on your workflow and then read them inside your job, as shown below.

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_client = boto3.client("glue")

# WORKFLOW_NAME and WORKFLOW_RUN_ID are available when the job is started by a workflow.
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']

# Fetch the run properties (a string-to-string map) for this workflow run.
workflow_params = glue_client.get_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id
)["RunProperties"]

# The property names are whatever was set on the workflow.
target_database = workflow_params['target_database']
target_s3_location = workflow_params['target_s3_location']

If the properties need to change for every run, use the put_workflow_run_properties API call. Make the call before the actual job starts its execution so that the job sees the updated parameters. The call can be issued from a Lambda function, or from an earlier step of the same workflow that runs before the actual job.
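As a minimal sketch of that call, the helper below updates the run properties of an existing workflow run. The function name set_run_properties and the example property values are hypothetical; the client is passed in as a parameter and is assumed to be a boto3 Glue client, for example boto3.client("glue").

```python
def set_run_properties(glue_client, workflow_name, run_id, properties):
    """Set run properties on an existing workflow run.

    glue_client is assumed to be a boto3 Glue client.
    The helper name and its arguments are illustrative, not part of any AWS API.
    """
    # Run properties are a string-to-string map, so coerce keys and values.
    string_props = {str(key): str(value) for key, value in properties.items()}
    glue_client.put_workflow_run_properties(
        Name=workflow_name,
        RunId=run_id,
        RunProperties=string_props,
    )
    return string_props


# Example usage (requires AWS credentials; values are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   set_run_properties(glue, "my-workflow", run_id,
#                      {"target_database": "db_a", "target_s3_location": "s3://bucket/a/"})
```

The job shown earlier would then pick up the new values through get_workflow_run_properties, provided the update happens before the job reads them.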

The answer is... sort of. You can indeed start the same job with different parameters. However, in a workflow, assuming you are waiting for the jobs to finish, you cannot identify a job by anything other than its name. So if you run job j once with --a and once with --b and then run crawlers afterward, the predicate for the crawler only lets you specify that you are waiting for job j, so it cannot distinguish between the two instances.
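To make the limitation concrete, here is a rough sketch of the request for a conditional trigger that starts a crawler after a job succeeds. The helper name build_crawler_trigger and the trigger naming scheme are hypothetical; the dictionary keys follow the shape that boto3's create_trigger accepts. Note that the condition refers to the job only through JobName and State.

```python
def build_crawler_trigger(workflow_name, job_name, crawler_name):
    """Build keyword arguments for glue_client.create_trigger.

    The helper and the trigger naming scheme are illustrative assumptions.
    """
    return {
        "Name": f"{workflow_name}-after-{job_name}",
        "WorkflowName": workflow_name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {
                    # The condition names the job, not a particular
                    # invocation of it with particular arguments.
                    "LogicalOperator": "EQUALS",
                    "JobName": job_name,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"CrawlerName": crawler_name}],
    }


# Example usage (requires AWS credentials; names are placeholders):
#   import boto3
#   boto3.client("glue").create_trigger(**build_crawler_trigger("wf", "j", "my-crawler"))
```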

However, keep in mind that a job is just a container for properties, including the underlying Glue script(s), so you can create as many jobs as needed to call the same script with different parameters. Because the jobs have different names, this strategy works fine.
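One way to sketch this strategy: generate one job definition per parameter set, all pointing at the same script. The helper name build_job_definitions, the job naming scheme, the --variant argument, and the role and script values are all hypothetical; the dictionary keys follow the shape that boto3's create_job accepts.

```python
def build_job_definitions(base_name, role_arn, script_location, variants):
    """Return one create_job keyword-argument dict per variant.

    variants maps a name suffix to that job's default arguments.
    The helper and naming scheme are illustrative assumptions.
    """
    definitions = []
    for suffix, arguments in variants.items():
        definitions.append(
            {
                "Name": f"{base_name}-{suffix}",  # distinct name per variant
                "Role": role_arn,
                "Command": {
                    "Name": "glueetl",
                    "ScriptLocation": script_location,  # same script for all
                    "PythonVersion": "3",
                },
                "DefaultArguments": dict(arguments),
            }
        )
    return definitions


# Example usage (requires AWS credentials; values are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   for definition in build_job_definitions(
#       "j", role_arn, "s3://my-bucket/scripts/j.py",
#       {"a": {"--variant": "a"}, "b": {"--variant": "b"}},
#   ):
#       glue.create_job(**definition)
```

Each resulting job can then be referenced by its own name in a workflow trigger condition.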

But still, annoying!

