
How to access a run property of an AWS Glue workflow in a Glue job?

I have been working with AWS Glue workflows to orchestrate batch jobs. We need to pass a push-down predicate in order to limit the processing done by a batch job. When we run Glue jobs alone, we can pass push-down predicates as a command-line argument at run time (i.e. aws glue start-job-run --job-name foo.scala --arguments --arg1-text ${arg1}..). But when we use a Glue workflow to execute Glue jobs, it is a bit unclear how to do this.

When we orchestrate batch jobs using AWS Glue workflows, we can add run properties while creating the workflow.

  1. Can I use run properties to pass a push-down predicate to my Glue job?
  2. If yes, how can I define the value of the run property (the push-down predicate) at run time? I want to define it at run time because the predicate changes arbitrarily every day (i.e. run the Glue workflow for the past 10 days, past 20 days, past 2 days, etc.).

I tried:

aws glue start-workflow-run --name workflow-name | jq -r '.RunId'

aws glue put-workflow-run-properties --name workflow-name --run-id "ID" --run-properties --pushdownpredicate="some value"

I am able to see the run property I have passed using put-workflow-run-properties:

aws glue put-workflow-run-properties --name workflow-name --run-id "ID"

But I am not able to detect "pushdownpredicate" in my Glue job. Any idea how to access a workflow's run property in a Glue job?


If you are using Python as the programming language for your Glue job, you can issue a get_workflow_run_properties API call to retrieve the property and use it inside your Glue job.

response = client.get_workflow_run_properties(
    Name='string',
    RunId='string'
)

This returns the response below, which you can parse and use:

{
    'RunProperties': {
        'string': 'string'
    }
}
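As a rough illustration of parsing that response, the sketch below pulls one key out of RunProperties. The helper name read_run_property, the default argument, and the idea of passing the client in (so the function can be exercised without AWS credentials) are assumptions made for this example, not part of the original answer.

```python
from typing import Any, Optional


def read_run_property(
    glue_client: Any,
    workflow_name: str,
    run_id: str,
    key: str,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return one workflow run property, or `default` if it is not set.

    `glue_client` is expected to behave like boto3.client("glue");
    this helper itself is hypothetical.
    """
    response = glue_client.get_workflow_run_properties(
        Name=workflow_name, RunId=run_id
    )
    # RunProperties is a flat map of string keys to string values.
    return response.get("RunProperties", {}).get(key, default)
```

Inside a job, the returned string could then be handed to the reader, for example glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table", push_down_predicate=predicate), where my_db and my_table are placeholder names.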

If you are using Scala, you can use the equivalent AWS SDK call.

First, you need to make sure that the job is running from a workflow:

from typing import Dict, Optional

import boto3


def get_worfklow_params(args: Dict[str, str]) -> Optional[Dict[str, str]]:
    """
    get_worfklow_params is delegated to retrieve the WORKFLOW parameters
    """
    glue_client = boto3.client("glue")
    # Both keys are present only when the job was started by a workflow
    if "WORKFLOW_NAME" in args and "WORKFLOW_RUN_ID" in args:
        workflow_args = glue_client.get_workflow_run_properties(Name=args['WORKFLOW_NAME'], RunId=args['WORKFLOW_RUN_ID'])["RunProperties"]
        print("Found the following workflow args: \n{}".format(workflow_args))
        return workflow_args
    print("Unable to find run properties for this workflow!")
    return None

This method returns a map of the workflow's run properties, or None if the job was not started by a workflow.

Then you can use the following method to retrieve a given parameter:

from typing import Dict, Optional


def get_worfklow_param(args: Optional[Dict[str, str]], arg: str) -> Optional[str]:
    """
    get_worfklow_param is delegated to verify if the given parameter is present in the job and return it. In case of no presence None will be returned
    """
    if args is None:
        return None
    return args[arg] if arg in args else None

To reuse the code, in my opinion it is better to create a Python (whl) module and set the module in the script path of your job. That way, you can retrieve the methods with a simple import.

Without the whl module, you can proceed as follows:


def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    import sys
    from typing import Dict, Optional

    import boto3
    from awsglue.utils import getResolvedOptions

    def get_worfklow_params(args: Dict[str, str]) -> Optional[Dict[str, str]]:
        """
        get_worfklow_params is delegated to retrieve the WORKFLOW parameters
        """
        glue_client = boto3.client("glue")
        if "WORKFLOW_NAME" in args and "WORKFLOW_RUN_ID" in args:
            workflow_args = glue_client.get_workflow_run_properties(
                Name=args['WORKFLOW_NAME'], RunId=args['WORKFLOW_RUN_ID'])["RunProperties"]
            print("Found the following workflow args: \n{}".format(workflow_args))
            return workflow_args
        print("Unable to find run properties for this workflow!")
        return None

    def get_worfklow_param(args: Optional[Dict[str, str]], arg: str) -> Optional[str]:
        """
        get_worfklow_param is delegated to verify if the given parameter is present in the job and return it. In case of no presence None will be returned
        """
        if args is None:
            return None
        return args[arg] if arg in args else None

    _args = getResolvedOptions(sys.argv, ['JOB_NAME', 'WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
    worfklow_params = get_worfklow_params(_args)

    job_run_id = get_worfklow_param(_args, "WORKFLOW_RUN_ID")
    # Custom run properties live in the workflow's RunProperties map, not in the job arguments
    my_parameter = get_worfklow_param(worfklow_params, "WORKFLOW_CUSTOM_PARAMETER")

If you run a Glue job from a workflow, sys.argv (which is a list) will contain the parameters --WORKFLOW_NAME and --WORKFLOW_RUN_ID. You can use this fact to check whether a Glue job is being run by a workflow, and then retrieve the workflow run properties:

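import sys

import boto3
from awsglue.utils import getResolvedOptions


# The wrapper function name is illustrative; the original snippet used a bare `return`
def get_workflow_properties():
    # Glue adds these two arguments only when a workflow starts the job
    if '--WORKFLOW_NAME' in sys.argv and '--WORKFLOW_RUN_ID' in sys.argv:
        glue_args = getResolvedOptions(
            sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID']
        )

        glue_client = boto3.client("glue")
        workflow_args = glue_client.get_workflow_run_properties(
            Name=glue_args['WORKFLOW_NAME'], RunId=glue_args['WORKFLOW_RUN_ID']
        )["RunProperties"]

        return {**workflow_args}
    else:
        raise Exception("GlueJobNotStartedByWorkflow")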
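For the other half of the question, supplying the predicate at run time, one possible approach is to pass the properties when the workflow run is started, so they are already in place when the first job reads them. This is a minimal sketch assuming boto3's start_workflow_run accepts a RunProperties map alongside Name; the helper name and the injected client are assumptions for illustration.

```python
from typing import Any, Dict


def start_workflow_with_properties(
    glue_client: Any, workflow_name: str, run_properties: Dict[str, str]
) -> str:
    """Start a workflow run with the given run properties and return its RunId.

    `glue_client` is expected to behave like boto3.client("glue");
    this helper itself is hypothetical.
    """
    response = glue_client.start_workflow_run(
        Name=workflow_name, RunProperties=run_properties
    )
    return response["RunId"]


# Example call (requires AWS credentials, so it is left commented out):
# import boto3
# run_id = start_workflow_with_properties(
#     boto3.client("glue"), "workflow-name", {"pushdownpredicate": "some value"}
# )
```

Calling put_workflow_run_properties after the run has started, as in the CLI attempt in the question, may also work, but the job could begin before the property is written.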
