
AWS Glue Job Input Parameters

I am relatively new to AWS, and this may be a less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading a series of tables, each with its own job that subsequently appends audit columns. Each job is very similar, differing only in the source and target connection strings.

Is there a way to parameterize these jobs to allow for reuse, simply passing the proper connection strings to them? Or even loop through a set of connection strings in a master job that calls a child job, passing the varying connection strings through?

Any examples or documentation would be most appreciated.

In the example below I show how to use Glue job input parameters in the code. The code takes the input parameters and writes them to a flat file.

  1. Set the input parameters in the job configuration.

[screenshot: input parameters set in the Glue job configuration]

  2. The code of the Glue job:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
 
## @params: [JOB_NAME, VAL1, VAL2, VAL3, DEST_FOLDER]
# Resolve all job parameters in a single call
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'VAL1', 'VAL2', 'VAL3', 'DEST_FOLDER'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Collect the resolved input parameters into a single-row DataFrame
v_list = [{"VAL1": args['VAL1'], "VAL2": args['VAL2'], "VAL3": args['VAL3']}]
df = sc.parallelize(v_list).toDF()

# Write the parameters out as a single semicolon-delimited CSV file
df.repartition(1).write.mode('overwrite').format('csv').options(header=True, delimiter=';').save("s3://" + args['DEST_FOLDER'] + "/")

job.commit()
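To see how the parameters configured in step 1 reach the script, note that Glue passes each job parameter on the command line as `--NAME value`, which `getResolvedOptions` then parses. As a rough illustration (a simplified stand-in for `awsglue.utils.getResolvedOptions`, not the real implementation, which also handles Glue's own reserved arguments):

```python
import sys

def resolve_options(argv, option_names):
    """Simplified sketch of getResolvedOptions: Glue invokes the script
    with each job parameter passed as '--NAME value' on the command line."""
    resolved = {}
    for name in option_names:
        flag = '--' + name
        if flag not in argv:
            raise RuntimeError('Missing required argument: ' + flag)
        resolved[name] = argv[argv.index(flag) + 1]
    return resolved

# Simulated command line, as Glue would build it from the job parameters above
argv = ['script.py', '--JOB_NAME', 'example_job',
        '--VAL1', 'value1', '--VAL2', 'value2',
        '--VAL3', 'value3', '--DEST_FOLDER', 'my-bucket/output']
args = resolve_options(argv, ['JOB_NAME', 'VAL1', 'VAL2', 'VAL3', 'DEST_FOLDER'])
print(args['DEST_FOLDER'])  # → my-bucket/output
```

This is why the argument keys must carry the `--` prefix when a run is started programmatically.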
  3. It is also possible to provide input parameters via boto3, CloudFormation, or Step Functions. This example shows how to do it using boto3:
import boto3

def lambda_handler(event, context):
    glue = boto3.client('glue')

    # Glue expects argument keys to be prefixed with '--' so that
    # getResolvedOptions can pick them up inside the job script
    my_job = glue.create_job(
        Name='example_job2',
        Role='AWSGlueServiceDefaultRole',
        Command={'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
        DefaultArguments={'--VAL1': 'value1', '--VAL2': 'value2', '--VAL3': 'value3'}
    )

    # Arguments supplied at run time override the job's DefaultArguments
    glue.start_job_run(
        JobName=my_job['Name'],
        Arguments={'--VAL1': 'value11', '--VAL2': 'value22', '--VAL3': 'value33'}
    )
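Since forgetting the `--` prefix is a common pitfall with the Glue Jobs API, a small helper (my own illustrative addition, not part of boto3) can build the argument dictionary:

```python
def to_glue_arguments(params):
    """Prefix each key with '--' as the Glue Jobs API expects,
    e.g. {'VAL1': 'x'} -> {'--VAL1': 'x'}."""
    return {'--' + key: value for key, value in params.items()}

print(to_glue_arguments({'VAL1': 'value11', 'VAL2': 'value22'}))
# → {'--VAL1': 'value11', '--VAL2': 'value22'}
```

The result can then be passed directly as `DefaultArguments` to `create_job` or as `Arguments` to `start_job_run`.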

Useful links:

  1. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html
  2. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
  3. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_job
  4. https://docs.aws.amazon.com/step-functions/latest/dg/connectors-glue.html

Is it possible to derive the account when you are running a Glue job? For example, which environment it is being triggered from: DEV, Test, or PROD?
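One common approach to the question above is to ask STS which account the job's credentials belong to and map that to an environment name. A minimal sketch, assuming the account-to-environment mapping (and the account IDs) are your own configuration; the optional `sts_client` parameter exists only so the function can be exercised without AWS credentials:

```python
def current_account_id(sts_client=None):
    """Return the AWS account ID the current credentials belong to,
    using STS GetCallerIdentity."""
    if sts_client is None:
        import boto3  # deferred so the sketch can be read/tested without boto3
        sts_client = boto3.client('sts')
    return sts_client.get_caller_identity()['Account']

# Hypothetical account IDs -- replace with your own mapping
ENV_BY_ACCOUNT = {
    '111111111111': 'DEV',
    '222222222222': 'Test',
    '333333333333': 'PROD',
}

def current_environment(sts_client=None):
    """Map the running account to an environment label."""
    return ENV_BY_ACCOUNT.get(current_account_id(sts_client), 'UNKNOWN')
```

Inside a Glue job this would be called with no arguments, picking up the job's IAM role credentials automatically.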
