
Parametrized/reusable AWS Glue job

I am new to AWS and I'm trying to create a parameterized AWS Glue job that should take the following input parameters:

  1. Datasource
  2. Data size
  3. Count
  4. Variable List

Has anyone done something similar before?

First of all, I am not sure that you will be able to limit the data by size; instead, I suggest limiting it by the number of rows. The first two variables you can pass into your job as I described in AWS Glue Job Input Parameters. As for the variable list, if it contains a large number of variables, I am afraid you will not be able to provide these inputs in the standard way. In that case I suggest providing the variables the same way as the data, i.e. in a flat file. For example:

var1;var2;var3
1;2;3 
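
Inside the job, such a variable file can then be read back with Spark. Below is a minimal sketch, assuming the file location is passed in a hypothetical VAR_SOURCE job argument and that spark is the SparkSession created in the job (as in the example further down):

import sys
from awsglue.utils import getResolvedOptions

var_args = getResolvedOptions(sys.argv, ['VAR_SOURCE'])   # 'VAR_SOURCE' is a hypothetical argument name
variables_df = spark.read.options(delimiter=';', header=True).csv("s3://" + var_args['VAR_SOURCE'])
# The single data row holds the variable values; they arrive as strings and may need casting.
variables = variables_df.first().asDict()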

Summarizing, I suggest defining the following input variables:

  1. Datasource (the path to the place in S3 where you store the data; you can also split this variable into two variables: database and table in the Glue Data Catalog)
  2. Rows count (the number of rows which you want to select)
  3. Variables source (the path to the place in S3 where you store the file with the variables)

This is an example of the code:

import sys 
from awsglue.transforms import * 
from awsglue.utils import getResolvedOptions 
from pyspark.context import SparkContext 
from awsglue.context import GlueContext 
from awsglue.job import Job 
## @params: [JOB_NAME] 
args = getResolvedOptions(sys.argv, ['JOB_NAME','SOURCE_DB','SOURCE_TAB','NUM_ROWS','DEST_FOLDER']) 
sc = SparkContext() 
glueContext = GlueContext(sc) 
spark = glueContext.spark_session 
job = Job(glueContext) 
job.init(args['JOB_NAME'], args) 
df_new = glueContext.create_dynamic_frame.from_catalog(database = args['SOURCE_DB'], table_name = args['SOURCE_TAB'], transformation_ctx = "full_data") 
df_0 = df_new.toDF() 
df_0.createOrReplaceTempView("spark_dataframe") 
choice_data = spark.sql("Select x,y,z from spark_dataframe") 
choice_data = choice_data.limit(int(args['NUM_ROWS']))
choice_data.repartition(1).write.format('csv').mode('overwrite').options(delimiter=',',header=True).save("s3://"+ args['DEST_FOLDER'] +"/")

job.commit()

Of course, you also have to provide the proper input variables in the Glue job configuration.
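
As a rough sketch, one way to supply those arguments when starting the job is with boto3; job argument keys must be prefixed with '--' and all values are passed as strings (the job name and values below are hypothetical placeholders):

import boto3

glue = boto3.client('glue')
# Argument values are always strings; the job converts them where needed (e.g. int(args['NUM_ROWS'])).
glue.start_job_run(
    JobName='my-parameterized-job',          # hypothetical job name
    Arguments={
        '--SOURCE_DB': 'my_database',        # Glue Data Catalog database
        '--SOURCE_TAB': 'my_table',          # Glue Data Catalog table
        '--NUM_ROWS': '1000',                # number of rows to select
        '--DEST_FOLDER': 'my-bucket/output'  # S3 destination folder
    }
)

The same key/value pairs can also be set as default arguments in the job configuration itself.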

args = getResolvedOptions(sys.argv, ['JOB_NAME','source_db','source_table','count','dest_folder']) 
sc = SparkContext() 
glueContext = GlueContext(sc) 
spark = glueContext.spark_session 
job = Job(glueContext) 
job.init(args['JOB_NAME'], args) 
df_new = glueContext.create_dynamic_frame.from_catalog(database = args['source_db'], table_name = args['source_table'], transformation_ctx = "sample_data") 
df_0 = df_new.toDF() 
df_0.registerTempTable("spark_dataframe") 
new_data = spark.sql("Select * from spark_dataframe") 
sample = new_data.limit(args['count'])
sample.repartition(1).write.format('csv').options(delimiter=',',header=True).save("s3://"+ args['dest_folder'] +"/")
job.commit()

I am getting an error for the line
sample = new_data.limit(args['count'])

error: 
py4j.Py4JException: Method limit([class java.lang.String]) does not exist 

but the argument passed is not a string.
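
A likely cause: getResolvedOptions returns every job argument as a string, which is exactly what the Py4J error is pointing at, since limit() expects an integer. Casting the argument, as the first example already does with int(args['NUM_ROWS']), should resolve it:

sample = new_data.limit(int(args['count']))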
