
Parametrized/reusable AWS Glue job

I am new to AWS, and I'm trying to create a parameterized AWS Glue job that takes the following input parameters:

  1. Datasource
  2. Data size
  3. Count
  4. Variable List

Has anyone done something similar before?

First of all, I am not sure that you will be able to limit the data by size; I suggest limiting it by number of rows instead. The first two variables can be passed to your job as I described in AWS Glue Job Input Parameters. As for the variable list, if there are many variables, I am afraid you will not be able to provide them all through the standard job parameters. In that case I suggest providing them the same way as the data, i.e. via a flat file. For example:

var1;var2;var3
1;2;3 
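
A minimal sketch of how such a variables file could be read inside the job (assuming the spark session and args from the example further below, and a hypothetical VARS_FOLDER job parameter holding the S3 folder of the file):

# Read the semicolon-delimited variables file from S3; VARS_FOLDER is a placeholder parameter name
vars_df = spark.read.options(delimiter=';', header=True).csv("s3://" + args['VARS_FOLDER'] + "/")

# Collect the variable names and values from the single data row
variables = vars_df.first().asDict()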

To summarize, I suggest defining the following input variables:

  1. Datasource (the S3 path where the data is stored; you can also split this into two variables - database and table in the Glue Data Catalog)
  2. Row count (the number of rows you want to select)
  3. Variables source (the S3 path of the file with the variables)

Here is an example of the code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
# Resolve the job parameters defined in the Glue job configuration
args = getResolvedOptions(sys.argv, ['JOB_NAME','SOURCE_DB','SOURCE_TAB','NUM_ROWS','DEST_FOLDER'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Load the source table from the Glue Data Catalog
df_new = glueContext.create_dynamic_frame.from_catalog(database = args['SOURCE_DB'], table_name = args['SOURCE_TAB'], transformation_ctx = "full_data")
df_0 = df_new.toDF()
df_0.createOrReplaceTempView("spark_dataframe")

# Select the required columns and limit the result to the requested number of rows
choice_data = spark.sql("Select x,y,z from spark_dataframe")
choice_data = choice_data.limit(int(args['NUM_ROWS']))

# Write the result as a single CSV file to the destination folder in S3
choice_data.repartition(1).write.format('csv').mode('overwrite').options(delimiter=',',header=True).save("s3://"+ args['DEST_FOLDER'] +"/")

job.commit()

Of course, you also have to define the corresponding input parameters in the Glue job configuration.
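
For illustration, here is a rough sketch of how the job could be started with these parameters via boto3 (the job name and values are placeholders; Glue expects the parameter keys prefixed with "--" in the Arguments map, and all values are passed as strings):

import boto3

glue = boto3.client('glue')

# Parameter keys carry a leading "--"; values are always strings
glue.start_job_run(
    JobName='my-parameterized-job',
    Arguments={
        '--SOURCE_DB': 'my_database',
        '--SOURCE_TAB': 'my_table',
        '--NUM_ROWS': '1000',
        '--DEST_FOLDER': 'my-bucket/output'
    }
)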

I tried the following code:

args = getResolvedOptions(sys.argv, ['JOB_NAME','source_db','source_table','count','dest_folder'])
sc = SparkContext() 
glueContext = GlueContext(sc) 
spark = glueContext.spark_session 
job = Job(glueContext) 
job.init(args['JOB_NAME'], args) 
df_new = glueContext.create_dynamic_frame.from_catalog(database = args['source_db'], table_name = args['source_table'], transformation_ctx = "sample_data") 
df_0 = df_new.toDF() 
df_0.registerTempTable("spark_dataframe") 
new_data = spark.sql("Select * from spark_dataframe") 
sample = new_data.limit(args['count'])
sample.repartition(1).write.format('csv').options(delimiter=',',header=True).save("s3://"+ args['dest_folder'] +"/")
job.commit()

I am getting an error for this line:
sample = new_data.limit(args['count'])

error: 
py4j.Py4JException: Method limit([class java.lang.String]) does not exist 

but the argument passed is not a string.
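As far as I can tell, getResolvedOptions returns every job parameter as a string, which is why limit() receives a java.lang.String. Note that the first example above casts the value with int(); doing the same here should resolve the error:

sample = new_data.limit(int(args['count']))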
