
AWS Glue Job - pass glue catalog table names as parameters

I have an AWS Glue job written in PySpark that loads data from S3/a Glue catalog database into Snowflake. How can I pass table names as parameters and run the AWS Glue job in parallel?

Can we do it inside the Glue job itself, or do we need Lambda functions?

Please suggest an approach and share any code or articles.

Thank you in advance.

Thanks, Jo

AWS Glue lets you supply your own script, so it is very flexible. You can pass table names as parameters:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html
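For example, a comma-separated list of tables could be read with `getResolvedOptions`. This is a minimal sketch; the parameter name `table_names` is just an illustration, passed when the job is started (e.g. `--table_names "orders,customers"`):

```python
import sys

from awsglue.utils import getResolvedOptions

# 'table_names' is a hypothetical job parameter holding a
# comma-separated list of Glue catalog table names.
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'table_names'])
table_names = [t.strip() for t in args['table_names'].split(',')]
```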

In this case, the Glue job can process these tables sequentially (see the sketch after this list):

  • parse the input parameters
  • loop over the tables:
    • create_dynamic_frame to read the table
    • transform if necessary
    • write to Snowflake
    • process the next table
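A minimal sketch of that loop, assuming the Snowflake Spark connector is attached to the job; the catalog database name and all `sf*` connection options are placeholders:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# In practice, parsed from the job parameters as shown above.
table_names = ["orders", "customers"]

# Placeholder Snowflake connection options for the Spark connector;
# real values would come from job configuration or Secrets Manager.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

for table in table_names:
    # Read the table from the Glue Data Catalog
    # ('my_catalog_db' is a placeholder database name).
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db", table_name=table
    )
    # Transform if necessary, then write to Snowflake
    # through the Snowflake Spark connector.
    (dyf.toDF()
        .write
        .format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", table)
        .mode("overwrite")
        .save())
```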

If you want to run a separate Glue job run for each table so they are processed in parallel, then pass only one table name to the Glue job and call the same job multiple times, once per table (see the example after the link below).

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
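One way to fan out the runs, from a Lambda function or any script using boto3; the job name, parameter name, and table list below are placeholders:

```python
import boto3

glue = boto3.client("glue")

# 'my-glue-job' and '--table_name' are placeholders;
# each run receives exactly one table name.
for table in ["orders", "customers", "invoices"]:
    response = glue.start_job_run(
        JobName="my-glue-job",
        Arguments={"--table_name": table},
    )
    print(table, response["JobRunId"])
```

Note that the job's "Max concurrent runs" setting defaults to 1, so it must be raised for these runs to actually execute in parallel.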

Glue provisions a Spark cluster for each run, sized by the "Number of workers" setting.

I do not know how many tables you will process or how often the Glue job will run, but it could be better to process the tables sequentially with a bigger cluster to utilize resources efficiently.
