Guide to setting up an AWS Glue PySpark ETL job step by step

I am a beginner with AWS pipelines.

Input: I receive CSV tables in the S3 bucket RAW_input from ERPs every day; the arrival time can vary. For example, ERP1_folder contains erp1_sales.csv and erp1_customer.csv, and the same applies to ERP2_folder.

Transformation: We then need to apply tested queries (the SQL files are stored in S3) and perform mapping and structure cleaning (Glue jobs), such as integer casts and date format changes, for each table in the buckets, writing the output to the required destination.

Destination: The output of the queries is expected in Redshift tables. (We already have the cluster and database ready for it.)

Requirement: I would like to set up a PySpark Glue job that triggers automatically when any file is uploaded into the S3 bucket.

Questions

  1. Can a Glue job alone do all the work, i.e. the S3 input trigger + mapping + cleaning?
  2. Do I still need to develop an AWS Lambda function in this process to trigger the Glue job?
  3. Allocating a query trigger per selected table per job sounds like tough work.
  4. Note: the clusters in AWS Glue were created manually; I don't know whether that is the only way to do it.
  5. Any references / GitHub code for PySpark would be helpful.

I can't use user secret access keys, so I have to work only inside AWS services using IAM roles and policies. No CI/CD solution is needed.

Please comment if you need any further explanation.

What I tried: I would prefer looping over the different Athena SQL files in S3 for the similar ERPs; for example, erp000.sql runs over erp000.csv and the data is sent to Redshift (a sketch of this looping idea follows the code below).

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# EXTRACT: read the CSV file from S3
df = spark.read.format("csv").load('s3://<bucket_name>/<file_path>/<file_name>.csv')

# TRANSFORM: some transformation (here just de-duplication)
df = df.distinct()

# LOAD: write the data to Redshift over JDBC
df.write.format("jdbc").\
    option("url", "jdbc:redshift://<host>:5439/<database>").\
    option("dbtable", "<table_schema>.<table_name>").\
    option("user", "<username>").\
    option("password", "<password>").\
    mode('overwrite').save()

print("Data loaded to Redshift")

I'll try to answer your questions directly:

Can a Glue job alone do all the work, i.e. the S3 input trigger + mapping + cleaning?

Yes. A Glue trigger starts the ETL job execution on demand, on a schedule, or when another job completes, and the mapping and cleaning logic lives inside the job script itself.
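For example, a scheduled trigger can be attached to the job with boto3 (a sketch; the trigger name, job name, and cron expression below are placeholders):

import boto3

glue = boto3.client("glue")

# Run the ETL job every day at 06:00 UTC (example schedule; adjust as needed)
glue.create_trigger(
    Name="daily-erp-load",                 # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "erp-etl-job"}],  # placeholder Glue job name
    StartOnCreation=True,
)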

Do I still need to develop an AWS Lambda function in this process to trigger the Glue job?

Nope

Allocating a query trigger per selected table per job sounds like tough work.

With the final tables in place, you can create Glue jobs, which can be run on a schedule, on a trigger, or on demand. A job is the business logic that carries out an ETL task, so you create a job based on a selection of tables rather than one job per table.
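One way to do this (just a sketch; --ERP_NAME and --TARGET_TABLE are hypothetical job parameter names) is to pass the table or ERP name to the job at run time, so the same script serves a whole selection of tables:

import sys
from awsglue.utils import getResolvedOptions

# --ERP_NAME and --TARGET_TABLE are hypothetical job parameters passed at run time,
# e.g. via the trigger's Arguments or start_job_run(..., Arguments={...})
args = getResolvedOptions(sys.argv, ["JOB_NAME", "ERP_NAME", "TARGET_TABLE"])

erp_name = args["ERP_NAME"]           # e.g. "erp000"
target_table = args["TARGET_TABLE"]   # e.g. "<table_schema>.erp000_sales"

# ... the rest of the script reads the matching CSV, runs the matching SQL,
# and writes to target_table, much like the code in the question.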

Clusters in AWS Glue are created manually; I don't know whether that is the only way to do it.

AWS Glue is a serverless ETL service: you don't need to manage clusters, and you don't create clusters manually.
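You only choose how much capacity each job run may use when you define the job, for example (a sketch; the job name, IAM role ARN, script path, and worker sizes are placeholders):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="erp-etl-job",                                  # placeholder job name
    Role="arn:aws:iam::<account-id>:role/<glue-role>",   # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<bucket_name>/scripts/erp_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",     # Glue allocates these workers on demand per run
    NumberOfWorkers=2,
)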

Any references / GitHub code for PySpark would be helpful.

https://aws.amazon.com/glue/getting-started/
https://docs.aws.amazon.com/glue/index.html
https://docs.aws.amazon.com/glue/latest/dg/trigger-job.html
