
Guide to setting up an AWS Glue PySpark ETL job step by step

I am a beginner with AWS pipelines.

Input: I receive CSV tables in the S3 bucket RAW_input from ERPs every day; the arrival time can change. For example, ERP1_folder contains erp1_sales.csv and erp1_customer.csv, and the same applies to ERP2_folder.

Transformation: We then need to apply a tested query (the SQL files are in S3) and apply mapping + structure cleaning (Glue jobs), such as int type changes, date format changes, etc., for each table in the buckets, and write the output to the required destination.

Destination: The output of the queries is expected in Redshift tables. (We have the cluster and database ready for it.)

Requirement: I would like to set up a PySpark Glue job that triggers automatically when any file is uploaded to the S3 bucket.

Questions

  1. Can a Glue job alone do all the work, such as the S3 input trigger + mapping + cleaning?
  2. Do I still need to develop an AWS Lambda function in this process to trigger the Glue job?
  3. Allocating a query trigger per selected table per job sounds like tough work.
  4. Note: clusters in AWS Glue are created manually; I don't know whether that is the only way to do it.
  5. Any references / GitHub code for PySpark would be helpful.

I can't have user secret access keys, so I have to work only inside the AWS services using access roles and policies. No CI/CD solution is needed.

Please leave relevant comments only if you need some explanation.

What I tried: I would prefer to loop over the different Athena SQL files in S3 for the similar ERPs, for example running erp000.sql over erp000.csv and sending the result to Redshift. A rough sketch of that loop follows the code below.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# EXTRACT: read the raw CSV file from S3
df = spark.read.format("csv").load('s3://<bucket_name>/<file_path>/<file_name>.csv')

# TRANSFORM: some transformation
df = df.distinct()

# LOAD: write data to Redshift
df.write.format("jdbc").\
    option("url", "jdbc:redshift://<host>:5439/<database>").\
    option("dbtable", "<table_schema>.<table_name>").\
    option("user", "<username>").\
    option("password", "<password>").\
    mode('overwrite').save()
    
print("Data Loaded to Redshift")

I'll try to answer your questions directly:

Can a Glue job alone do all the work, such as the S3 input trigger + mapping + cleaning?

A trigger starts the ETL job execution on demand or at a specific time.
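For example, a scheduled trigger for a job can be created with boto3 along these lines (the trigger name, job name and cron expression below are only placeholders):

import boto3

glue = boto3.client("glue")

# Scheduled trigger: run the job every day at 06:00 UTC (names are placeholders)
glue.create_trigger(
    Name="daily-erp-load",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "erp-etl-job"}],
    StartOnCreation=True,
)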

Do I still need to develop an AWS Lambda function in this process to trigger the Glue job?

Nope.

Allocating a query trigger per selected table per job sounds like tough work.

With the final tables in place, you can create Glue jobs, which can be run on a schedule, on a trigger, or on demand. A job is the business logic that carries out an ETL task. So you create a job based on a selection of tables rather than one job per table.
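As a minimal sketch, a single parameterized job could loop over the tables it is given as a job argument (the --TABLES argument name, the bucket layout and the header option are assumptions, not part of your setup):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# One job handling many tables: the table list is passed as a job argument,
# e.g. --TABLES erp1_sales,erp1_customer
args = getResolvedOptions(sys.argv, ["TABLES"])

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

for table in args["TABLES"].split(","):
    df = spark.read.format("csv").option("header", "true") \
        .load(f"s3://<bucket_name>/RAW_input/{table}.csv")
    # apply the mapping / structure cleaning for this table here, then write
    # the result to Redshift with the same JDBC options shown in the question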

Clusters in AWS Glue are created manually; I don't know whether that is the only way to do it.

AWS Glue is a serverless ETL tool: you don't need to manage clusters, and you don't create clusters manually.

Any references / GitHub code for PySpark would be helpful.

https://aws.amazon.com/glue/getting-started/
https://docs.aws.amazon.com/glue/index.html
https://docs.aws.amazon.com/glue/latest/dg/trigger-job.html
