Guide to setting up an AWS Glue PySpark ETL job, step by step
I am a beginner with AWS pipelines.
Input: I receive CSV tables in the S3 bucket RAW_input from ERPs every day; the arrival time can vary. For example, ERP1_folder contains erp1_sales.csv and erp1_customer.csv, and the same goes for ERP2_folder.
Transformation: Then we need to apply a tested query (the SQL files are in S3) and apply mapping + structure cleaning (Glue jobs), such as int conversion, date-format changes, etc., for each table in the buckets, and write the output to the required destination.
Destination: The output of the queries is expected in Redshift tables. (We have clusters and a database ready for it.)
Req: I would like to set up a PySpark Glue job that triggers automatically when any file is uploaded into the S3 bucket.
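One common pattern for the upload-triggered requirement above is an S3 event notification that invokes a small function, which in turn starts the Glue job. A minimal sketch; the job name ErpGlueJob and the argument names are placeholders, not from the original setup:

```python
def extract_s3_objects(event):
    """Pull (bucket, key) pairs out of a standard S3 notification event."""
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]

def lambda_handler(event, context):
    # boto3 is provided by the Lambda runtime; imported here so the
    # parsing helper above stays testable outside AWS.
    import boto3
    glue = boto3.client("glue")
    started = []
    for bucket, key in extract_s3_objects(event):
        # "ErpGlueJob" is hypothetical; the uploaded object is passed
        # through as job arguments so the job knows which file to read.
        resp = glue.start_job_run(
            JobName="ErpGlueJob",
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        started.append(resp["JobRunId"])
    return {"started": started}
```

Whether a Lambda is strictly necessary depends on how the trigger is wired; this is just one way to react to uploads.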
Questions:
Can a Glue job alone do all the work, i.e. input S3 trigger + mapping + cleaning?
Do I still need to develop an AWS Lambda function in this process to trigger the Glue job?
Allocating a query trigger per selected table per job sounds like tough work.
Clusters in AWS Glue are created manually; I don't know whether that is the only way to do it.
I can't have user secret access keys, so I have to work only inside AWS services using access roles and policies.
No CI/CD solution needed.
Please comment only if you need some explanation.
I tried: I prefer looping over different Athena SQL files in S3 for similar ERPs. For example, erp000.sql runs over erp000.csv and sends the data to Redshift:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# EXTRACT: read CSV data from S3
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("s3://<bucket_name>/<file_path>/<file_name>.csv")

# TRANSFORM: some transformation (here: drop duplicate rows)
df = df.distinct()

# LOAD: write data to Redshift over JDBC
df.write.format("jdbc") \
    .option("url", "jdbc:redshift://<host>:5439/<database>") \
    .option("dbtable", "<table_schema>.<table_name>") \
    .option("user", "<username>") \
    .option("password", "<password>") \
    .mode("overwrite") \
    .save()
print("Data loaded to Redshift")
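The "loop over SQL files" idea could be sketched as below. The erpNNN.sql-next-to-erpNNN.csv naming convention, the sql/ prefix, and registering the CSV as a temp view named after the file are my assumptions, not established conventions:

```python
import os

def sql_key_for_csv(csv_key, sql_prefix="sql/"):
    """Map an input CSV key to its Athena SQL file, e.g.
    'RAW_input/ERP1_folder/erp000.csv' -> 'sql/erp000.sql'."""
    base = os.path.splitext(os.path.basename(csv_key))[0]
    return f"{sql_prefix}{base}.sql"

def run_erp_query(spark, s3, csv_bucket, csv_key, sql_bucket):
    # Register the CSV as a temp view named after the file, so the
    # stored SQL can reference it by that name (a convention I chose).
    base = os.path.splitext(os.path.basename(csv_key))[0]
    df = spark.read.option("header", "true").csv(f"s3://{csv_bucket}/{csv_key}")
    df.createOrReplaceTempView(base)
    # Fetch the matching SQL text from S3 and run it with Spark SQL;
    # the result can then be written to Redshift as in the job above.
    obj = s3.get_object(Bucket=sql_bucket, Key=sql_key_for_csv(csv_key))
    query = obj["Body"].read().decode("utf-8")
    return spark.sql(query)
```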
I'll try to answer your questions directly:
Can a Glue job alone do all the work, such as input S3 trigger + mapping + cleaning?
A trigger starts the ETL job execution on demand or at a specific time.
Do I still need to develop an AWS Lambda function in this process to trigger the Glue job?
Nope.
Allocating a query trigger per selected table per job sounds like tough work.
With the final tables in place, you can create Glue jobs, which can be run on a schedule, on a trigger, or on demand. A job is business logic that carries out an ETL task. So you create a job based on a selection of tables, not a job per table.
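The "one job for a selection of tables" approach can be made concrete by passing the table list as a job parameter and iterating inside the job. A sketch under my own naming assumptions (folder names like ERP1_folder, files like erp1_sales.csv, a public Redshift schema):

```python
def plan_table_runs(erp_folders, tables, raw_bucket="RAW_input"):
    """Expand ERP folders x table names into (csv_path, target_table)
    pairs that a single Glue job can loop over."""
    runs = []
    for folder in erp_folders:
        # "ERP1_folder" -> "erp1", matching the file-name prefix
        erp = folder.rstrip("/").split("_")[0].lower()
        for table in tables:
            csv_path = f"s3://{raw_bucket}/{folder}/{erp}_{table}.csv"
            runs.append((csv_path, f"public.{erp}_{table}"))
    return runs
```

Each (csv_path, target_table) pair would then go through the read/transform/write steps from the question's script.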
Clusters in AWS Glue are created manually; I don't know whether that is the only way to do it.
AWS Glue is a serverless ETL tool: you don't manage clusters, and you don't create them manually.
Any references / GitHub code for PySpark would be helpful.
https://aws.amazon.com/glue/getting-started/
https://docs.aws.amazon.com/glue/index.html
https://docs.aws.amazon.com/glue/latest/dg/trigger-job.html