
AWS Glue ETL Job triggered on batches of S3 Events

Original post: 2019-04-15 22:22:15. Tags: amazon-web-services, bigdata, etl, aws-glue

I have an S3 bucket that gets many files dropped into it (1000 records/min). I want to trigger a Glue ETL job on batches of these dropped files.

I have looked at using Firehose to aggregate the events into batches, but that requires a lot of chained resources, like S3 -> Lambda -> Firehose -> ...

What is the best way to process my data in batches?

1 Answer

Could you use AWS Glue job triggers, which let you run the Glue job at scheduled intervals, rather than triggering it from S3 events?

Are you processing streaming data? With the limited information you've given, I don't see a use case for Firehose.
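To make the scheduled-trigger suggestion concrete, here is a minimal sketch using boto3's Glue `create_trigger` call. The job name `batch-etl-job`, the trigger naming scheme, and the helper functions are hypothetical and not from the original post. It assumes the Glue job already exists and that the job itself keeps track of which files it has already processed. As far as I know, Glue schedules use the six-field AWS cron syntax in UTC and do not go finer than 5 minutes, so the helper rejects shorter intervals.

```python
from typing import Any, Dict


def build_scheduled_trigger(job_name: str, every_minutes: int) -> Dict[str, Any]:
    """Build keyword arguments for Glue's CreateTrigger API (SCHEDULED type)."""
    # Assumption: Glue schedules have a minimum precision of 5 minutes.
    if not 5 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 5 and 59")
    return {
        "Name": f"{job_name}-every-{every_minutes}-min",  # hypothetical naming scheme
        "Type": "SCHEDULED",
        # Six-field AWS cron: minutes hours day-of-month month day-of-week year.
        "Schedule": f"cron(0/{every_minutes} * * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_trigger(params: Dict[str, Any]) -> None:
    """Send the request. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder above runs without boto3

    boto3.client("glue").create_trigger(**params)


# "batch-etl-job" is a placeholder for an existing Glue job.
params = build_scheduled_trigger("batch-etl-job", every_minutes=5)
print(params["Schedule"])  # cron(0/5 * * * ? *)
```

Calling `create_trigger(params)` with valid credentials would register the schedule. Each run then works on whatever has accumulated in the bucket, which gives batching without a chain of S3 -> Lambda -> Firehose resources.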


