How do I trigger an AWS Lambda function only after a bulk upload to S3 has finished?

We have the following simple ETL setup:

  1. Our vendor uploads crawled Parquet data to our S3 bucket.
  2. An S3 event triggers a Lambda function, which in turn starts a Glue crawler to update the existing table partitions in Glue.

This works fine most of the time, but in some cases our vendor uploads many files in quick succession, for example when refreshing historical data. This causes a problem because a Glue crawler cannot run concurrently with itself, so the job fails.

I'm wondering if there is anything we can do to avoid this error. I've looked into SQS but am not sure whether it can help. Below is what I would like to achieve:

  1. The vendor uploads a file to S3.
  2. S3 sends an event to SQS.
  3. SQS holds the event and waits until no further event has arrived for a given period, say 5 minutes.
  4. Once 5 minutes have passed with no further event, SQS triggers the Lambda function to run the Glue crawler.

Is this doable with S3 and SQS?

SQS holds the event,

Yes, you can do this: an SQS delay queue can postpone the delivery of each message by up to 15 minutes.
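As a minimal sketch of the delay setting, assuming boto3 and an existing standard queue: the queue URL and the helper names `delay_attributes` and `apply_delay` are hypothetical, while `set_queue_attributes` and the `DelaySeconds` attribute (0 to 900 seconds, passed as a string) are part of the SQS API. The client is passed in as an argument so the helpers can be exercised without AWS credentials.

```python
MAX_SQS_DELAY_SECONDS = 900  # SQS allows a delivery delay of at most 15 minutes


def delay_attributes(delay_seconds):
    """Build the queue attributes for a given delivery delay in seconds."""
    if not 0 <= delay_seconds <= MAX_SQS_DELAY_SECONDS:
        raise ValueError("SQS delay must be between 0 and 900 seconds")
    # SQS queue attribute values are always sent as strings.
    return {"DelaySeconds": str(delay_seconds)}


def apply_delay(sqs, queue_url, delay_seconds):
    """Set the default delivery delay on an existing queue.

    `sqs` is expected to be a boto3 SQS client, e.g. boto3.client("sqs").
    """
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=delay_attributes(delay_seconds),
    )


if __name__ == "__main__":
    import boto3  # imported here so the helpers above do not require AWS access

    # Hypothetical queue URL; replace with the URL of your own queue.
    apply_delay(
        boto3.client("sqs"),
        "https://sqs.us-east-1.amazonaws.com/123456789012/vendor-uploads",
        300,
    )
```

Note that the delay applies to each message independently. It does not restart when another upload arrives, so on its own it does not give the "no further event for 5 minutes" behaviour.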

waits until no further event has arrived for a given period, say 5 minutes.

No, there is no automated way to do that. You have to develop your own custom solution. The simplest approach would be not to couple SQS with Lambda at all, and instead have a Lambda function run on a schedule (e.g. every 5 minutes). The function would need logic to determine whether no new files have been uploaded for some time, and only then trigger your Glue job. This would probably involve DynamoDB to keep track of the last uploaded files between Lambda executions.
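The scheduled approach could be sketched roughly as follows. This is only one possible shape, under several assumptions that are not in the original post: a DynamoDB table `etl-upload-state` with a single item keyed by `pk`, a crawler named `vendor-parquet-crawler`, timestamps stored as whole epoch seconds, and two Lambda functions (one triggered by S3 that calls `record_upload`, one triggered by a schedule that calls `scheduled_check`). All of these names are hypothetical. The boto3 calls used (`get_item`, `update_item`, `get_crawler`, `start_crawler`) exist in the DynamoDB Table resource and the Glue client.

```python
import time

QUIET_SECONDS = 300                      # the 5-minute quiet period from the question
STATE_KEY = {"pk": "vendor-uploads"}     # hypothetical key of the single state item
CRAWLER_NAME = "vendor-parquet-crawler"  # hypothetical crawler name
TABLE_NAME = "etl-upload-state"          # hypothetical DynamoDB table name


def should_start_crawler(last_upload_ts, last_crawl_ts, now, quiet_seconds=QUIET_SECONDS):
    """True when there are uploads not yet crawled and none arrived recently."""
    if last_upload_ts is None:
        return False
    has_uncrawled = last_crawl_ts is None or last_upload_ts > last_crawl_ts
    is_quiet = now - last_upload_ts >= quiet_seconds
    return has_uncrawled and is_quiet


def record_upload(table, now):
    """S3-triggered Lambda: remember when the most recent file arrived."""
    table.update_item(
        Key=STATE_KEY,
        UpdateExpression="SET last_upload_ts = :t",
        ExpressionAttributeValues={":t": int(now)},
    )


def scheduled_check(table, glue, now):
    """Scheduled Lambda: start the crawler once uploads have gone quiet."""
    now = int(now)  # DynamoDB returns Decimal values, which do not mix with floats
    item = table.get_item(Key=STATE_KEY).get("Item", {})
    last_upload = item.get("last_upload_ts")
    last_crawl = item.get("last_crawl_ts")
    if not should_start_crawler(last_upload, last_crawl, now):
        return False
    # Skip this round if a previous crawl is still running or stopping.
    if glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        return False
    glue.start_crawler(Name=CRAWLER_NAME)
    table.update_item(
        Key=STATE_KEY,
        UpdateExpression="SET last_crawl_ts = :t",
        ExpressionAttributeValues={":t": now},
    )
    return True


def scheduled_handler(event, context):
    """Entry point for the scheduled Lambda function."""
    import boto3  # imported lazily so the helpers above can be tested without AWS

    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    return {"started": scheduled_check(table, boto3.client("glue"), time.time())}
```

With a 5-minute schedule and a 5-minute quiet period, the crawler starts somewhere between 5 and 10 minutes after the last upload. Files that arrive while the crawler is running get a newer `last_upload_ts` than `last_crawl_ts`, so a later scheduled run picks them up.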

