How do I trigger an AWS Lambda function only after a bulk upload to S3 has finished?

We have a simple ETL setup:

  1. A vendor uploads crawled Parquet data to our S3 bucket.
  2. The S3 event triggers a Lambda function, which starts a Glue crawler to update the existing table partitions in Glue.

This works fine most of the time, but in some cases our vendor uploads several files in quick succession, for example when refreshing historical data. This causes a problem because the Glue crawler cannot run concurrently, so the job fails.

I'm wondering whether there is anything we can do to avoid this error. I've looked into SQS but I'm not sure whether it can help. Below is what I would like to achieve:

  1. The vendor uploads a file to S3.
  2. S3 sends an event to SQS.
  3. SQS holds the event and waits until there has been no other following event for a given time period, say 5 minutes.
  4. After 5 minutes with no further events, SQS triggers the Lambda function to run the Glue crawler.

Is this doable with S3 and SQS?

> SQS holds the event,

Yes, you can do this: an SQS queue can be configured with a delivery delay of up to 15 minutes.
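As a rough illustration of that setting, the sketch below applies a queue-level `DelaySeconds` through boto3's `set_queue_attributes`. The queue URL is a placeholder, and the helper names `delay_attributes` and `apply_delay` are made up for this example. Keep in mind that the delay counts from the moment each message is sent and is not reset by later messages.

```python
MAX_DELAY_SECONDS = 900  # SQS caps DelaySeconds at 15 minutes


def delay_attributes(seconds):
    """Build the Attributes payload for a queue-level delivery delay."""
    if not 0 <= seconds <= MAX_DELAY_SECONDS:
        raise ValueError("SQS delay must be between 0 and 900 seconds")
    # SQS expects attribute values as strings.
    return {"DelaySeconds": str(seconds)}


def apply_delay(queue_url, seconds):
    """Set the delay on an existing queue (needs AWS credentials)."""
    import boto3  # imported lazily so delay_attributes works without boto3

    sqs = boto3.client("sqs")
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=delay_attributes(seconds),
    )


# Example call with a placeholder queue URL:
# apply_delay("https://sqs.us-east-1.amazonaws.com/123456789012/vendor-uploads", 300)
```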

> waits until there has been no other following event for a given time period, say 5 minutes.

No, there is no automated way to do that; you have to build your own solution. The simplest approach is to not connect SQS to Lambda as a trigger, and instead run the Lambda function on a schedule (e.g. every 5 minutes). The function would need logic to determine whether no new files have been uploaded for some time, and only then start your Glue crawler. This would probably involve DynamoDB to keep track of the last uploaded files between Lambda executions.
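One possible shape for this is sketched below, under several assumptions: a DynamoDB table named `upload-tracker` with a single item keyed by `pk`, a crawler named `vendor-crawler`, and two Lambda handlers (`on_s3_event` and `on_schedule`). All of these names are hypothetical. The S3-triggered handler only records the time of the latest upload; the scheduled handler starts the crawler once uploads have been quiet for 5 minutes.

```python
import time

QUIET_PERIOD_SECONDS = 5 * 60    # quiet window from the question
TABLE_NAME = "upload-tracker"    # hypothetical DynamoDB table
CRAWLER_NAME = "vendor-crawler"  # hypothetical Glue crawler
KEY = {"pk": "vendor-uploads"}   # hypothetical single tracking item


def should_start_crawler(last_upload_ts, last_crawl_ts, now,
                         quiet=QUIET_PERIOD_SECONDS):
    """Return True when uploads have gone quiet and there is uncrawled data."""
    if last_upload_ts is None:
        return False  # nothing has been uploaded yet
    if last_crawl_ts is not None and last_crawl_ts > last_upload_ts:
        return False  # the latest upload has already been crawled
    return now - last_upload_ts >= quiet


def _table():
    import boto3  # imported lazily so the pure logic can be tested offline
    return boto3.resource("dynamodb").Table(TABLE_NAME)


def on_s3_event(event, context):
    """S3-triggered Lambda: only record when the latest upload arrived."""
    _table().update_item(
        Key=KEY,
        UpdateExpression="SET last_upload_ts = :t",
        ExpressionAttributeValues={":t": int(time.time())},
    )


def on_schedule(event, context):
    """Scheduled Lambda: start the crawler once uploads have gone quiet."""
    import boto3

    table = _table()
    item = table.get_item(Key=KEY).get("Item", {})
    now = int(time.time())
    if not should_start_crawler(item.get("last_upload_ts"),
                                item.get("last_crawl_ts"), now):
        return "skipped"

    glue = boto3.client("glue")
    try:
        glue.start_crawler(Name=CRAWLER_NAME)
    except glue.exceptions.CrawlerRunningException:
        return "already running"  # retry on the next scheduled run

    table.update_item(
        Key=KEY,
        UpdateExpression="SET last_crawl_ts = :t",
        ExpressionAttributeValues={":t": now},
    )
    return "started"
```

Keeping the decision in `should_start_crawler` as a pure function makes the quiet-period rule easy to check without AWS access. If a file arrives while the crawler is running, `last_upload_ts` ends up later than `last_crawl_ts`, so a later scheduled run picks it up.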
