简体   繁体   中英

How to trigger AWS Lambda just once on multiple S3 notifications

We are designing a pipeline. We get a number of raw files which come into S3 buckets and then we apply a schema and then save them as parquet.

As of now we are triggering a lambda function for each file written but ideally we would like to start this process only after all the files are written. How we can we trigger the lambda just once?

I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: ie a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:

  1. Send the notifications from the publisher to an SQS queue .

  2. Schedule your lambda to run on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5min between being published and getting ingested by your system, set your lambda to trigger every 4min. You can use Cloudwatch notifications for this.

  3. When your lambda runs, poll the queue. Keep going until you accumulate the maximum amount of notifications, X, you want to process in one go, or the queue is empty.

  4. Process. If the queue wasn't empty when you stopped polling, immediately trigger another lambda execution.

Things to keep in mind on the above:

  1. As written, it's not parallel, so if your rate of lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently or 2. insert a load-balancing step: a lambda that is triggered on a schedule, polls the queue, and calls as many processing lambdas as necessary so that each one gets X notifications.

  2. SNS in general and SQS non-FIFO queues specifically don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.

Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.

Solutions:

  1. Zip all files together, Lambda unzip it
  2. create a UI code and send files one by one, trigger lambda from it when the last one is sent
  3. Lambda check files, if didn't find all files, silent quit. if it finds all files, then handle all files in one thread

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM