
How to trigger AWS Lambda just once on multiple S3 notifications

We are designing a pipeline. A number of raw files arrive in S3 buckets; we apply a schema to them and save them as Parquet.

At the moment we trigger a Lambda function for each file written, but ideally we would like to start this process only after all the files have been written. How can we trigger the Lambda just once?

I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process them. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on, i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that and can't change the publisher, I suggest the following:

  1. Send the notifications from the publisher to an SQS queue.

  2. Run your Lambda on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5 min between being published and being ingested by your system, set your Lambda to trigger every 4 min. You can use a CloudWatch Events (now Amazon EventBridge) scheduled rule for this.

  3. When your Lambda runs, poll the queue. Keep going until you have accumulated the maximum number of notifications, X, that you want to process in one go, or until the queue is empty.

  4. Process. If the queue wasn't empty when you stopped polling, immediately trigger another Lambda execution.
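
The four steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes S3 delivers its event notifications directly to the SQS queue (no SNS wrapper), and the `QUEUE_URL` environment variable, the `MAX_NOTIFICATIONS` cap and the `process` placeholder are hypothetical names introduced for illustration.

```python
import json
import os
from urllib.parse import unquote_plus

MAX_NOTIFICATIONS = 100  # "X" from step 3; hypothetical value, tune to your workload


def drain_queue(sqs, queue_url, limit=MAX_NOTIFICATIONS):
    """Poll SQS until `limit` messages are collected or the queue looks empty.

    Returns (messages, queue_empty).
    """
    messages = []
    while len(messages) < limit:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=min(10, limit - len(messages)),  # SQS returns at most 10 per call
            WaitTimeSeconds=1,  # long polling, so an empty reply is a reasonable "empty" signal
        )
        batch = resp.get("Messages", [])
        if not batch:
            return messages, True
        messages.extend(batch)
    return messages, False


def s3_objects(messages):
    """Extract (bucket, key) pairs from S3 event notification bodies."""
    pairs = []
    for msg in messages:
        # Bodies without "Records" (e.g. the S3 test event) are skipped.
        for record in json.loads(msg["Body"]).get("Records", []):
            s3 = record["s3"]
            # Object keys arrive URL-encoded in S3 notifications.
            pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def process(pairs):
    """Placeholder for the schema + Parquet step, which the question does not show."""
    print(f"processing {len(pairs)} objects")


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be exercised without AWS

    queue_url = os.environ["QUEUE_URL"]  # hypothetical configuration
    sqs = boto3.client("sqs")
    messages, queue_empty = drain_queue(sqs, queue_url)
    if messages:
        process(s3_objects(messages))
        # Delete only after processing succeeded, in batches of at most 10 entries.
        for i in range(0, len(messages), 10):
            sqs.delete_message_batch(
                QueueUrl=queue_url,
                Entries=[{"Id": str(n), "ReceiptHandle": m["ReceiptHandle"]}
                         for n, m in enumerate(messages[i:i + 10])],
            )
    if not queue_empty:
        # Step 4: more work is waiting, so start another run asynchronously.
        boto3.client("lambda").invoke(
            FunctionName=context.function_name, InvocationType="Event")
```

Deleting messages only after `process` returns means a failed run leaves them on the queue to reappear after the visibility timeout, so that timeout should be longer than one run is expected to take.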

Things to keep in mind on the above:

  1. As written, it's not parallel, so if your Lambda drains the queue more slowly than the queue fills up, you'll need to either (a) run more frequently or (b) insert a load-balancing step: a Lambda that is triggered on a schedule, polls the queue, and invokes as many processing Lambdas as necessary so that each one gets X notifications.

  2. SNS in general, and SQS standard (non-FIFO) queues specifically, don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
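
Both caveats can be illustrated with a few small helpers. This is a sketch under assumptions: the `raw/` and `parquet/` prefixes and all function names are hypothetical. `chunks` splits the polled notifications into groups of at most X for separate processing Lambdas, `dedupe` collapses duplicates seen within one run, and `output_key` derives the output location deterministically from the input so that a duplicate seen in a later run overwrites the same object instead of creating a second copy.

```python
import posixpath
from itertools import islice


def dedupe(pairs):
    """Drop repeated (bucket, key) pairs while keeping first-seen order."""
    return list(dict.fromkeys(pairs))


def chunks(items, size):
    """Yield consecutive lists of at most `size` items, one per processing Lambda."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def output_key(raw_key, src_prefix="raw/", dst_prefix="parquet/"):
    """Map an input key to a fixed output key so reprocessing is idempotent."""
    stem = raw_key[len(src_prefix):] if raw_key.startswith(src_prefix) else raw_key
    return dst_prefix + posixpath.splitext(stem)[0] + ".parquet"
```

For example, `output_key("raw/2024/a.csv")` gives `"parquet/2024/a.parquet"` no matter how many times the notification for that file is delivered.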

Hook your Lambda up to a webhook (API Gateway) and then just call it from your client app once the client app is done.
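
A handler for that webhook might look like the sketch below. It assumes an API Gateway proxy integration, and the request shape (a JSON body with `bucket` and `keys` fields) is a hypothetical contract between the client app and the Lambda, not something from the original answer; `process` again stands in for the schema and Parquet step.

```python
import json


def process(bucket, keys):
    """Placeholder for the schema + Parquet step; returns how many keys it handled."""
    return len(keys)


def lambda_handler(event, context):
    """Handle one API Gateway proxy request sent after the client finishes uploading."""
    try:
        payload = json.loads(event.get("body") or "{}")
        bucket, keys = payload["bucket"], payload["keys"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or missing fields: reject without processing anything.
        return {"statusCode": 400,
                "body": json.dumps({"error": "expected JSON with 'bucket' and 'keys'"})}
    processed = process(bucket, keys)
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}
```

The client would upload all its files and then send a single POST listing them, so the Lambda runs once per batch rather than once per file.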

Solutions:

  1. Zip all the files together and have the Lambda unzip the archive.
  2. Write UI code that sends the files one by one and triggers the Lambda once the last one has been sent.
  3. Have the Lambda check for the files: if it doesn't find all of them, it quits silently; if it finds all of them, it handles them all in one thread.
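
Option 3 could be sketched as below. The answer does not say how the Lambda knows which files make up a complete set, so the `expected_keys` list is an assumption, and `missing_keys` and `handle` are hypothetical helper names. The `s3` argument is expected to be a boto3 S3 client (`boto3.client("s3")`).

```python
def missing_keys(s3, bucket, prefix, expected_keys):
    """Return the expected keys that are not yet present under `prefix`."""
    present = set()
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        present.update(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        # Listings are paginated; keep going until the last page.
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]
    return sorted(set(expected_keys) - present)


def handle(s3, bucket, prefix, expected_keys, process):
    """Run `process` only when every expected file exists; otherwise quit silently."""
    if missing_keys(s3, bucket, prefix, expected_keys):
        return False  # a later S3 notification will trigger the check again
    process(expected_keys)
    return True
```

Every per-file notification still invokes the Lambda, but only an invocation that sees the full set does any work. Two invocations that start close together can both see the complete set, so the processing step should be safe to run twice.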

