
Rate-Limiting / Throttling SQS Consumer in conjunction with Step-Functions

Given the following architecture:

[architecture diagram]

The issue is that we hit throttling due to the limit on concurrent Lambda executions (1,000 per account).

How can this be addressed or circumvented?

We want full control over the rate-limiting.

1) Request a concurrency increase

This would probably be the easiest solution, but it only raises the ceiling for the potential workload. It doesn't resolve the root cause, nor does it give us any flexibility or room for custom rate-limiting.

2) Rate Limiting API

This would only address one component, as the API is not the only trigger of the step functions. Besides, it impacts the clients, as they would receive a 4xx response.
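For illustration, a minimal sketch of such throttling via a usage plan, assuming an already deployed API; the API ID, stage, and limits are placeholders. Requests above the limit are rejected by API Gateway with a 429 before they ever reach the step functions:

```python
import boto3

apigateway = boto3.client("apigateway")

# Hypothetical values -- replace with your own API ID and stage.
API_ID = "a1b2c3d4e5"
STAGE = "prod"

# A usage plan caps the steady-state rate (requests/second) and the burst
# for the stage; excess requests are answered with HTTP 429.
apigateway.create_usage_plan(
    name="sfn-trigger-throttle",
    apiStages=[{"apiId": API_ID, "stage": STAGE}],
    throttle={"rateLimit": 100.0, "burstLimit": 200},
)
```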

3) Adding SQS in front of SFN

This will be one of our choices nevertheless, as it is always good to have a queue in front of such a volume of events. However, a plain queue by itself does not provide rate-limiting. Since SQS can't be configured to execute SFN directly, a Lambda in between is required, which then starts the SFN execution in code. Without any additional logic this would not solve the concurrency issue.
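A minimal sketch of such an in-between consumer Lambda, assuming the state machine ARN is provided via an environment variable (a placeholder of this sketch, not part of the original setup):

```python
import os

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder -- the target state machine ARN comes from configuration.
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]

def handler(event, context):
    # One SQS batch may contain several messages; each becomes one SFN execution.
    for record in event["Records"]:
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=record["body"],  # assumes the message body is already JSON
        )
```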

4) FIFO-SQS in front of SFN

Something along the lines of what this blog post explains. Summary: by using virtually grouped items we can bound the number of items being processed in parallel. While this solution works quite well for their use case, I am not convinced it is a good approach for ours, because the SQS consumer is not the indicator of the workload; it only triggers the step functions. With uneven workloads this is not optimal, as it would be better to distribute the concurrency by actual workload rather than by chance.
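For reference, a minimal sketch of the producer side of that idea, assuming a FIFO queue and a fixed pool of message group IDs; the queue URL and pool size are placeholders. FIFO queues deliver at most one in-flight message per group, so the pool size effectively caps concurrency:

```python
import hashlib
import json

import boto3

sqs = boto3.client("sqs")

# Assumed values: the FIFO queue URL and the desired concurrency cap.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/sfn-trigger.fifo"
MAX_CONCURRENCY = 10  # number of "virtual" groups, i.e. parallel consumers

def enqueue(payload: dict, dedup_id: str) -> None:
    # Hash each message onto one of MAX_CONCURRENCY group IDs; messages in the
    # same group are processed strictly one after another.
    group = int(hashlib.sha256(dedup_id.encode()).hexdigest(), 16) % MAX_CONCURRENCY
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
        MessageGroupId=f"group-{group}",
        MessageDeduplicationId=dedup_id,
    )
```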

5) Kinesis Data Stream

By using a Kinesis data stream with a predefined number of shards and batch sizes we can implement the rate-limiting logic. However, this leaves us with the exact same issues described in (3).
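As a sketch of where the knobs sit, assuming a stream and consumer function with placeholder names: the shard count times the mapping's ParallelizationFactor bounds the number of concurrent consumer invocations, and BatchSize controls how many records each invocation receives.

```python
import boto3

lambda_client = boto3.client("lambda")

# Assumed ARN and function name -- replace with the real stream and consumer.
STREAM_ARN = "arn:aws:kinesis:eu-west-1:123456789012:stream/sfn-events"

# With N shards and ParallelizationFactor=1, at most N consumer invocations
# run concurrently.
lambda_client.create_event_source_mapping(
    EventSourceArn=STREAM_ARN,
    FunctionName="sfn-trigger-consumer",
    StartingPosition="LATEST",
    BatchSize=50,
    ParallelizationFactor=1,
)
```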

6) Provisioned Concurrency

Assuming we have an SQS queue in front of the SFN, the SQS consumer can be configured with a fixed provisioned concurrency. The value could be calculated from the account's maximum allowed concurrency in combination with the number of parallel tasks of the step functions. It looks like we can find a proper value here. But once the quota is reached, SQS will still retry to deliver messages, and once the maximum receive count is reached the message ends up in the DLQ. This blog post explains it quite well.
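As a side note on how such a cap could be expressed: the throttle-retry-DLQ behaviour described above corresponds to a hard limit on the consumer Lambda's concurrency, which in a minimal sketch could be set as reserved concurrency (function name and number are placeholders, and this is one possible way to impose the cap, not necessarily the one meant above):

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder value, e.g. account limit (1000) divided by the number of
# parallel SFN tasks each execution fans out to.
CONSUMER_CONCURRENCY_CAP = 100

# With the consumer capped, excess SQS deliveries get throttled, retried,
# and eventually redriven to the DLQ once maxReceiveCount is exceeded.
lambda_client.put_function_concurrency(
    FunctionName="sfn-trigger-consumer",
    ReservedConcurrentExecutions=CONSUMER_CONCURRENCY_CAP,
)
```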

7) EventSourceMapping toggle via CloudWatch metrics (a sort of circuit breaker)

Assuming we have an SQS queue in front of SFN and a consumer Lambda, we could create CloudWatch metrics and trigger the execution of a Lambda once a metric threshold is hit. That event Lambda could then temporarily disable the event source mapping between the SQS queue and the consumer Lambda. Once the workload of the system eases, another event could be sent to re-enable the mapping. Something like:

[diagram of the circuit-breaker setup]
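A minimal sketch of that toggle Lambda, assuming the alarm notification carries its state and the mapping UUID is known in advance (both assumptions of this sketch):

```python
import boto3

lambda_client = boto3.client("lambda")

# Assumed: the UUID of the SQS -> consumer-lambda event source mapping,
# e.g. stored in an environment variable or SSM parameter.
MAPPING_UUID = "14e0db71-xxxx-xxxx-xxxx-832e507ab8c5"

def handler(event, context):
    # The triggering event (e.g. a CloudWatch alarm forwarded via SNS or
    # EventBridge) is assumed to carry the alarm state:
    # ALARM -> disable the mapping, OK -> re-enable it.
    enable = event.get("alarmState") == "OK"
    lambda_client.update_event_source_mapping(
        UUID=MAPPING_UUID,
        Enabled=enable,
    )
```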

However, I wasn't able to determine proper metrics to react on before the throttling kicks in. Additionally, CloudWatch metrics operate on one-minute periods, so the event might already come too late.

8) ???

The question itself is a nice overview of all the major options. Well done.

You could implement throttling directly with API Gateway. This is the easiest option if you can afford to reject the client every once in a while.

If you need stream and buffer control, go for Kinesis. You can even put all your events in an S3 bucket and trigger Lambdas or Step Functions when a new event has been stored (more here). Yes, you will ingest events differently, and you will need a bridge Lambda function to trigger Step Functions based on Kinesis events, but that is relatively low implementation effort.
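A minimal sketch of such a bridge Lambda, assuming Kinesis as the source and the state machine ARN in an environment variable (a placeholder of this sketch):

```python
import base64
import os

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder -- the target state machine ARN comes from configuration.
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]

def handler(event, context):
    # Kinesis delivers record payloads base64-encoded; each record becomes
    # one Step Functions execution.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN, input=payload)
```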
