AWS Autoscaling on CloudWatch SQS metric problem

Original post: 2020-12-04 13:58:58 | Tags: amazon-web-services, amazon-sqs, autoscaling, aws-auto-scaling

In my AWS account I have an ASG set up for my SQS consumer. It has a min capacity of 3 and a max capacity of 8. The termination policy is set to "default". It has 2 simple scaling policies which are attached to a CloudWatch alarm that monitors the size of the SQS queue.

Here is the threshold for the CloudWatch alarm: ApproximateNumberOfMessagesVisible >= 10 for 1 consecutive periods of 300 seconds for the metric dimensions.

When the CloudWatch alarm state is "alarming" after 300 seconds, the ASG adds 1 instance until it hits the max capacity. Likewise, when the CloudWatch alarm state is "ok" after 300 seconds, the ASG removes 1 instance until it hits the min capacity.

The ASG seems to scale up to max capacity with no issues. The problem I'm running into, however, occurs when the ASG scales back down. When the alarm state goes from "alarming" back to "ok", the ASG just seems to randomly pick an instance to shut down. This is a problem if the instance it is shutting down is currently processing an SQS message.

For example, if my SQS queue has 20 visible messages then my ASG will scale up, let's say to 8. Once the visible messages are below or equal to 10, the ASG will start to terminate instances. But it might pick an instance which is processing an SQS message. If it does, then that SQS message goes into my DLQ.

Has anyone run into this issue before?

Is there a way to configure the ASG to monitor the SQS length and only terminate instances which have finished processing a message? Maybe when the SQS alarm is "ok" and the instance has low CPU? Or should I be setting the threshold in my CloudWatch alarm to something like 2?

1 Answer

Your app needs to explicitly tell the ASG that an instance cannot currently be killed. Check out the docs for Instance scale-in protection.

You need to do something like this before starting to process the message:

aws autoscaling set-instance-protection --instance-ids i-5f2e8a0d --auto-scaling-group-name my-asg --protected-from-scale-in

Then process your message from the protected instance i-5f2e8a0d in Auto Scaling group my-asg. Finally, deactivate instance protection when you're done processing with:

aws autoscaling set-instance-protection --instance-ids i-5f2e8a0d --auto-scaling-group-name my-asg --no-protected-from-scale-in

Once a machine is protected, the ASG will be unable to terminate it. Once the protection is turned off, the instance is available to be terminated and autoscaling will continue to scale as expected. If all the instances are protected, autoscaling will not terminate any instances (so be careful to always turn off instance protection, or you might get stuck fully scaled up).
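To reduce the risk of getting stuck with protection left on, one option is to wrap the two commands around the message handler so the "off" call runs whether the handler succeeds or fails. The following is only a rough sketch, not a tested production script: the function names set_protection and with_scale_in_protection are made up for illustration, and it assumes ASG_NAME and INSTANCE_ID are already set in the environment.

```shell
#!/usr/bin/env bash
# Sketch: hold scale-in protection only while one message is being processed.
# ASG_NAME and INSTANCE_ID are assumed to be set by the caller (hypothetical).

set_protection() {
  # $1 is --protected-from-scale-in or --no-protected-from-scale-in
  aws autoscaling set-instance-protection \
    --instance-ids "$INSTANCE_ID" \
    --auto-scaling-group-name "$ASG_NAME" \
    "$1"
}

with_scale_in_protection() {
  # Usage: with_scale_in_protection <command> [args...]
  # Refuse to start the handler if protection could not be enabled.
  set_protection --protected-from-scale-in || return 1

  local status=0
  "$@" || status=$?   # run the handler and remember its exit status

  # Always try to release protection, even if the handler failed.
  if ! set_protection --no-protected-from-scale-in; then
    echo "warning: could not remove scale-in protection from $INSTANCE_ID" >&2
    return 1
  fi
  return "$status"
}
```

A consumer loop could then call something like with_scale_in_protection ./process_message.sh "$MESSAGE_BODY", where process_message.sh and MESSAGE_BODY are placeholders for your own handler and input. Note that this does not cover the process being killed outright (for example with SIGKILL) while the handler runs, so a periodic check for instances left protected may still be worth having.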