Configure an SQS dead-letter queue to raise a CloudWatch alarm on receiving a message
I was working with a dead-letter queue in Amazon SQS. I want a CloudWatch alarm to be raised whenever the queue receives a new message. The problem is that I configured an alarm on the queue's NumberOfMessagesSent metric, but this metric does not behave as expected for dead-letter queues, as noted in the Amazon SQS Dead-Letter Queues - Amazon Simple Queue Service documentation.

Some suggestions were to use NumberOfMessagesVisible instead, but I am not sure how to configure that in an alarm. If I alarm on this metric being > 0, that is not the same as the queue receiving a new message: if an old message is still sitting in the queue, the metric value will always be > 0. I could build some kind of metric math expression to compute the delta of this metric over a defined period (say, a minute), but I am looking for a better solution.
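For illustration, the delta idea mentioned above can be expressed directly as a metric math alarm in CloudFormation using the DIFF function (the change of a series versus its previous datapoint). This is only a sketch: MyDeadLetterQueue and MyAlarmTopic are placeholder names, and note that DIFF also goes negative when messages are consumed, so this only alarms on increases.

```yaml
DLQDeltaAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Visible message count in the DLQ increased during the last period"
    Metrics:
      - Id: e1
        Expression: "DIFF(m1)" # change vs. the previous datapoint
        Label: "DeltaVisible"
        ReturnData: true
      - Id: m1
        MetricStat:
          Metric:
            Namespace: "AWS/SQS"
            MetricName: "ApproximateNumberOfMessagesVisible"
            Dimensions:
              - Name: "QueueName"
                Value: !GetAtt MyDeadLetterQueue.QueueName # placeholder queue resource
          Period: 60
          Stat: Maximum
        ReturnData: false
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
    EvaluationPeriods: 1
    AlarmActions:
      - !Ref MyAlarmTopic # placeholder SNS topic
```

Several of the answers below take the same general approach with different metrics or expressions.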
I struggled with the same problem, and the answer for me was to use NumberOfMessagesSent instead. Then I could set my criteria for new messages that came in during my configured period of time. Here is what worked for me in CloudFormation.

Note that individual alarms do not fire again if the alarm stays in the ALARM state due to constant failure. You can set up another alarm, using the same method, to catch those; e.g. alarm when 100 errors occur in 1 hour.
Update: Because the NumberOfMessagesReceived and NumberOfMessagesSent metrics depend on how the message is queued, I have devised a new solution for our needs using the ApproximateNumberOfMessagesDelayed metric, after adding a delay to the DLQ settings. If you are adding messages to the queue manually, then NumberOfMessagesReceived will work. Otherwise, use ApproximateNumberOfMessagesDelayed after setting up a delay.
MyDeadLetterQueue:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 1209600 # 14 days
    DelaySeconds: 60 # for alarms

DLQthresholdAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Alarm dlq messages when we have 1 or more failed messages in 10 minutes"
    Namespace: "AWS/SQS"
    MetricName: "ApproximateNumberOfMessagesDelayed"
    Dimensions:
      - Name: "QueueName"
        Value:
          Fn::GetAtt:
            - "MyDeadLetterQueue"
            - "QueueName"
    Statistic: "Sum"
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    AlarmActions:
      - !Ref MyAlarmTopic
It is difficult to achieve exactly what the question asks. But if the purpose of the CloudWatch alarm is to send an email or notify users about a DLQ message arrival, you can achieve a similar result with SQS, SNS, and Lambda. And in CloudWatch you can watch how the DLQ message count grows over time whenever you receive an email.
#!/usr/bin/python3
import boto3

def lambda_handler(event, context):
    # Forward every DLQ record that triggered this invocation
    for record in event['Records']:
        send_request(record["body"])

def send_request(body):
    # Create SNS client
    sns = boto3.client('sns')
    # Publish the message body to the specified SNS topic
    response = sns.publish(
        TopicArn="YOUR_TOPIC_ARN",  # replace with your SNS topic ARN
        Message=body,
    )
    # Print out the response
    print(response)
We had the same issue and solved it by using two metrics and creating a math expression.
ConsentQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: "queue"
    RedrivePolicy:
      deadLetterTargetArn:
        Fn::GetAtt:
          - "DLQ"
          - "Arn"
      maxReceiveCount: 3 # after 3 tries the event will go to DLQ
    VisibilityTimeout: 65

DLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: "DLQ"

DLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "SQS failed"
    AlarmName: "SQSAlarm"
    Metrics:
      - Expression: "m2-m1"
        Id: "e1"
        Label: "ChangeInAmountVisible"
        ReturnData: true
      - Id: "m1"
        Label: "MessagesVisibleMin"
        MetricStat:
          Metric:
            Dimensions:
              - Name: QueueName
                Value: !GetAtt DLQ.QueueName
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: "AWS/SQS"
          Period: 300 # evaluate minimum over period of 5 min
          Stat: Minimum
          Unit: Count
        ReturnData: false
      - Id: "m2"
        Label: "MessagesVisibleMax"
        MetricStat:
          Metric:
            Dimensions:
              - Name: QueueName
                Value: !GetAtt DLQ.QueueName
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: "AWS/SQS"
          Period: 300 # evaluate maximum over period of 5 min
          Stat: Maximum
          Unit: Count
        ReturnData: false
    ComparisonOperator: GreaterThanOrEqualToThreshold
    Threshold: 1
    DatapointsToAlarm: 1
    EvaluationPeriods: 1
The period is important: the minimum and maximum need to be evaluated over a long enough window for the difference to capture an arrival.
I've encountered the same issue with CloudWatch alarms not firing when queue entries automatically flow into a DLQ, and I believe I have come up with a solution.

You need to set up:

This should now, on a periodic basis, check the change in the number of entries in the DLQ, regardless of how they got there, so we can get past problematic metrics like NumberOfMessagesSent or NumberOfMessagesReceived.

UPDATE: I just realised this is the exact solution that Lucasz mentioned above, so consider this a confirmation that it works :)
Here is a working Terraform example of the RATE(m1+m2) approach mentioned above:
resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
  alarm_name                = "alarm_name"
  comparison_operator       = "GreaterThanThreshold"
  evaluation_periods        = "1"
  threshold                 = "0"
  alarm_description         = "desc"
  insufficient_data_actions = []
  alarm_actions             = [aws_sns_topic.sns.arn]

  metric_query {
    id          = "e1"
    expression  = "RATE(m2+m1)"
    label       = "Error Rate"
    return_data = "true"
  }

  metric_query {
    id = "m1"

    metric {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      period      = "60"
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = "${aws_sqs_queue.sqs-dlq.name}"
      }
    }
  }

  metric_query {
    id = "m2"

    metric {
      metric_name = "ApproximateNumberOfMessagesNotVisible"
      namespace   = "AWS/SQS"
      period      = "60"
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = "${aws_sqs_queue.sqs-dlq.name}"
      }
    }
  }
}
What you can do is create a Lambda with your DLQ as its event source. From the Lambda you can post custom metric data to CloudWatch, and the alarm will be triggered when your data meets its conditions.

Use this reference to configure your Lambda so that it gets triggered when a message is sent to your DLQ: Using AWS Lambda with Amazon SQS - AWS Lambda

Here is a good explanation, with code, of how to post custom metrics from Lambda to CloudWatch: Sending CloudWatch Custom Metrics From Lambda With Code Examples

Once the metrics are posted, the CloudWatch alarm will trigger when they match its conditions.
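As a minimal sketch of that approach (the metric name `DLQMessagesArrived` and namespace `Custom/DLQ` are made-up examples, not taken from the linked article):

```python
def build_metric_data(queue_name, count):
    # One custom datum per invocation: how many messages just landed in the DLQ.
    return [{
        "MetricName": "DLQMessagesArrived",  # hypothetical custom metric name
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Unit": "Count",
        "Value": float(count),
    }]

def lambda_handler(event, context):
    import boto3  # imported here so build_metric_data stays testable without the SDK
    cloudwatch = boto3.client("cloudwatch")
    # Publish the custom metric; the CloudWatch alarm then watches Custom/DLQ.
    cloudwatch.put_metric_data(
        Namespace="Custom/DLQ",  # hypothetical namespace
        MetricData=build_metric_data("my-dlq", len(event["Records"])),
    )
```

The alarm would then be configured on `Custom/DLQ` / `DLQMessagesArrived` with a threshold of 0 and GreaterThanThreshold, which fires on every batch the Lambda processes.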
I used the metric math function RATE to trigger an alarm whenever a message arrives in the dead-letter queue.

Select the two metrics ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible for your dead-letter queue.

Configure the metric expression as RATE(m1+m2), set the threshold to 0, and select GreaterThanThreshold as the comparison operator.

m1+m2 is the total number of messages in the queue at a given time. Whenever a new message arrives in the queue, the rate of this expression goes above zero. That's how it works.
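To see why this fires exactly on arrival: RATE is the change between consecutive datapoints divided by the period in seconds, so it is zero while the total is flat and positive the moment the total grows. A quick sketch with made-up datapoints:

```python
def rate(previous, latest, period_seconds):
    # CloudWatch metric math RATE: change per second between consecutive datapoints
    return (latest - previous) / period_seconds

# m1+m2 (total messages in the DLQ) stays flat while nothing arrives...
print(rate(3, 3, 60))  # 0.0 -> no alarm
# ...and goes positive in the period a new message lands in the DLQ.
print(rate(3, 4, 60))  # > 0 -> alarm (GreaterThanThreshold with threshold 0)
```

Note the rate also goes negative when messages are redriven or deleted, which is below the threshold and therefore does not alarm.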