Configure an SQS dead-letter queue to raise a CloudWatch alarm on receiving a message
I was working with a dead-letter queue in Amazon SQS. I want a CloudWatch alarm to be raised whenever the queue receives a new message. The problem is that I configured an alarm on the queue's NumberOfMessagesSent metric, but this metric does not behave as expected for dead-letter queues, as noted in the Amazon SQS Dead-Letter Queues - Amazon Simple Queue Service documentation.

Some suggestions were to use NumberOfMessagesVisible instead, but I am not sure how to configure that in an alarm. If I alarm on this metric being > 0, that is not the same as the queue receiving a new message: if an old message is still sitting in the queue, the metric value will always be > 0. I could build some kind of metric math expression to compute the delta of this metric over a defined period (say, a minute), but I am looking for a better solution.
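For illustration, the delta idea mentioned above can be expressed directly as a metric math alarm in CloudFormation using the DIFF function (the change of a series versus its previous datapoint). This is only a sketch: MyDeadLetterQueue and MyAlarmTopic are placeholder names, and note that DIFF also goes negative when messages are consumed, so this only alarms on increases.

```yaml
DLQDeltaAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Visible message count in the DLQ increased during the last period"
    Metrics:
      - Id: e1
        Expression: "DIFF(m1)" # change vs. the previous datapoint
        Label: "DeltaVisible"
        ReturnData: true
      - Id: m1
        MetricStat:
          Metric:
            Namespace: "AWS/SQS"
            MetricName: "ApproximateNumberOfMessagesVisible"
            Dimensions:
              - Name: "QueueName"
                Value: !GetAtt MyDeadLetterQueue.QueueName # placeholder queue resource
          Period: 60
          Stat: Maximum
        ReturnData: false
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
    EvaluationPeriods: 1
    AlarmActions:
      - !Ref MyAlarmTopic # placeholder SNS topic
```

Several of the answers below take the same general approach with different metrics or expressions.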
I struggled with the same problem, and the answer for me was to use NumberOfMessagesSent instead. Then I could set my criteria for new messages that came in during my configured period of time. Here is what worked for me in CloudFormation.

Note that individual alarms do not fire again if the alarm stays in the ALARM state due to constant failure. You can set up another alarm, using the same method, to catch those; e.g. alarm when 100 errors occur in 1 hour.
Update: Because the NumberOfMessagesReceived and NumberOfMessagesSent metrics depend on how the message is queued, I have devised a new solution for our needs using the ApproximateNumberOfMessagesDelayed metric, after adding a delay to the DLQ settings. If you are adding messages to the queue manually, then NumberOfMessagesReceived will work. Otherwise, use ApproximateNumberOfMessagesDelayed after setting up a delay.
MyDeadLetterQueue:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 1209600 # 14 days
    DelaySeconds: 60 # for alarms

DLQthresholdAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Alarm dlq messages when we have 1 or more failed messages in 10 minutes"
    Namespace: "AWS/SQS"
    MetricName: "ApproximateNumberOfMessagesDelayed"
    Dimensions:
      - Name: "QueueName"
        Value:
          Fn::GetAtt:
            - "MyDeadLetterQueue"
            - "QueueName"
    Statistic: "Sum"
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    AlarmActions:
      - !Ref MyAlarmTopic
It is difficult to achieve exactly what the question asks. But if the purpose of the CloudWatch alarm is to send an email or notify users about a DLQ message arrival, you can achieve a similar result with SQS, SNS, and Lambda. And in CloudWatch you can watch how the DLQ message count grows over time whenever you receive an email.
#!/usr/bin/python3
import boto3

def lambda_handler(event, context):
    # Forward every DLQ record that triggered this invocation
    for record in event['Records']:
        send_request(record["body"])

def send_request(body):
    # Create SNS client
    sns = boto3.client('sns')
    # Publish the message body to the specified SNS topic
    response = sns.publish(
        TopicArn="YOUR_TOPIC_ARN",  # replace with your SNS topic ARN
        Message=body,
    )
    # Print out the response
    print(response)
We had the same issue and solved it by using two metrics and creating a math expression.
ConsentQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: "queue"
    RedrivePolicy:
      deadLetterTargetArn:
        Fn::GetAtt:
          - "DLQ"
          - "Arn"
      maxReceiveCount: 3 # after 3 tries the event will go to DLQ
    VisibilityTimeout: 65

DLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: "DLQ"

DLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "SQS failed"
    AlarmName: "SQSAlarm"
    Metrics:
      - Expression: "m2-m1"
        Id: "e1"
        Label: "ChangeInAmountVisible"
        ReturnData: true
      - Id: "m1"
        Label: "MessagesVisibleMin"
        MetricStat:
          Metric:
            Dimensions:
              - Name: QueueName
                Value: !GetAtt DLQ.QueueName
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: "AWS/SQS"
          Period: 300 # evaluate minimum over period of 5 min
          Stat: Minimum
          Unit: Count
        ReturnData: false
      - Id: "m2"
        Label: "MessagesVisibleMax"
        MetricStat:
          Metric:
            Dimensions:
              - Name: QueueName
                Value: !GetAtt DLQ.QueueName
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: "AWS/SQS"
          Period: 300 # evaluate maximum over period of 5 min
          Stat: Maximum
          Unit: Count
        ReturnData: false
    ComparisonOperator: GreaterThanOrEqualToThreshold
    Threshold: 1
    DatapointsToAlarm: 1
    EvaluationPeriods: 1
The period is important: the minimum and maximum need to be evaluated over a long enough window for the difference to capture an arrival.
I've encountered the same issue with CloudWatch alarms not firing when queue entries automatically flow into a DLQ, and I believe I have come up with a solution.

You need to set up:

This should now, on a periodic basis, check the change in the number of entries in the DLQ, regardless of how they got there, so we can get past problematic metrics like NumberOfMessagesSent or NumberOfMessagesReceived.

UPDATE: I just realised this is the exact solution that Lucasz mentioned above, so consider this a confirmation that it works :)
Here is a working Terraform example of the RATE(m1+m2) approach mentioned above:
resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
  alarm_name                = "alarm_name"
  comparison_operator       = "GreaterThanThreshold"
  evaluation_periods        = "1"
  threshold                 = "0"
  alarm_description         = "desc"
  insufficient_data_actions = []
  alarm_actions             = [aws_sns_topic.sns.arn]

  metric_query {
    id          = "e1"
    expression  = "RATE(m2+m1)"
    label       = "Error Rate"
    return_data = "true"
  }

  metric_query {
    id = "m1"

    metric {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      period      = "60"
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = "${aws_sqs_queue.sqs-dlq.name}"
      }
    }
  }

  metric_query {
    id = "m2"

    metric {
      metric_name = "ApproximateNumberOfMessagesNotVisible"
      namespace   = "AWS/SQS"
      period      = "60"
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = "${aws_sqs_queue.sqs-dlq.name}"
      }
    }
  }
}
What you can do is create a Lambda with your DLQ as its event source. From the Lambda you can post custom metric data to CloudWatch, and the alarm will be triggered when your data meets its conditions.

Use this reference to configure your Lambda so that it gets triggered when a message is sent to your DLQ: Using AWS Lambda with Amazon SQS - AWS Lambda

Here is a good explanation, with code, of how to post custom metrics from Lambda to CloudWatch: Sending CloudWatch Custom Metrics From Lambda With Code Examples

Once the metrics are posted, the CloudWatch alarm will trigger when they match its conditions.
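As a minimal sketch of that approach (the metric name `DLQMessagesArrived` and namespace `Custom/DLQ` are made-up examples, not taken from the linked article):

```python
def build_metric_data(queue_name, count):
    # One custom datum per invocation: how many messages just landed in the DLQ.
    return [{
        "MetricName": "DLQMessagesArrived",  # hypothetical custom metric name
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Unit": "Count",
        "Value": float(count),
    }]

def lambda_handler(event, context):
    import boto3  # imported here so build_metric_data stays testable without the SDK
    cloudwatch = boto3.client("cloudwatch")
    # Publish the custom metric; the CloudWatch alarm then watches Custom/DLQ.
    cloudwatch.put_metric_data(
        Namespace="Custom/DLQ",  # hypothetical namespace
        MetricData=build_metric_data("my-dlq", len(event["Records"])),
    )
```

The alarm would then be configured on `Custom/DLQ` / `DLQMessagesArrived` with a threshold of 0 and GreaterThanThreshold, which fires on every batch the Lambda processes.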
I used the metric math function RATE to trigger an alarm whenever a message arrives in the dead-letter queue.

Select the two metrics ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible for your dead-letter queue.

Configure the metric expression as RATE(m1+m2), set the threshold to 0, and select GreaterThanThreshold as the comparison operator.

m1+m2 is the total number of messages in the queue at a given time. Whenever a new message arrives in the queue, the rate of this expression goes above zero. That's how it works.
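To see why this fires exactly on arrival: RATE is the change between consecutive datapoints divided by the period in seconds, so it is zero while the total is flat and positive the moment the total grows. A quick sketch with made-up datapoints:

```python
def rate(previous, latest, period_seconds):
    # CloudWatch metric math RATE: change per second between consecutive datapoints
    return (latest - previous) / period_seconds

# m1+m2 (total messages in the DLQ) stays flat while nothing arrives...
print(rate(3, 3, 60))  # 0.0 -> no alarm
# ...and goes positive in the period a new message lands in the DLQ.
print(rate(3, 4, 60))  # > 0 -> alarm (GreaterThanThreshold with threshold 0)
```

Note the rate also goes negative when messages are redriven or deleted, which is below the threshold and therefore does not alarm.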