
Parallel polling of the AWS SQS standard queue - Message processing is too slow

I have a module that polls an AWS SQS queue at specified intervals, one message at a time, with a ReceiveMessageRequest. The following is the method:

public static ReceiveMessageResult receiveMessageFromQueue() {

    // Look up the queue URL, then long-poll for up to 10 seconds for a single message
    String targetedQueueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();
    ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(targetedQueueUrl)
            .withWaitTimeSeconds(10).withMaxNumberOfMessages(1);
    return sqsClient.receiveMessage(receiveMessageRequest);
}

Once a message is received and processed, it is deleted from the queue with a DeleteMessageRequest (returning a DeleteMessageResult):

public static DeleteMessageResult deleteMessageFromQueue(String receiptHandle) {

    // The receipt handle comes from the ReceiveMessageResult of the fetch that returned the message
    log.info("Deleting Message with receipt handle - [{}]", receiptHandle);
    String targetedQueueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();
    return sqsClient.deleteMessage(new DeleteMessageRequest(targetedQueueUrl, receiptHandle));
}
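
For context, the calling side looks roughly like this (simplified; process() here stands in for the actual work done on a message):

ReceiveMessageResult result = receiveMessageFromQueue();
for (Message message : result.getMessages()) {
    process(message.getBody());                         // application-specific processing
    deleteMessageFromQueue(message.getReceiptHandle()); // acknowledge by deleting
}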

I've created an executable jar file that is deployed on around 40 instances, all actively polling the queue. I can see that each of them receives messages, but in the AWS SQS console the 'Messages in Flight' column only ever shows 0, 1, 2, or 3. Why is that, even when 40+ different consumers are receiving messages from the queue? The number of messages available in the queue also decreases very slowly.

The following are the configuration parameters of the queue:

Default Visibility Timeout: 30 seconds
Message Retention Period:   4 days
Maximum Message Size:   256 KB
Receive Message Wait Time:  0 seconds
Messages Available (Visible):   4,776
Delivery Delay: 0 seconds
Messages in Flight (Not Visible):   2
Queue Type: Standard
Messages Delayed:   0
Content-Based Deduplication:    N/A 

Why are the messages not being processed quickly even when there are multiple consumers? Do I need to modify any of the queue parameters, or something in the receive-message/delete-message requests? Please advise.

UPDATE:

All the EC2 instances and the SQS queue are in the same region. The consumer (the jar file that polls the queue) runs as part of the EC2 instance's start-up script, with a scheduled task that polls the queue every 12 seconds. Before I push messages to the queue I spin up 2-3 instances (we may also have some already-running instances at that time, which adds to the number of receivers for the queue, capped at 50). On receiving a message, a consumer does some work (including DB operations, data analysis and calculations, report file generation, and uploading the report to S3), which takes approximately 10-12 seconds; after that is done it deletes the message from the queue. The image below is a screenshot of the SQS metrics for the last week (from the SQS monitoring console).

[Screenshot: SQS metrics of the target queue for the last week]

I'll do the best I can with the information given. More detail about your processing-loop logic, region setup, and metrics (see below) would help improve this answer. To recap your question:

I've created an executable jar file that is deployed on around 40 instances, all actively polling the queue. I can see that each of them receives messages, but in the AWS SQS console the 'Messages in Flight' column only ever shows 0, 1, 2, or 3. Why is that, even when 40+ different consumers are receiving messages from the queue? The number of messages available in the queue also decreases very slowly.

Why are the messages not being processed quickly even when there are multiple consumers? Do I need to modify any of the queue parameters, or something in the receive-message/delete-message requests?

The fact that you're not seeing in-flight numbers that correspond more closely to the number of hosts processing messages definitely points to a problem: either your message processing is blazing fast (which doesn't seem to be the case) or your hosts aren't doing the work you think they are.
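
A rough sanity check with the numbers from your update (my arithmetic, so treat it as an estimate): each host polls every ~12 seconds and spends ~10-12 seconds processing before deleting, so each host should have a message in flight for roughly half to nearly all of its cycle. Across 40 hosts you'd expect somewhere between 40 × 11/23 ≈ 19 and 40 × 11/12 ≈ 37 messages in flight at any instant (the range depends on whether the 12-second timer overlaps processing), not the 0-3 you're observing. That gap alone suggests most hosts aren't actually receiving anything.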

In general, fetching and deleting a single message from SQS should take on the order of a few milliseconds. Without more detail on your setup, the following should get you started on troubleshooting. (Some of these steps may seem obvious, but every single one of them has been the source of real-life problems I've seen developers run into.)

  1. If you're launching a new process for each receive-process-delete cycle, that overhead will slow you down substantially. I'll assume you're not doing this, and that each host runs a loop within a single process.
  2. Verify your processing loop isn't crashing and restarting (effectively turning it into the above case).
    • I assume you've also verified your processes aren't doing a bunch of work outside of message processing.
  3. You should generate client-side metrics indicating how long the SQS requests take on each host (see the sketch after this list).
    • CloudWatch will partly do this for you, but actual client-side metrics are always useful.
    • Recommended basic metrics: (1) receive latency, (2) processing latency, (3) delete latency, (4) entire message-loop latency, (5) success/fail counters.
  4. Your EC2 instances (the hosts doing the processing) should be in the same region as the SQS queue. If you're making cross-region calls, this will impact your latency.
    • Make sure these hosts have adequate CPU/memory resources to do the processing.
    • As an optimization, I recommend using more threads per host and fewer hosts - reusing client connections and maximizing usage of your compute resources is always better.
  5. Verify there wasn't an outage or ongoing issue while you were running your test.
  6. Perform getQueueUrl just once for the lifetime of your app, during some initialization step; you don't need to call it repeatedly, as it will return the same URL (again, see the sketch below).
    • This was actually the first thing I noticed in your code, but it's way down here because the issues above will have more impact if they're the cause.
  7. If your message processing is incredibly short (less time than it takes to retrieve and delete a message), your hosts will end up spending most of their time fetching messages. Metrics on this are important too.
    • In this case, you should probably fetch messages in batches instead of one at a time, as the sketch below does.
    • Based on the number of messages in your queue and the comment that it's draining slowly, it sounds like this isn't the case.
  8. Verify all hosts are actually hitting the same queue (and not some beta/gamma version, or an older version you used for testing at one point).
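
To make points 3, 6, and 7 concrete, below is a minimal sketch of what a single-process worker could look like with the v1 AWS SDK for Java (which your code appears to use). QueueWorker, process(), the thread count, and the metric logging are illustrative placeholders, not a drop-in implementation:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueueWorker {

    private static final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    // Point 6: resolve the queue URL once at startup instead of on every call
    private static final String queueUrl = sqs.getQueueUrl("myAWSqueueName").getQueueUrl();

    public static void main(String[] args) {
        // Point 4's sub-point: several worker threads per host (count is illustrative)
        ExecutorService workers = Executors.newFixedThreadPool(8);

        while (true) {
            long receiveStart = System.nanoTime();

            // Point 7: batch up to 10 messages per call and long-poll up to
            // 20 seconds, instead of fetching one message on a fixed timer
            List<Message> messages = sqs.receiveMessage(
                    new ReceiveMessageRequest(queueUrl)
                            .withMaxNumberOfMessages(10)
                            .withWaitTimeSeconds(20))
                    .getMessages();

            // Point 3: client-side receive-latency metric
            long receiveMs = (System.nanoTime() - receiveStart) / 1_000_000;
            System.out.printf("received %d message(s) in %d ms%n", messages.size(), receiveMs);

            for (Message message : messages) {
                workers.submit(() -> {
                    long workStart = System.nanoTime();
                    process(message.getBody()); // placeholder for the DB/report/S3 work
                    sqs.deleteMessage(new DeleteMessageRequest(queueUrl, message.getReceiptHandle()));

                    // Point 3: process + delete latency metric
                    long workMs = (System.nanoTime() - workStart) / 1_000_000;
                    System.out.printf("processed and deleted in %d ms%n", workMs);
                });
            }
        }
    }

    private static void process(String body) {
        // application-specific work goes here
    }
}

One caveat with this shape: since your processing takes 10-12 seconds against a 30-second visibility timeout, you'd want to bound how much the loop fetches while the workers are busy (for example, a bounded pool with a blocking hand-off), or messages sitting in the executor's queue could become visible again and be delivered twice.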

Further note:

  • The other answer suggests visibility timeout as a potential cause - this is flat-out wrong. Visibility timeout does not block the queue; it only affects how long messages remain "in flight" before another receiveMessageRequest can receive them.
  • You'd consider reducing it if you wanted failed or slowly processed messages to be retried sooner.
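
Conversely, if your ~10-12 seconds of processing ever risks overrunning the 30-second default, the timeout can be overridden per receive call instead of changing the queue setting. A sketch against the v1 SDK, with 60 as an arbitrary example value:

ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(targetedQueueUrl)
        .withWaitTimeSeconds(10)
        .withMaxNumberOfMessages(1)
        .withVisibilityTimeout(60); // applies only to messages returned by this call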
