Amazon SQS receiveMessage stall with aws-sdk

Question

I'm using the aws-sdk Node module with what is, as far as I can tell, the approved way to poll for messages.

Which basically boils down to:

        sqs.receiveMessage({
            QueueUrl: queueUrl,
            MaxNumberOfMessages: 10,
            WaitTimeSeconds: 20
        }, function(err, data) {
            if (err) {
                logger.fatal('Error on Message Receive');
                logger.fatal(err);
            } else {
                // all good
                if (undefined === data.Messages) {
                    logger.info('No Messages Object');
                } else if (data.Messages.length > 0) {
                    logger.info('Messages Count: ' + data.Messages.length);

                    var delete_batch = [];
                    for (var x=0;x<data.Messages.length;x++) {
                        // process
                        receiveMessage(data.Messages[x]);

                        // flag to delete

                        // each batch entry is a key/value structure, so use a plain object rather than an Array
                        var pck = {
                            Id: data.Messages[x].MessageId,
                            ReceiptHandle: data.Messages[x].ReceiptHandle
                        };

                        delete_batch.push(pck);
                    }

                    if (delete_batch.length > 0) {
                        logger.info('Calling Delete');
                        sqs.deleteMessageBatch({
                            Entries: delete_batch,
                            QueueUrl: queueUrl
                        }, function(err, data) {
                            if (err) {
                                logger.fatal('Failed to delete messages');
                                logger.fatal(err);
                            } else {
                                logger.debug('Deleted received ok');
                            }
                        });
                    }
                } else {
                    logger.info('No Messages Count');
                }
            }
        });

receiveMessage is my "do stuff with collected messages if I have enough collected messages" function.

Occasionally, my script stalls because I don't get a response from Amazon at all. For example, when there are no messages in the queue to consume, instead of hitting the WaitTimeSeconds limit and returning a "no messages object" response, the callback is never called.

(I'm chalking this up to Amazon weirdness.)

What I'm asking is: what's the best way to detect and deal with this, given that I have some code in place to stop concurrent calls to receiveMessage?

The suggested answer here: Nodejs sqs queue processor also has code that prevents concurrent message request queries (granted, it's only fetching one message at a time).

I do have the whole thing wrapped in:

var running = false;
var runMonitorJob = setInterval(function() {
    if (running) {
        // the previous receive has not finished yet, so skip this tick
        return;
    }
    running = true;
    // call SQS.receive
}, 500);

(With a running = false after the delete loop, not in its callback.)

My solution would be:

var watchdogTimeout = setTimeout(function() {
    // give up waiting after 30 seconds and allow the next poll to start
    running = false;
}, 30000);

But surely this would leave a pile of floating sqs.receive calls lurking about, and thus eat a lot of memory over time?
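A minimal sketch of one way to bound the wait without stacking up requests, assuming aws-sdk v2, where receiveMessage returns an AWS.Request object that exposes abort(). withWatchdog is a hypothetical helper name, not part of the SDK. It guarantees the final callback runs exactly once, and it aborts the pending request when the deadline passes so it is not left floating:

```javascript
// Hypothetical helper (not part of aws-sdk): runs start(callback) and makes
// sure `done` is called exactly once, with the real result or a timeout error.
function withWatchdog(start, timeoutMs, done) {
    var finished = false;
    var request = null;

    var timer = setTimeout(function () {
        if (finished) { return; }
        finished = true;
        // Assumption: the object returned by start() may expose abort(),
        // as AWS.Request does in aws-sdk v2. Skip it if it does not.
        if (request && typeof request.abort === 'function') {
            request.abort();
        }
        done(new Error('No response within ' + timeoutMs + ' ms'));
    }, timeoutMs);

    request = start(function (err, data) {
        if (finished) { return; } // late callback after the watchdog fired
        finished = true;
        clearTimeout(timer);
        done(err, data);
    });
}

// Usage sketch (sqs, params and running are the names from the question):
// withWatchdog(function (cb) {
//     return sqs.receiveMessage(params, cb);
// }, 30000, function (err, data) {
//     running = false;
//     // handle err or data.Messages here
// });
```

Because the late callback is ignored and the request is aborted, resetting running inside done should not leave a second receive in flight.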

(This job runs all the time. I left it running on Friday, it stalled Saturday morning and hung until I manually restarted the job this morning.)

Edit: I have seen cases where it hangs for ~5 minutes and then suddenly gets messages, but with a wait time of 20 seconds it should report "no messages" after 20 seconds. So a watchdog of ~10 minutes might be more practical (depending on the rest of one's business logic).

Edit: Yes, long polling is already configured on the queue side.

Edit: This is under v2.3.9 (the latest) of aws-sdk and NodeJS v4.4.4.

Answer 1

I've been chasing this (or a similar) issue for a few days now and here's what I've noticed:

The receiveMessage call does eventually return, although only after 120 seconds.
Concurrent calls to receiveMessage are serialised by the AWS SDK library, so making multiple calls in parallel has no effect.
The receiveMessage callback does not error. In fact, after the 120 seconds have passed, it may contain messages.

What can be done about this? This sort of thing can happen for a number of reasons, and many of them can't necessarily be fixed. The answer is to run multiple services, each calling receiveMessage and processing the messages as they come; SQS supports this. At any time, one of these services may hit this 120 second lag, but the other services should be able to continue as normal.

My particular problem is that I have some critical singleton services that can't afford 120 seconds of downtime. For these I will look into either 1) using HTTP instead of SQS to push messages into my service, or 2) spawning worker processes around each of the singletons to fetch the messages from SQS and push them into the service.

Answer 2

I also ran into this issue, not when calling receiveMessage but when calling sendMessage. I also saw hangs of exactly 120 seconds, and I saw the same thing with a few other services, like Firehose.

That led me to this line in the AWS SDK docs:

SQS Constructor

httpOptions:

timeout [Integer] — Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000).

To implement a fix, I override the timeout on my SQS clients: the one that performs sendMessage times out after 10 seconds, and a separate one for receiving times out after 25 seconds (where I long poll for 20 seconds):

var sendClient    = new AWS.SQS({httpOptions:{timeout:10*1000}});
var receiveClient = new AWS.SQS({httpOptions:{timeout:25*1000}});

I've had this out in production for a week now and I've noticed that all of my SQS stalling issues have been eliminated.
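As a rough sketch of how the receive client could be driven, the loop below issues the next long poll only after the previous one has called back, which removes the need for the setInterval and running flag from the question. pollLoop, handleMessages and onError are hypothetical names for illustration; client is assumed to be something like receiveClient above, or anything else with an SQS-style receiveMessage(params, callback):

```javascript
// Sketch only: pollLoop, handleMessages and onError are illustrative names.
// `client` must provide receiveMessage(params, callback), as AWS.SQS does.
function pollLoop(client, queueUrl, handleMessages, onError) {
    var stopped = false;

    function next() {
        if (stopped) { return; }
        client.receiveMessage({
            QueueUrl: queueUrl,
            MaxNumberOfMessages: 10,
            // keep this below the client's httpOptions.timeout (20 s vs 25 s above)
            WaitTimeSeconds: 20
        }, function (err, data) {
            if (err) {
                // a request that exceeds the socket timeout should end up here
                onError(err);
            } else if (data && data.Messages && data.Messages.length > 0) {
                handleMessages(data.Messages);
            }
            // schedule the next poll outside the current call stack
            setImmediate(next);
        });
    }

    next();
    // the returned function stops the loop after the in-flight poll finishes
    return function stop() { stopped = true; };
}
```

The point of the design is that only one receive is ever outstanding, and the socket timeout, rather than a separate timer, is what bounds how long a single poll can hang.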
