Node.js Cluster - detect worker stuck?

I'm using Node.js with cluster, typically on 2 CPUs, which translates to one master and two workers. I have a sneaky problem: occasionally (very rarely), one of the workers gets 'stuck' for some reason, and the other bears all of the load. I am not sure of the cause and am still investigating (no memory leak, no stack overflow, no exception).

When looking at the processes with the top command on Linux, I can clearly see that one of the node processes sits steadily at 100% CPU load.

What I want to ask today is whether you know of a way to detect this situation (one worker at 100%) so I can kill it off.

Check out the usage package. Something like this should work. I skipped the cluster and worker setup.

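var usage = require('usage');

// Poll the worker's CPU usage every 5 seconds (worker is a cluster worker forked elsewhere).
setInterval(function () {
    usage.lookup(worker.process.pid, function (err, result) {
        if (err) {
            // The lookup can fail, e.g. if the process has already exited.
            console.error(err);
            return;
        }
        console.log(result);
        // Kill the worker when its reported CPU usage exceeds 90%.
        if (result.cpu > 90) {
            worker.kill();
        }
    });
}, 5000);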
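
One caveat with a single-sample threshold is that a healthy worker can briefly spike above 90% and be killed by mistake. A minimal sketch of one way to soften that, assuming it is acceptable to wait for several consecutive high readings before acting; `createCpuGuard` and its parameters are hypothetical names, not part of the usage package:

```javascript
// Hypothetical helper: the returned function reports true only after
// `needed` consecutive samples above `limit`; any lower sample resets the streak.
function createCpuGuard(limit, needed) {
    let streak = 0;
    return function shouldKill(cpuPercent) {
        streak = cpuPercent > limit ? streak + 1 : 0;
        return streak >= needed;
    };
}

// Intended use inside the polling callback (one guard per worker):
//   var shouldKill = createCpuGuard(90, 3);
//   if (!err && shouldKill(result.cpu)) { worker.kill(); }
```

With a 5 second poll and `needed` set to 3, a worker would have to stay above the limit for roughly 15 seconds before being killed.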

OK, so here goes. Turns out my worker gets absolutely stuck. I don't know why, but it may be a cluster problem (what you call a cluster %^&$). Anyway, I had to monitor the workers from the master. What I did was use cron to report from each worker to the master every minute, like so:

process.send({id:cluster.worker.id})

The master receives that message and knows that this worker is alive and well. The master keeps a per-worker count of missing responses, decremented once every minute. If the count reaches 0, which takes 5 minutes, the worker is killed.

This is how I achieved my (own) goal of killing a stuck worker after a few minutes. This is not a complete solution, and I still don't know what causes the workers to get stuck without any exception. But that is life right now.
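
The heartbeat scheme described above could be sketched roughly as follows. This is not the answerer's actual code: it swaps cron for a plain `setInterval`, assumes each heartbeat resets the worker's counter to its full value, and uses SIGKILL on the assumption that a worker spinning in a busy loop may not respond to a graceful shutdown. `createWatchdog`, `startHeartbeat` and `superviseCluster` are hypothetical names.

```javascript
const cluster = require('cluster');

// Hypothetical helper: keeps a countdown per worker id.
function createWatchdog(maxMissed, onStuck) {
    const remaining = new Map();
    return {
        // A heartbeat (or a fresh fork) resets the countdown (assumption).
        beat(id) { remaining.set(id, maxMissed); },
        // Called once per interval: decrement every countdown and
        // report each worker whose countdown reaches 0.
        tick() {
            for (const [id, left] of remaining) {
                if (left - 1 <= 0) {
                    remaining.delete(id);
                    onStuck(id);
                } else {
                    remaining.set(id, left - 1);
                }
            }
        },
        forget(id) { remaining.delete(id); },
    };
}

// Worker side: report to the master on a timer. A worker whose event
// loop is blocked never runs this timer, so its heartbeats stop.
function startHeartbeat(intervalMs) {
    return setInterval(function () {
        process.send({ id: cluster.worker.id });
    }, intervalMs);
}

// Master side: wire the watchdog to cluster events. Not invoked here.
function superviseCluster(intervalMs, maxMissed) {
    const dog = createWatchdog(maxMissed, function (id) {
        const stuck = cluster.workers[id];
        if (stuck) stuck.process.kill('SIGKILL');
    });
    cluster.on('fork', function (w) { dog.beat(w.id); });
    cluster.on('message', function (w, msg) {
        if (msg && msg.id === w.id) dog.beat(w.id);
    });
    cluster.on('exit', function (w) {
        dog.forget(w.id);
        cluster.fork(); // replace the dead worker
    });
    return setInterval(function () { dog.tick(); }, intervalMs);
}
```

In the master one would call something like `superviseCluster(60000, 5)` after forking, and in each worker `startHeartbeat(60000)`, matching the one-minute report and five-minute limit from the answer.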
