Debugging JBoss 100% CPU usage

Original 2010-03-15 19:15:41 4 4 java/ debugging/ jboss/ web-applications/ cpu-usage

Originally posted on Server Fault, where it was suggested this question might be better asked here.

We are using JBoss to run two of our WARs. One is our web app, the other is our web service. The web app accesses a database on another machine and makes requests to the web service. The web service makes JMS requests to other machines, aggregates the data, and returns it.

At our biggest client, about once a month the JBoss Java process takes 100% of all CPUs. The machine running JBoss has 8 CPUs. Our web app is still accessible during this time; however, pages take about 3 minutes to load. Restarting JBoss restores everything to normal.

The database machine and all the other machines are fine; only the machine running JBoss is affected. Memory usage is normal. Network utilization is normal. There are no suspect error messages in the JBoss logs.

I have set up a test environment as close as possible to the client's production environment and I've done load testing with as much as 2x the number of concurrent users. I have not been able to get my test environment to replicate the problem.

Where do we go from here? How can we narrow down the problem?

Currently the only plan we have is to wait until the problem occurs in production on its own, then do some debugging to determine the cause. So far people have just restarted JBoss when the problem occurred, to minimize downtime. Next time it happens they will get a developer to take a look. The question is, next time it happens, what can be done to determine the cause?

We could set up a separate JBoss instance on the same box and install the web app separately from the web service. This way, when the problem next occurs, we will know which WAR has the problem (assuming it is our code). This doesn't narrow it down much, though.

Should I enable JMX remote? This way, the next time the problem occurs I can connect with VisualVM and see which threads are taking the CPU and what the hell they are doing. However, is there a significant downside to enabling JMX remote in a production environment?

Is there another way to see which threads are eating the CPU and to get a stack trace to see what they are doing?

Any other ideas?

Thanks!

4 Answers

There's a quick and dirty way of identifying which threads are using up the CPU time on JBoss. Go to the JMX Console with a browser (usually at http://localhost:8080/jmx-console, but it may be different for you) and look for a bean called ServerInfo. It has an operation called listThreadCpuUtilization which dumps the actual CPU time used by each active thread, in a nice tabular format. If there's one misbehaving, it usually stands out like a sore thumb.

There's also the listThreadDump operation, which dumps the stack for every thread to the browser.

It's not as good as a profiler, but it's a much easier way to get the basic information. For production servers, where it's often bad news to connect a profiler, it's very handy.
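For anyone without access to the JMX Console, roughly the same per-thread CPU table can be produced from inside the JVM with the standard java.lang.management API. The sketch below is not how JBoss implements listThreadCpuUtilization; the class name ThreadCpuReport and its Row type are made up for illustration, and it assumes Java 8 or later and a JVM that supports thread CPU time measurement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: lists live threads by CPU time consumed, busiest first.
class ThreadCpuReport {

    /** One row of the report: thread id, name, and CPU time in milliseconds. */
    static final class Row {
        final long id;
        final String name;
        final long cpuMillis;

        Row(long id, String name, long cpuMillis) {
            this.id = id;
            this.name = name;
            this.cpuMillis = cpuMillis;
        }

        @Override
        public String toString() {
            return String.format("%-6d %-40s %10d ms", id, name, cpuMillis);
        }
    }

    static List<Row> snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<Row> rows = new ArrayList<>();
        // Not every JVM supports (or enables) per-thread CPU timing.
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return rows;
        }
        for (long id : threads.getAllThreadIds()) {
            long nanos = threads.getThreadCpuTime(id);   // -1 if the thread has died
            ThreadInfo info = threads.getThreadInfo(id); // null if the thread has died
            if (nanos < 0 || info == null) {
                continue;
            }
            rows.add(new Row(id, info.getThreadName(), nanos / 1_000_000));
        }
        rows.sort(Comparator.comparingLong((Row r) -> r.cpuMillis).reversed());
        return rows;
    }

    public static void main(String[] args) {
        for (Row row : snapshot()) {
            System.out.println(row);
        }
    }
}
```

The numbers are cumulative since each thread started, so taking two snapshots a few seconds apart and comparing them gives a better picture of which thread is busy right now.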

This typically happens with runaway code or unsynchronized access to a HashMap from multiple threads. A simple thread dump (kill -3, as @disown says, or Ctrl-Break in a Windows console) will reveal this problem.
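If a thread dump does show threads stuck inside HashMap.get or HashMap.put, the usual remedy is to replace the shared map with a thread-safe one. The following is a minimal sketch of that idea, not code from the application in question; the class SafeSharedMap and its method are invented for the example, and it assumes Java 8 or later for ConcurrentHashMap.merge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example: several threads update one shared map safely.
class SafeSharedMap {

    static Map<Integer, Integer> countConcurrently(int threads, int incrementsPerThread, int keys)
            throws InterruptedException {
        // A plain HashMap here would be a data race; ConcurrentHashMap is safe
        // to read and write from many threads without external locking.
        Map<Integer, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    // merge performs the read-modify-write atomically per key.
                    counts.merge(i % keys, 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> counts = countConcurrently(8, 10_000, 16);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println("keys=" + counts.size() + " total=" + total);
    }
}
```

With 8 threads each doing 10,000 increments, the total always comes to 80,000, because no update is lost and no thread can see the map in a half-resized state.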

Since you're unable to reproduce it using tests, I think it smells like a concurrency issue; it's usually hard to make test scripts behave randomly enough to catch issues of this type.

I normally try to make it standard operating procedure to take thread dumps of any JVM that is restarted due to operational anomalies, and it's really a requirement for catching those once-a-month things.
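If operators can't easily send a signal to the process, a thread dump can also be produced from inside the JVM using the standard ThreadMXBean. This is only a sketch: the class name ThreadDumper is made up, and how it would be triggered (an admin page, a scheduled check, and so on) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Hypothetical helper: formats a full stack trace for every live thread.
class ThreadDumper {

    static String dump() {
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details, which not every JVM can report.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip threads that vanished mid-dump
            }
            out.append('"').append(info.getThreadName()).append('"')
               .append(" id=").append(info.getThreadId())
               .append(" state=").append(info.getThreadState())
               .append('\n');
            // Print every frame; ThreadInfo.toString() truncates the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Taking two or three dumps a few seconds apart makes it easier to tell a thread that is genuinely spinning from one that just happened to be running at that instant.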

I think you should definitely try to set up a test environment with some load testing in order to reproduce your issue. Profiling would definitely help to pinpoint the problem.

A quick fix would be, next time, to send the JBoss process a kill -3 in order to get a thread dump to analyze (despite the name, kill -3 does not terminate the JVM; it makes it print a thread dump and carry on). The second thing I would check is that you are running with the -server flag and that your GC settings are sane. You could also just run dstat to see what the process is doing during the lockup. But again, it is probably safer to just set up a load testing environment (via EC2 or similar) to reproduce this.
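One way to check the flags and GC behaviour of a running JVM is to read them through the standard management beans. The sketch below uses a made-up class name, JvmSettingsReport, and simply prints the VM name (which on HotSpot indicates whether it is a client or server VM), the startup arguments, and the cumulative collection counts and times. What counts as "sane" is still a judgment call for the reader.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

// Hypothetical helper: summarises JVM flags and cumulative GC activity.
class JvmSettingsReport {

    static String report() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        StringBuilder out = new StringBuilder();
        out.append("VM: ").append(runtime.getVmName()).append('\n');
        // Arguments passed to the JVM at startup, e.g. -server, -Xmx, GC options.
        out.append("Flags: ").append(runtime.getInputArguments()).append('\n');
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.append("GC ").append(gc.getName()).append(": ")
               .append(gc.getCollectionCount()).append(" collections, ")
               .append(gc.getCollectionTime()).append(" ms total\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

If the total GC time climbs rapidly between two reports taken during the lockup, the collector is a likely suspect; if it barely moves, the CPU is going to application threads.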

If you are using JBoss 5.1.0 EAP, there is a bug in JBoss, and they also have a fix. Here is the URL: https://issues.jboss.org/browse/JBPAPP-5193