Is taking periodic thread dumps in production expensive?

Question

We have a Java app in production where a certain thread is stalling or getting backed up. The thread reads off a queue, and we measure how long it takes for an inserted task to be processed. What's the best way to go about debugging the root cause? Would taking periodic thread dumps (via a script) every minute or so provide more information? What have others done to debug such situations?

Answer 1

In the system I work on, we monitor the time it takes for tasks to execute. If this time exceeds a threshold X, we trigger a thread dump (programmatically, from the point where we measure the time, so not from an external script), followed by another thread dump a few seconds later. This threshold X should be relatively large; in our case it is 5 minutes. If it is exceeded, we can assume that the system is not "just slow" but that something bad happened, like a deadlock or an extremely long blocking call.
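A minimal sketch of that idea, assuming the standard `java.lang.management` API is how the dump is produced (the answer does not say which mechanism is used). The class name `StallDumper`, the method names, and the 5-second gap between dumps are hypothetical; only the 5-minute threshold comes from the answer.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class StallDumper {
    // Threshold X from the answer: 5 minutes.
    static final long THRESHOLD_MILLIS = 5 * 60 * 1000L;

    /** Returns a text dump of every live thread with its full stack trace. */
    static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only ask for lock details when the JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            // ThreadInfo.toString() truncates the stack, so print the frames by hand.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    static boolean exceedsThreshold(long elapsedMillis) {
        return elapsedMillis > THRESHOLD_MILLIS;
    }

    /** Call from the place where the task duration is measured. */
    static void onTaskMeasured(long elapsedMillis) throws InterruptedException {
        if (!exceedsThreshold(elapsedMillis)) {
            return;
        }
        System.err.println(dumpThreads());
        Thread.sleep(5_000); // hypothetical gap; the answer only says "a few seconds"
        System.err.println(dumpThreads());
    }
}
```

Comparing the two dumps shows which threads have not moved in between, which is usually the interesting part.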

So, to answer part of your question: yes, periodic thread dumps could help, but only if a dump is taken at the exact moment the event you are looking for occurs. If you just generate a thread dump every 10 seconds, finding the right dump could be a pain, unless you are looking for a deadlock, which tools can detect for you. I can't answer the performance part of your question.
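For the deadlock case specifically, the JVM can do the searching itself. A small sketch using `ThreadMXBean` (the class and method names `DeadlockCheck` and `describeDeadlocks` are made up for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class DeadlockCheck {
    /** Describes deadlocked threads, or returns an empty string if none are found. */
    static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads also covers java.util.concurrent locks,
        // but is only available when synchronizer monitoring is supported.
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        if (ids == null) {
            return ""; // null means no deadlock was detected
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            sb.append(info.getThreadName())
              .append(" is waiting on ").append(info.getLockName())
              .append(" held by ").append(info.getLockOwnerName())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Calling this on a schedule and logging only non-empty results avoids having to dig through a pile of full dumps.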

Answer 2

Assuming your thread is pulling tasks off a simple concurrent Queue implementation, I'd start by checking whether garbage collection is the culprit. If you're not already doing so, you'll want to add command line options to turn on GC logging:

-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCDateStamps -Xloggc:<some-file>

If you're sure it's not garbage collection, then you could consider using something like jHiccup to monitor stalls, particularly if your application is running in a virtualized environment.
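To give a rough feel for what such a stall monitor measures, here is a toy sketch: a loop sleeps for a fixed interval and records how much longer than requested it was actually away. This is not jHiccup's code or API; `HiccupProbe` and its method are hypothetical names, and a real tool records a full latency histogram rather than a single maximum.

```java
final class HiccupProbe {
    /**
     * Sleeps sleepMillis at a time for the given number of iterations and
     * returns the largest overshoot (in nanoseconds) beyond the requested sleep.
     * A large value suggests the whole JVM or host was paused, not just one thread.
     */
    static long maxHiccupNanos(int iterations, long sleepMillis) {
        long expectedNanos = sleepMillis * 1_000_000L;
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                break;
            }
            long overshoot = System.nanoTime() - start - expectedNanos;
            worst = Math.max(worst, overshoot);
        }
        return worst;
    }
}
```

If this idle loop sees pauses at the same times your queue consumer stalls, the cause is likely outside your application code (GC, the hypervisor, swapping).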

Answer 3

Taking periodic thread dumps in production is an expensive operation for sure, since you are giving the JVM the additional task of printing the current execution stack of every thread it has spawned.

If you have access to the code, I would advise you to add logging that prints performance and timing data.
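As one possible shape for that logging, the sketch below wraps each queued task so that time spent waiting in the queue is recorded separately from time spent executing. The `TimedTask` class is hypothetical and assumes the queue holds `Runnable` tasks, which the question does not state.

```java
import java.util.concurrent.TimeUnit;

final class TimedTask implements Runnable {
    private final Runnable delegate;
    // Captured at construction, so create the wrapper when the task is enqueued.
    private final long enqueuedAt = System.nanoTime();
    private volatile long waitNanos = -1;
    private volatile long runNanos = -1;

    TimedTask(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        long startedAt = System.nanoTime();
        waitNanos = startedAt - enqueuedAt;
        try {
            delegate.run();
        } finally {
            runNanos = System.nanoTime() - startedAt;
            System.out.printf("task waited %d ms in queue, ran for %d ms%n",
                    TimeUnit.NANOSECONDS.toMillis(waitNanos),
                    TimeUnit.NANOSECONDS.toMillis(runNanos));
        }
    }

    long waitNanos() { return waitNanos; }

    long runNanos() { return runNanos; }
}
```

A long wait with a short run points at the consumer thread being stuck on earlier work; a long run points at the task itself.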

If you don't have access to the code, I would recommend using an APM tool such as Dynatrace or AppDynamics, or anything else available to you, to track down the time-consuming method or third-party call.

Hope this helps!

Regards, Eby J
