
JVM periodically hangs

Trying to debug a misbehaving Java VM. The process in question is a large VM (100GB heap) running Sun VM 1.6u24 on CentOS 5 that is doing routine back-end work, i.e. database access, file I/O and so forth.

After the process was restarted for a software version upgrade, we noticed that its throughput had dropped significantly. Most of the time, top reports that the Java process is fully utilizing 2 cores. During that time, the VM is totally unresponsive: no logs are written and it doesn't respond to outside tools such as jstack or kill -3. Once the VM recovers, the process continues as normal until the next hang.

strace shows that during these hangs, only 2 threads make system calls. These are the VM threads "VM Thread" (21776) and "VM Periodic Task Thread" (21786). Presumably, these 2 threads are using up the CPU time. The application threads occasionally wake up and do their work. The rest of the time they seem to be waiting on various futexes. Incidentally, the first line of the normal phase is always a SIGSEGV.

[pid 21776] sched_yield()               = 0
[pid 21776] sched_yield()               = 0
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21786] futex(0x2aabac71ef28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> )       = 0
[pid 21786] clock_gettime(CLOCK_MONOTONIC, {517080, 280918033}) = 0
[pid 21786] clock_gettime(CLOCK_REALTIME, {1369750039, 794028000}) = 0
[pid 21786] futex(0x2aabb81b94c4, FUTEX_WAIT_PRIVATE, 1, {0, 49923000} <unfinished ...>
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21776] sched_yield()               = 0
[pid 21776] sched_yield()               = 0
[pid 21955] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 21955] rt_sigreturn(0x2b1cde2f54ad <unfinished ...>
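The numeric IDs in the strace output can be tied to thread names such as "VM Thread" because a jstack thread dump lists each thread's native ID as a hexadecimal nid field, while strace and gstack print the same LWP ID in decimal. A minimal sketch of that conversion follows; the class name NidLookup is hypothetical and only the three LWP IDs come from the output above.

```java
class NidLookup {
    // Convert a decimal LWP id (as printed by strace or gstack) into the
    // hexadecimal "nid" form used in a jstack thread dump.
    static String toNid(int lwp) {
        return "0x" + Integer.toHexString(lwp);
    }

    public static void main(String[] args) {
        // LWP ids taken from the strace output above.
        int[] lwps = {21776, 21786, 21955};
        for (int i = 0; i < lwps.length; i++) {
            System.out.println(lwps[i] + " -> nid=" + toNid(lwps[i]));
        }
    }
}
```

Searching a thread dump taken during a normal phase for the resulting nid value shows which Java or VM thread owns a given LWP.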

The problem manifests itself on 2 different servers. Rolling back our code version only worked for one of the 2 servers. No error messages were reported in the system logs, and another Java process on the affected machine is behaving correctly.

The following output was obtained with gstack and shows 2 typical waiting application threads:

Thread 552 (Thread 0x4935f940 (LWP 21906)):
#0  0x00000030b040ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b1cdd8548d6 in os::PlatformEvent::park(long) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#2  0x00002b1cdd92b230 in ObjectMonitor::wait(long, bool, Thread*) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#3  0x00002b1cdd928853 in ObjectSynchronizer::wait(Handle, long, Thread*) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#4  0x00002b1cdd69b716 in JVM_MonitorWait () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#5  0x00002b1cde193cc8 in ?? ()
#6  0x00002b1ce2552d90 in ?? ()
#7  0x00002b1cdd84fc23 in os::javaTimeMillis() () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#8  0x00002b1cde188a82 in ?? ()
#9  0x0000000000000000 in ?? ()
Thread 551 (Thread 0x49460940 (LWP 21907)):
#0  0x00000030b040ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b1cdd854d6f in Parker::park(bool, long) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#2  0x00002b1cdd98a1c8 in Unsafe_Park () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#3  0x00002b1cde193cc8 in ?? ()
#4  0x000000004945f798 in ?? ()
#5  0x00002b1cde188a82 in ?? ()
#6  0x0000000000000000 in ?? ()

We looked at issues with NTPD, including leap second bugs, but the suggested workarounds didn't help, and neither did using external NTPD servers. Restarting the machine itself didn't help either. We have GC logging enabled, and it doesn't look like a GC issue, as there are no messages indicating one. Looking for any suggestions that can help with this issue; any help is much appreciated.

Here are a couple of things I'd look at:

  • When the JVM is unresponsive, use iostat and vmstat to see if the system is thrashing. This can happen when you over-allocate memory, i.e. your overall system is using significantly more virtual memory than physical memory.

  • Turn on the JVM's GC logging, and see if there is a correlation between the JVM going unresponsive and GC runs.
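One way to get timestamps to correlate against the GC log and the vmstat output is a small watchdog thread inside the application that sleeps for a fixed interval and reports how far it overslept. This is only a sketch under assumptions: the class name PauseDetector, the 100 ms interval and the 1000 ms threshold are all hypothetical, and the code sticks to Java 6 syntax to match the VM in the question. Because the watchdog is itself frozen while the VM is stopped, it can only report a stall after the VM has resumed.

```java
import java.util.Date;

class PauseDetector {
    // How much longer than the requested sleep interval the thread was away,
    // in milliseconds (never negative).
    static long overshootMs(long prevNanos, long nowNanos, long intervalMs) {
        long elapsedMs = (nowNanos - prevNanos) / 1000000L;
        return Math.max(0L, elapsedMs - intervalMs);
    }

    // Sleep in a loop for roughly durationMs and print a line whenever a
    // single sleep overshoots by more than thresholdMs.
    static void monitor(long intervalMs, long thresholdMs, long durationMs)
            throws InterruptedException {
        long start = System.nanoTime();
        long prev = start;
        while ((prev - start) / 1000000L < durationMs) {
            Thread.sleep(intervalMs);
            long now = System.nanoTime();
            long overshoot = overshootMs(prev, now, intervalMs);
            if (overshoot > thresholdMs) {
                System.out.println(new Date() + " stall of about "
                        + overshoot + " ms ended");
            }
            prev = now;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Optional first argument: how long to run, in seconds (default 2).
        long durationMs = args.length > 0 ? Long.parseLong(args[0]) * 1000L : 2000L;
        monitor(100L, 1000L, durationMs);
    }
}
```

If a reported stall has no matching entry in the GC log, that would suggest looking beyond the collector, for example at swapping as described in the first bullet.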


 