JVM crashes when calling JNI functions during GC

Question

We have a Java application with a multi-threaded (pthread) JNI layer that calls back into Java whenever a message arrives from the underlying network.

We noticed that every crash coincides with a GC. We can even reproduce the crash by manually triggering a GC with jmap -histo <pid> while the JNI layer is receiving messages from the network.

Even after reading about how the JVM behaves during GC in this post, https://stackoverflow.com/a/39401467/4523221 , we still can't figure out why the crash is related to GC, since JNI function calls are blocked during GC.

If anyone can shed light on this, that would be great. Thanks in advance.

The following is a stack trace that we collected after a crash in our application.

Program terminated with signal 6, Aborted.
#0  0x0000003cdce325e5 in raise () from /lib64/libc.so.6
#1  0x0000003cdce33dc5 in abort () from /lib64/libc.so.6
#2  0x00007fdafe2516b5 in os::abort(bool) () from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#3  0x00007fdafe3efbf3 in VMError::report_and_die() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#4  0x00007fdafde2f3e2 in report_vm_error(char const*, int, char const*, char const*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#5  0x00007fdafe24c1ff in os::PlatformEvent::park() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#6  0x00007fdafe20c538 in Monitor::ILock(Thread*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#7  0x00007fdafe20c73f in Monitor::lock_without_safepoint_check() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#8  0x00007fdafe2e7a1f in SafepointSynchronize::block(JavaThread*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#9  0x00007fdafe39bcdd in JavaThread::check_safepoint_and_suspend_for_native_trans(JavaThread*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#10 0x00007fdafe0123d8 in jni_NewByteArray ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#11 0x00007fdaa447b7d1 in JNIEnv_::NewByteArray (this=0x7fdaf800c9f8, len=7)
    at /usr/java/jdk1.8.0_65/include/jni.h:1643
---omitted---
#19 0x0000003cdd20b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#20 0x00007fdafe24c133 in os::PlatformEvent::park() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#21 0x00007fdafe20ce27 in Monitor::IWait(Thread*, long) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#22 0x00007fdafe20d5f0 in Monitor::wait(bool, long, bool) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#23 0x00007fdafe39ed51 in Threads::destroy_vm() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#24 0x00007fdafdfff931 in jni_DestroyJavaVM ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#25 0x00007fdafe91a63d in JavaMain () from /usr/java/jdk1.8.0_65/bin/../lib/amd64/jli/libjli.so
#26 0x0000003cdd207aa1 in start_thread () from /lib64/libpthread.so.0
#27 0x0000003cdcee8aad in clone () from /lib64/libc.so.6

This is how we obtained the JNIEnv*, for example:

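JNIEnv *env = 0;
// Ask the JVM for this thread's JNIEnv; this fails if the thread is not attached yet.
jint result = jvm->GetEnv((void **) &env, JNI_VERSION_1_8);
if (result != JNI_OK) {
    // Not attached: attach the current native thread to the JVM.
    result = jvm->AttachCurrentThread((void **) &env, NULL);
}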

Answer 1

After spending days investigating this JNI issue, we finally found the cause, and I would like to share our experience here in the hope that it helps others.

First of all, the reason we needed JNI in the first place was that we had to use a third-party network library that was a Linux native library, and unfortunately that library was the cause of our problem.

The library gave us a callback hook that we implemented to receive incoming network messages, and this callback, we later found out, was simply a signal handler. That means the handler would be called whenever the signal was raised, even during GC.

C threads keep running during JVM safepoints, so it would have been fine if those C threads had not been attached to the JVM. Once they were attached, disaster was bound to strike.
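Since the callback runs in signal-handler context, one cautious option is to do nothing there except notify an ordinary thread. The following is a minimal C++ sketch of the self-pipe pattern, not the code from our application. The names on_library_signal, install_handoff and wait_for_notification are hypothetical, SIGUSR1 merely stands in for whatever signal the library really uses, and the sketch installs the handler itself for demonstration, whereas a third-party library would install its own and call us from it.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical names throughout; SIGUSR1 is only a stand-in signal.
static int g_pipe_fds[2] = {-1, -1};  // [0] = read end, [1] = write end

// Runs in signal-handler context: only the async-signal-safe write() is
// called here. No JNI, no malloc, no locks.
extern "C" void on_library_signal(int signo) {
    int saved_errno = errno;  // write() may clobber errno of the interrupted code
    unsigned char token = static_cast<unsigned char>(signo);
    ssize_t ignored = write(g_pipe_fds[1], &token, 1);
    (void)ignored;            // a full pipe just drops the notification
    errno = saved_errno;
}

// Creates the pipe and installs the handler. Returns true on success.
bool install_handoff(int signo) {
    if (pipe(g_pipe_fds) != 0) return false;
    // Non-blocking write end, so the handler can never block.
    int flags = fcntl(g_pipe_fds[1], F_GETFL);
    if (flags == -1) return false;
    if (fcntl(g_pipe_fds[1], F_SETFL, flags | O_NONBLOCK) == -1) return false;

    struct sigaction sa = {};
    sa.sa_handler = on_library_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, nullptr) == 0;
}

// Runs on an ordinary thread (the one that may be attached to the JVM).
// Blocks until the handler posts a token; returns the signal number,
// or -1 on error.
int wait_for_notification() {
    unsigned char token = 0;
    ssize_t n;
    do {
        n = read(g_pipe_fds[0], &token, 1);
    } while (n == -1 && errno == EINTR);
    return n == 1 ? token : -1;
}
```

In this arrangement the thread that calls wait_for_notification is the only one that would go on to make JNI calls, and it does so outside signal-handler context.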

Here is roughly what we think happened (everything below took place in the JNI layer):

The app started. We initialized and cached JNI resources, e.g. JavaVM*, method IDs, etc.
We registered a C function with the library to receive messages. This function calls JNI APIs to allocate memory for the received messages and passes them on to Java. After that, we simply waited for incoming messages.
When a message finally arrived, the C function mentioned above was called to handle it. But wait... which thread was handling the callback? It could have been the main thread or, hmm... any available thread.
As taught in any JNI textbook, we attached the thread to the JVM, if it was not attached already, before calling any JNI APIs. Great!
Now, during a GC, all Java threads were blocked, but the C layer was still running. If a message arrived at this critical moment, some thread (any thread) was picked to handle it. But which threads were still available during GC? All application threads were blocked, and the only ones still running at that moment (our guess) were unfortunately the GC threads.

The gdb stack trace we were seeing was basically what happens when a GC thread that is in the middle of doing work on the heap gets called by our application to do some application work, followed by a few JNI API calls... BOOM

Solution:

Have a C thread that handles the library callback.
Never attach that thread to the JVM.
Have other threads attach to the JVM to do the native-to-Java transition.
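
The three points above could be sketched as a hand-off queue between the callback thread and a worker thread. This is only an illustration in C++ under stated assumptions: MessageQueue and run_worker are hypothetical names, the JNI calls are reduced to comments so that the sketch stays free of JNI headers, and push is assumed to be called from an ordinary native thread. A mutex is not async-signal-safe, so push must not be called directly from inside a signal handler.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical hand-off queue; none of these names come from our code base.
class MessageQueue {
public:
    // Called by the native callback thread, which never touches JNI.
    void push(const std::uint8_t* data, std::size_t len) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            messages_.emplace_back(data, data + len);  // copy the payload
        }
        ready_.notify_one();
    }

    // Called once at shutdown so that the worker can drain and exit.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a message is available. Returns false once the queue
    // is closed and fully drained.
    bool pop(std::vector<std::uint8_t>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !messages_.empty(); });
        if (messages_.empty()) return false;
        out = std::move(messages_.front());
        messages_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::vector<std::uint8_t>> messages_;
    bool closed_ = false;
};

// Worker loop. In the real application this thread would attach to the JVM
// once (AttachCurrentThread) and, per message, create a byte array and call
// the cached Java method. Here it only counts bytes as a placeholder.
std::size_t run_worker(MessageQueue& queue) {
    std::size_t total_bytes = 0;
    std::vector<std::uint8_t> msg;
    while (queue.pop(msg)) {
        total_bytes += msg.size();  // placeholder for the JNI upcall
    }
    return total_bytes;
}
```

The callback thread would only ever call push, while run_worker would run on a separate thread started by the application. That way the thread the library calls back on never becomes a JVM-attached thread.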

P.S. Some of the details may not be entirely accurate, so advice from any JVM expert is welcome. I will try to correct them as advised.

Thanks

Update 1 (@apangin): We have another gdb stack trace here. Just wondering whether the GangWorker at #18 is a parallel GC thread.

#0  0x00000035b90325e5 in raise () from /lib64/libc.so.6
#1  0x00000035b9033dc5 in abort () from /lib64/libc.so.6
#2  0x00007febd60813b5 in os::abort(bool) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#3  0x00007febd6223673 in VMError::report_and_die() () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#4  0x00007febd60868bf in JVM_handle_linux_signal () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#5  0x00007febd607ce13 in signalHandler(int, siginfo*, void*) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007feb9fcf551c in JNIEnv_::NewByteArray (this=0x7febd001d9f8, len=8) at /usr/java/jdk1.8.0_131/include/jni.h:1643
*<omitted app specific calls>*
#13 <signal handler called>
#14 0x00000035b980b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#15 0x00007febd607b7e3 in os::PlatformEvent::park() () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#16 0x00007febd603c037 in Monitor::IWait(Thread*, long) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#17 0x00007febd603c956 in Monitor::wait(bool, long, bool) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#18 0x00007febd6244d6b in GangWorker::loop() () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#19 0x00007febd6082568 in java_start(Thread*) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#20 0x00000035b9807aa1 in start_thread () from /lib64/libpthread.so.0
#21 0x00000035b90e8aad in clone () from /lib64/libc.so.6

Answer 1: 6, accepted, 2017-05-23 03:13:07