Thread dump shows Runnable state, but the threads hang for a long time

Question

We are facing an unusual problem in our application. In the last month, our application reached an unrecoverable state; it recovered only after an application restart.

Background: Our application makes a DB query to fetch some information, and this database is hosted on a separate node.

Problematic case: When we analyzed the thread dump, we saw that all the threads were in the RUNNABLE state, fetching data from the database, but the calls had not finished even after 20 minutes.

After the application restart, all threads recovered as expected, and CPU usage was also normal.

Below is the thread dump:

"ThreadPool:2:47" prio=3 tid=0x0000000007334000 nid=0x5f runnable [0xfffffd7fe9f54000]
   java.lang.Thread.State: RUNNABLE
    at oracle.jdbc.driver.T2CStatement.t2cParseExecuteDescribe(Native Method)
    at oracle.jdbc.driver.T2CPreparedStatement.executeForDescribe(T2CPreparedStatement.java:518)
    at oracle.jdbc.driver.T2CPreparedStatement.executeForRows(T2CPreparedStatement.java:764)
    at ora

All threads are in the same state.

Questions:

What could be the reason for this state?
How can we recover in this case?

Answer 1

As others have already mentioned, threads executing native methods are always reported as RUNNABLE, because the JVM doesn't know or care what they are doing.

The Oracle drivers on the client side have no socket timeout by default. This means that if you have network issues, the client's low-level socket may get stuck there forever, resulting in a maxed-out connection pool. You could also check the network traffic towards the Oracle server to see whether it is transmitting any data at all.

When using the thin client, you can set oracle.jdbc.ReadTimeout, but I don't know how to do that for the thick (OCI) client you use; I'm not familiar with it.
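For the thin driver, the timeout can be passed as a connection property. The following is a minimal sketch, assuming the thin driver is in use; the class name, URL, credentials, and the timeout value are hypothetical and only illustrate the idea. The property value is given in milliseconds.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

class ReadTimeoutExample {
    // Builds connection properties for the Oracle thin driver.
    static Properties thinDriverProps(String user, String password, int readTimeoutMillis) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Socket read timeout in milliseconds (applies to the thin driver only).
        props.setProperty("oracle.jdbc.ReadTimeout", Integer.toString(readTimeoutMillis));
        return props;
    }

    // Hypothetical URL: replace host, port, and service name with your own.
    static Connection connect(Properties props) throws SQLException {
        return DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", props);
    }
}
```

With a timeout in place, a read that stalls should fail with an exception instead of blocking the pool thread indefinitely.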

What to do? Research how you can specify a read timeout for the thick ojdbc driver, and watch for exceptions related to connection timeouts, which will clearly signal network issues. If you can change the source, you can wrap the calls and retry the session when you catch timeout-related SQLExceptions.
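The wrap-and-retry idea could look roughly like the sketch below. The names RetryOnTimeout, SqlCall, and withRetry are hypothetical, and it is an assumption that the driver reports timeouts as one of the standard JDBC types SQLTimeoutException or SQLRecoverableException; check what your driver actually throws before relying on it.

```java
import java.sql.SQLException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTimeoutException;

class RetryOnTimeout {
    @FunctionalInterface
    interface SqlCall<T> {
        T run() throws SQLException;
    }

    // Runs the call, retrying only on timeout-like failures, up to maxAttempts times.
    static <T> T withRetry(SqlCall<T> call, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // The call should obtain a fresh connection on each attempt.
                return call.run();
            } catch (SQLTimeoutException | SQLRecoverableException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed with a timeout-like error
    }
}
```

Other SQLExceptions are not caught, so real query errors still surface immediately instead of being retried.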

To quickly address the issue, terminate the connection on the Oracle server manually.

It is also worth checking for session contention; maybe a query is blocking these sessions. If you find one, you'll see which database object causes the problem.
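As a rough illustration of both steps, the statements below could be run by a suitably privileged user. This is a sketch, not a vetted procedure: the class and method names are hypothetical, and it assumes access to the v$session view and the privilege to kill sessions.

```java
class OracleSessionSql {
    // Lists sessions that are currently waiting on another session.
    static final String BLOCKED_SESSIONS =
        "SELECT sid, serial#, blocking_session, event "
      + "FROM v$session WHERE blocking_session IS NOT NULL";

    // Builds the statement that terminates one session, identified by SID and SERIAL#.
    static String killSession(int sid, int serial) {
        return "ALTER SYSTEM KILL SESSION '" + sid + "," + serial + "' IMMEDIATE";
    }
}
```

The query would be executed with Statement.executeQuery and the kill statement with Statement.execute, using the sid and serial# values returned for the blocking session.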

Answer 2

It's probably waiting for network data from the database server. Java threads waiting (blocked) on I/O are described by the JVM as being in the RUNNABLE state, even though from the program's point of view they're blocked.
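This can be observed without a database. The sketch below is a hypothetical demo class that assumes an ordinary platform thread on a HotSpot JVM: it blocks a thread in a socket read on the loopback interface, where the peer never sends anything, and then asks the JVM for that thread's state.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class BlockedReadState {
    // Returns the state the JVM reports for a thread that is blocked in a socket read.
    static Thread.State stateWhileBlockedInRead() throws IOException, InterruptedException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            Thread reader = new Thread(() -> {
                try {
                    InputStream in = client.getInputStream();
                    in.read(); // blocks: the accepted side never writes
                } catch (IOException ignored) {
                    // thrown when the socket is closed at the end of the demo
                }
            }, "blocked-reader");
            reader.setDaemon(true);
            reader.start();
            Thread.sleep(500); // give the reader time to enter read()
            return reader.getState();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stateWhileBlockedInRead());
    }
}
```

The reported state is RUNNABLE even though the thread makes no progress, which matches what the thread dump in the question shows for the native Oracle call.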

Answer 3

Is the system or the JVM hanging? If it is configurable, and if possible, reduce the number of threads/parallel connections.

The threads simply waste CPU cycles while waiting for I/O. Yes, your CPU is unfortunately kept busy by the threads that are awaiting a response from the DB.

Answer 4

Does your code handle transactions manually? If so, maybe some of the code didn't commit() after changing data. Or maybe someone ran a data modification query directly through PL/SQL or something similar and didn't commit, which causes all read operations to hang.
When you experienced that hang and the DB recovered from it, did you check whether some of the data had been rolled back? I'm asking because you said "It was recovered post application restart." That happens when the JDBC driver has changed data but not committed, and a timeout occurs: the DB operation will be rolled back (though this can differ depending on the configuration).
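If transactions are handled manually, one way to make sure every change is either committed or rolled back is to route it through a single helper. This is a minimal sketch; Transactions, SqlWork, and inTransaction are hypothetical names, not part of JDBC or of the asker's code.

```java
import java.sql.Connection;
import java.sql.SQLException;

class Transactions {
    @FunctionalInterface
    interface SqlWork {
        void run(Connection conn) throws SQLException;
    }

    // Runs the work in one transaction: commit on success, rollback on failure.
    static void inTransaction(Connection conn, SqlWork work) throws SQLException {
        conn.setAutoCommit(false);
        try {
            work.run(conn);
            conn.commit();
        } catch (SQLException | RuntimeException e) {
            try {
                conn.rollback();
            } catch (SQLException rollbackFailure) {
                // keep the original error and attach the rollback failure to it
                e.addSuppressed(rollbackFailure);
            }
            throw e;
        }
    }
}
```

A caller would pass its updates as the work argument, for example a lambda that executes the prepared statements on the given connection, so no code path can leave uncommitted changes holding locks.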

Answer 5

Native methods always remain in the RUNNABLE state (OK, unless you change the state from the native method itself, but that doesn't count).

The method can be blocked on I/O, waiting on any other event, running a long CPU-intensive task, or stuck in an endless loop. Take your pick.

How to recover in this case?

Drop the connection from Oracle.

5 answers

Answer 1
1 2015-02-16 17:02:15

Answer 2
1 2012-01-23 08:49:41

Answer 3
0 2015-02-20 11:09:43

Answer 4
0 2015-02-21 01:25:50

Answer 5
0 2012-01-23 09:18:12
