
Cancelling a thread that has a mutex locked does not unlock the mutex

Original post: 2013-01-10 22:02:31. Tags: c++, c, linux, mutex

I'm helping a client with an issue they are having. I'm more of a sysadmin/DBA guy, so I'm struggling to help them. They say it is a bug in the kernel/environment; I'm trying to prove or disprove that before I either insist that the bug is in their code or seek vendor support for the OS.

It happens on Red Hat and Oracle Enterprise Linux 5.7 (and 5.8). The application is written in C++.

The problem they are experiencing is that the main thread starts a separate thread to do a potentially long-running TCP connect() [client connecting to server]. If the 'long-running' aspect takes too long, they cancel the thread and start another one.

This is done because we don't know the state of the server program:

server program up and running --> connection immediately accepted
server program not running, machine and network OK --> connection immediately failed with error 'connection refused'
machine or network crashed or down --> connection takes a long time to fail with error 'no route to host'

The problem is that cancelling the thread that has the mutex locked (with cleanup handlers set up to unlock the mutex) sometimes does NOT unlock the mutex.

That leaves the main thread hung trying to lock the mutex.

Detailed environment info:

glibc-2.5-65
glibc-2.5-65
libcap-1.10-26
kernel-debug-2.6.18-274.el5
glibc-headers-2.5-65
glibc-common-2.5-65
libcap-1.10-26
kernel-doc-2.6.18-274.el5
kernel-2.6.18-274.el5
kernel-headers-2.6.18-274.el5
glibc-devel-2.5-65

Code was built with: c++ -g3 tst2.C -lpthread -o tst2

Any advice and guidance is greatly appreciated.

2 answers

It's correct that cancelled threads do not unlock mutexes they hold. You need to arrange for that to happen manually, which can be tricky because you must be very careful to use the right cleanup handlers around every possible cancellation point. Assuming you're using pthread_cancel to cancel the thread and setting cleanup handlers with pthread_cleanup_push to unlock the mutexes, there are a couple of alternatives you could try which might be simpler to get right and so may be more reliable.
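For reference, the cleanup-handler arrangement assumed above might look roughly like the sketch below. All names (g_lock, unlock_handler, worker, cancel_releases_mutex) are made up for illustration, and sleep() stands in for the blocking connect(). It relies on the default deferred cancellation type, so the cancel can only take effect at the cancellation point, after the handler has been pushed.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; this is a sketch, not the client's code.
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

// Cleanup handler: runs if the thread is cancelled while the handler is pushed.
static void unlock_handler(void* arg) {
    pthread_mutex_unlock(static_cast<pthread_mutex_t*>(arg));
}

static void* worker(void*) {
    // pthread_mutex_lock is not a cancellation point, so with deferred
    // cancellation nothing can interrupt us between the lock and the push.
    pthread_mutex_lock(&g_lock);
    pthread_cleanup_push(unlock_handler, &g_lock);
    sleep(30);                 // stand-in for a blocking call such as connect()
    pthread_cleanup_pop(1);    // normal path: pop the handler and run it
    return NULL;
}

// Starts the worker, cancels it, and reports whether the mutex was released.
bool cancel_releases_mutex() {
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_cancel(tid);
    void* result = NULL;
    pthread_join(tid, &result);

    bool cancelled = (result == PTHREAD_CANCELED);
    bool unlocked = (pthread_mutex_trylock(&g_lock) == 0);
    if (unlocked) pthread_mutex_unlock(&g_lock);
    return cancelled && unlocked;
}
```

If the client's code locks the mutex in one place and pushes the handler somewhere else, or switches to asynchronous cancellation, there are windows in which a cancel leaves the mutex locked, which would match the "sometimes" in the question.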

Using RAII to unlock the mutex will be more reliable. On GNU/Linux pthread_cancel is implemented with a special exception of type abi::__forced_unwind (declared in <cxxabi.h>), so when a thread is cancelled an exception is thrown and the stack is unwound. If a mutex is locked by an RAII type then its destructor will be guaranteed to run if the stack is unwound by a __forced_unwind exception. Boost Thread provides a portable C++ library that wraps Pthreads and is much easier to use. It provides boost::mutex, RAII lock types such as boost::mutex::scoped_lock, and other useful abstractions. Boost Thread also provides its own "thread interruption" mechanism which is similar to Pthread cancellation but not the same, and Pthread cancellation points (such as connect) are not Boost Thread interruption points, which can be helpful for some applications. However, in your client's case, since the point of cancellation is to interrupt the connect call, they probably do want to stick with Pthread cancellation. The (non-portable) way GNU/Linux implements cancellation as an exception means it will work well with Boost's RAII locks on a boost::mutex.
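To illustrate the RAII point without pulling in Boost, here is a minimal sketch with a hand-written guard. ScopedLock, g_raii_lock, raii_worker and raii_guard_survives_cancel are hypothetical names, and sleep() again stands in for connect(). It assumes GCC with glibc, where cancellation unwinds the stack; on other platforms the destructor may not run.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical minimal RAII guard; Boost's lock types play the same role.
class ScopedLock {
public:
    explicit ScopedLock(pthread_mutex_t& m) : m_(m) { pthread_mutex_lock(&m_); }
    ~ScopedLock() { pthread_mutex_unlock(&m_); }
private:
    ScopedLock(const ScopedLock&);             // non-copyable (C++98 style)
    ScopedLock& operator=(const ScopedLock&);
    pthread_mutex_t& m_;
};

static pthread_mutex_t g_raii_lock = PTHREAD_MUTEX_INITIALIZER;

static void* raii_worker(void*) {
    ScopedLock guard(g_raii_lock);  // unlocked by the destructor on any exit path
    sleep(30);                      // cancellation point standing in for connect()
    return NULL;
}

// Cancels the worker and reports whether the guard's destructor released the mutex.
bool raii_guard_survives_cancel() {
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, raii_worker, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_cancel(tid);
    void* result = NULL;
    pthread_join(tid, &result);

    bool cancelled = (result == PTHREAD_CANCELED);
    bool unlocked = (pthread_mutex_trylock(&g_raii_lock) == 0);
    if (unlocked) pthread_mutex_unlock(&g_raii_lock);
    return cancelled && unlocked;
}
```

One caveat with this approach: a catch (...) block in the cancelled thread that swallows the forced-unwind exception instead of rethrowing it will defeat the unwinding, so such handlers need to rethrow.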

There is really no excuse for explicitly locking and unlocking mutexes when you're writing in C++. IMHO the most important and most useful feature of C++ is destructors, which are ideal for automatically releasing resources such as mutex locks.

Another option would be to use a robust mutex, which is created by calling pthread_mutexattr_setrobust on a pthread_mutexattr_t before initializing the mutex. If a thread dies while holding a robust mutex the kernel will make a note of it so that the next thread which tries to lock the mutex gets the special error code EOWNERDEAD. If possible, the new thread can make the data protected by the mutex consistent again and take ownership of the mutex. This is much harder to use correctly than simply using an RAII type to lock and unlock the mutex.
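A rough sketch of the robust-mutex route follows, with hypothetical names (g_robust, dies_holding_lock, lock_after_owner_death). It uses the POSIX spellings pthread_mutexattr_setrobust and pthread_mutex_consistent; as far as I know, glibc only gained those names in 2.12, so on the glibc 2.5 in the question the older nonstandard _np variants would probably be needed instead.

```cpp
#include <pthread.h>
#include <cerrno>
#include <cassert>
#include <cstddef>

// Hypothetical names; a sketch of the robust-mutex recovery path.
static pthread_mutex_t g_robust;

static void* dies_holding_lock(void*) {
    pthread_mutex_lock(&g_robust);
    return NULL;  // the thread terminates while still owning the mutex
}

// Returns the result of the first lock attempt after the owner has died.
int lock_after_owner_death() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&g_robust, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_t tid;
    int rc = pthread_create(&tid, NULL, dies_holding_lock, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(tid, NULL);

    int lock_rc = pthread_mutex_lock(&g_robust);
    if (lock_rc == EOWNERDEAD) {
        // We hold the mutex now. Repair the protected data here, then mark
        // the mutex consistent so that later lockers see a normal mutex.
        pthread_mutex_consistent(&g_robust);
    }
    if (lock_rc == 0 || lock_rc == EOWNERDEAD) pthread_mutex_unlock(&g_robust);
    pthread_mutex_destroy(&g_robust);
    return lock_rc;
}
```

The hard part is the "repair the protected data" step, which the sketch leaves as a comment: the new owner has to work out what state the dead thread left behind.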

A completely different approach would be to decide if you really need to hold the mutex lock while calling connect. Holding mutexes during slow operations is not a good idea. Can't you call connect first and then, if it succeeds, lock the mutex and update whatever shared data the mutex protects?
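Something along these lines, assuming the shared data is just the connected socket descriptor. The names g_conn_lock, g_conn_fd and connect_then_publish are invented for this sketch, and it handles IPv4 only.

```cpp
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical shared state: the connected socket, guarded by g_conn_lock.
static pthread_mutex_t g_conn_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_conn_fd = -1;

// Connects without holding the mutex, then locks briefly to publish the result.
bool connect_then_publish(const char* ip, unsigned short port) {
    assert(ip != NULL);
    sockaddr_in server;
    std::memset(&server, 0, sizeof server);
    server.sin_family = AF_INET;
    server.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &server.sin_addr) != 1) return false;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;

    // Slow part: no mutex is held, so a cancel here cannot leave one locked.
    if (connect(fd, reinterpret_cast<sockaddr*>(&server), sizeof server) != 0) {
        close(fd);
        return false;
    }

    // Fast part: hold the mutex only long enough to update the shared data.
    pthread_mutex_lock(&g_conn_lock);
    g_conn_fd = fd;
    pthread_mutex_unlock(&g_conn_lock);
    return true;
}
```

If the thread is still cancelled during connect, the mutex is no longer at risk, but the socket descriptor would leak unless a cleanup handler or RAII wrapper closes it.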

My preference would be to both use Boost Thread and avoid holding the mutex for long periods.

"The problem they are experiencing is that the main thread starts a separate thread to do a potentially long-running TCP connect() [client connecting to server]. If the 'long-running' aspect takes too long, they cancel the thread and start another one."

Trivial fix -- don't cancel the thread. Is it doing any harm? If necessary, have the thread check (when the connect finally does complete) whether the connection is still needed and, if not, close it, release the mutex, and terminate. You can do this with a boolean variable protected by a mutex.
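A minimal sketch of that flag-based scheme is below. Attempt, attempt_worker and run_attempt are hypothetical names, and a semaphore stands in for the slow connect() so the demo can control when it "completes"; a real worker would call connect() there and close its socket if the result is no longer wanted.

```cpp
#include <pthread.h>
#include <semaphore.h>
#include <cerrno>
#include <cassert>
#include <cstddef>

// Hypothetical shared state for one connection attempt.
struct Attempt {
    pthread_mutex_t lock;
    bool still_needed;   // cleared by the main thread when it gives up waiting
    bool delivered;      // set by the worker if its result was accepted
    sem_t connect_done;  // stand-in for the slow connect() finishing
};

static void* attempt_worker(void* arg) {
    Attempt* a = static_cast<Attempt*>(arg);
    // Stand-in for connect(): blocks with no mutex held.
    while (sem_wait(&a->connect_done) == -1 && errno == EINTR) {}

    pthread_mutex_lock(&a->lock);
    if (a->still_needed) a->delivered = true;  // hand the connection over
    pthread_mutex_unlock(&a->lock);
    // If it was not delivered, a real worker would close() its socket here.
    return NULL;
}

// Runs one attempt; give_up simulates the main thread timing out first.
// Returns whether the worker's result was accepted.
bool run_attempt(bool give_up) {
    Attempt a;
    pthread_mutex_init(&a.lock, NULL);
    sem_init(&a.connect_done, 0, 0);
    a.still_needed = true;
    a.delivered = false;

    pthread_t tid;
    int rc = pthread_create(&tid, NULL, attempt_worker, &a);
    assert(rc == 0);
    (void)rc;

    if (give_up) {
        pthread_mutex_lock(&a.lock);
        a.still_needed = false;   // no cancel: the worker finds out by itself
        pthread_mutex_unlock(&a.lock);
    }
    sem_post(&a.connect_done);    // let the simulated connect() complete
    pthread_join(tid, NULL);      // joined here only so the demo can inspect the result

    bool delivered = a.delivered;
    sem_destroy(&a.connect_done);
    pthread_mutex_destroy(&a.lock);
    return delivered;
}
```

In real code the main thread would start a fresh attempt instead of joining the abandoned one, so the Attempt would need to live on the heap and be freed by whichever side finishes with it last.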

Also, a thread should not hold a mutex while waiting for network I/O. Mutexes should be used only for things that are fast and primarily CPU-limited or perhaps limited by local disk.

Finally, if you feel you need to reach in from the outside and force a thread to do something, step back. You wrote the code for that thread. If you feel that need, it means you didn't code that thread to do what you really wanted it to do. The fix is to modify the thread to do what, and only what, you actually want. Then you won't have to "push it around" from the outside.