How to determine from strace output which part of a program is failing to acquire a mutex

Question

I'm working on an embedded Linux system (3.12.something), and our application, after some random amount of time, starts hogging the CPU. I've run strace on our application, and right when the problem happens, I see a lot of lines similar to this in the strace output:

[48530666] futex(0x485f78b8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.009002>

I'm pretty sure this is the smoking gun I'm looking for and that there is a race of some sort. However, I now need to figure out how to identify the place in the code that's trying to get this mutex. How can I do that? Our code is compiled with GCC and has debugging symbols in it.

My current thinking (which I haven't tried yet) is to print a string to stdout and flush it before trying to grab any mutex in our system, with the expectation that the string will appear right before strace complains about getting the lock ... but there are a LOT of places in the code that would have to be instrumented like this.

EDIT: Another strange thing that I just realized is that our program doesn't start hogging the CPU until some random time has passed since it was started (anywhere from 5 minutes to 5 hours). During that time, there are zero futex syscalls happening. Why do they suddenly start? From what I've read, I think maybe the locks are being handled entirely in userspace until something fails and falls back to making a futex() syscall...

Any suggestions?

Answer 1

If you frequently lock a mutex for a short time from different threads, e.g. one protecting a global logger, you might cause a so-called lock convoy (also called a thread convoy). The problem doesn't occur until two threads compete for the lock. The first gets the lock and holds it for a short time; then, when it needs the lock a second time, it gets preempted because the second one is already waiting. The second one does the same. The timeslice available to each thread is suddenly reduced to the time between two lock attempts, causing many context switches and the corresponding slowdown. Further, all but one of the threads are always blocked on the mutex, effectively disabling any parallel execution.

In order to fix this, make your threads cooperate instead of competing for resources. For the logger example above, consider e.g. a lock-free queue for the entries, or separate queues for each thread using thread-local storage.
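As a rough illustration of the per-thread idea, the sketch below buffers log lines in thread-local storage and takes the shared lock only when a buffer is flushed. The names `log_line`, `log_flush`, `sink_lock` and `LOG_BUF_SIZE` are made up for this example, and writing to stderr is an assumption; it is not the poster's logger, only one way the suggestion could look.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define LOG_BUF_SIZE 4096 /* hypothetical per-thread buffer size */

/* Shared lock, taken once per flush instead of once per log line. */
static pthread_mutex_t sink_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread gets its own buffer, so appending needs no lock. */
static _Thread_local char tl_buf[LOG_BUF_SIZE];
static _Thread_local size_t tl_len;

/* Write this thread's pending lines to stderr under the shared lock. */
static void log_flush(void)
{
    if (tl_len == 0)
        return;
    pthread_mutex_lock(&sink_lock);
    fwrite(tl_buf, 1, tl_len, stderr);
    pthread_mutex_unlock(&sink_lock);
    tl_len = 0;
}

/* Append one line to the calling thread's buffer; returns bytes pending. */
static size_t log_line(const char *msg)
{
    assert(msg != NULL);
    size_t n = strlen(msg);
    if (n > LOG_BUF_SIZE - 1)
        n = LOG_BUF_SIZE - 1;          /* truncate oversized messages */
    if (tl_len + n + 1 > LOG_BUF_SIZE)
        log_flush();                   /* make room; the only locked path */
    memcpy(tl_buf + tl_len, msg, n);
    tl_len += n;
    tl_buf[tl_len++] = '\n';
    return tl_len;
}
```

Each thread would call `log_flush()` at a convenient point (for example before it exits or goes idle), so the number of lock acquisitions drops from one per message to one per batch.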

Concerning the futex() calls, the idea is to poll an atomic flag first and, after some spins, fall back to the actual OS-level wait. The atomic flag can be checked without the expensive switch between user space and kernel space. For longer waits, sleeping in the kernel (with futex()) avoids burning CPU time with polling. This explains why the program doesn't need any futex() calls in normal operation.
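The spin-then-sleep scheme described here can be sketched roughly as follows. This is a simplified lock in the style of well-known futex tutorials, not glibc's actual pthread_mutex implementation; `sf_lock`, `sf_unlock`, `futex_call` and `SPIN_LIMIT` are names invented for the example, and it assumes a Linux target where `SYS_futex` is available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_LIMIT 100 /* hypothetical number of user-space attempts */

/* state: 0 = free, 1 = locked, 2 = locked and someone may be sleeping */
typedef struct { atomic_int state; } sf_lock_t;

/* Thin wrapper around the raw futex syscall (no timeout, no second address). */
static long futex_call(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sf_lock(sf_lock_t *l)
{
    /* Fast path: try a few times purely in user space, no syscall. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->state, &expected, 1))
            return;
    }
    /* Slow path: mark the lock as contended and sleep in the kernel.
     * FUTEX_WAIT fails with EAGAIN if the value is no longer 2 by the
     * time the kernel checks it; the loop then simply retries. */
    while (atomic_exchange(&l->state, 2) != 0)
        futex_call(&l->state, FUTEX_WAIT_PRIVATE, 2);
}

static void sf_unlock(sf_lock_t *l)
{
    int prev = atomic_exchange(&l->state, 0);
    assert(prev != 0); /* unlocking a free lock is a caller bug */
    if (prev == 2)     /* only enter the kernel if someone may be asleep */
        futex_call(&l->state, FUTEX_WAKE_PRIVATE, 1);
}
```

With a single thread, or with threads that rarely collide, neither function ever reaches `futex_call`, which is consistent with seeing no futex lines in strace until contention starts.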

Answer 2

You basically need to generate a core file at the moment the problem occurs.

Then you can load the program and the core file in GDB and look at what each thread is doing.

man gcore

or, from inside GDB:

generate-core-file

During that time, there are zero futex syscalls happening. Why do they suddenly start?

This is because an uncontended mutex, implemented via futex, doesn't make a system call at all; it only performs an atomic operation, purely in user space. Only a CONTENDED lock is visible as a system call.
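Since only contended locks reach the kernel, one lighter alternative to printing before every lock is a wrapper that reports only the contended acquisitions. The sketch below is an assumption about how this could be done with standard pthread calls; `lock_traced`, `LOCK_TRACED` and `contended_count` are hypothetical names. On glibc the futex word is typically the first field of `pthread_mutex_t`, so the printed address should match the one in the strace line, but treat that as an assumption to verify on your platform.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Number of acquisitions that found the mutex already held. */
static atomic_long contended_count;

/* Try the lock without blocking; report the call site only on contention,
 * then fall back to a normal blocking lock. Returns the pthread error code. */
static int lock_traced(pthread_mutex_t *m, const char *file, int line)
{
    assert(m != NULL);
    int rc = pthread_mutex_trylock(m);
    if (rc == EBUSY) {
        atomic_fetch_add(&contended_count, 1);
        fprintf(stderr, "contended mutex %p at %s:%d\n", (void *)m, file, line);
        rc = pthread_mutex_lock(m);
    }
    return rc;
}

/* Drop-in replacement for pthread_mutex_lock(m) that records the call site. */
#define LOCK_TRACED(m) lock_traced((m), __FILE__, __LINE__)
```

Replacing `pthread_mutex_lock(&m)` with `LOCK_TRACED(&m)` keeps the uncontended path silent, so the output stays quiet until the same moment the futex calls start showing up in strace.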

Answer 1
0 2015-09-23 20:02:23

Answer 2
0 2015-09-23 20:11:37
