
Context switches much slower in new Linux kernels

We are looking to upgrade the OS on our servers from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS. Unfortunately, it seems that the latency to run a thread that has become runnable has significantly increased from the 2.6 kernel to the 3.2 kernel. In fact, the latency numbers we are getting are hard to believe.

Let me be more specific about the test. We have a program that runs two threads. The first thread gets the current time (in ticks using RDTSC) and then signals a condition variable once a second. The second thread waits on the condition variable and wakes up when it is signaled. It then gets the current time (in ticks using RDTSC). The difference between the time in the second thread and the time in the first thread is computed and displayed on the console. After this the second thread waits on the condition variable once more. It will be signaled again by the first thread after about a second passes.

So, in a nutshell, the result is one thread-to-thread communication latency measurement via a condition variable, once a second.

In kernel 2.6.32, this latency is somewhere on the order of 2.8-3.5 us, which is reasonable. In kernel 3.2.0, this latency has increased to somewhere on the order of 40-100 us. I have excluded any differences in hardware between the two hosts. They run on identical hardware (dual socket X5687 {Westmere-EP} processors running at 3.6 GHz with hyperthreading, SpeedStep and all C states turned off). The test app changes the affinity of the threads to run them on independent physical cores of the same socket (i.e., the first thread is run on Core 0 and the second thread is run on Core 1), so there is no bouncing of threads on cores or bouncing/communication between sockets.
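For reference, here is a minimal sketch of the kind of test described above. It is not the pastebin original: the conversion from TSC ticks to wall-clock microseconds, error handling and any averaging are left out, and the helper names (pin_to_core, signaler, waiter) are just for illustration.

// latency_sketch.cpp -- minimal sketch of the two-thread wake-up latency test.
// Thread 1 pins itself to one core, takes a TSC timestamp and signals a
// condition variable once a second; thread 2 pins itself to another core,
// waits on the condition variable, takes a TSC timestamp on wake-up and
// prints the difference in ticks.
#include <pthread.h>
#include <sched.h>       // cpu_set_t, CPU_SET (GNU extension; g++ defines _GNU_SOURCE)
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc()

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool     ready    = false;
static uint64_t t_signal = 0;

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void* signaler(void* arg)   // "thread 1"
{
    pin_to_core(*static_cast<int*>(arg));
    for (;;) {
        sleep(1);                  // roughly once a second
        pthread_mutex_lock(&mtx);
        t_signal = __rdtsc();      // timestamp taken just before signaling
        ready = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

static void* waiter(void* arg)     // "thread 2"
{
    pin_to_core(*static_cast<int*>(arg));
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (!ready)
            pthread_cond_wait(&cond, &mtx);
        uint64_t t_wake = __rdtsc();   // timestamp taken right after wake-up
        uint64_t t_sig  = t_signal;
        ready = false;
        pthread_mutex_unlock(&mtx);
        printf("wake-up latency: %llu ticks\n",
               (unsigned long long)(t_wake - t_sig));
    }
    return NULL;
}

int main(int argc, char** argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <core for thread 1> <core for thread 2>\n", argv[0]);
        return 1;
    }
    int core1 = atoi(argv[1]);
    int core2 = atoi(argv[2]);
    pthread_t t1, t2;
    pthread_create(&t2, NULL, waiter,   &core2);
    pthread_create(&t1, NULL, signaler, &core1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

It should build with the same kind of command as shown below (g++ with -lpthread). Note that the measured interval includes the condition variable signal/wake-up path plus the mutex hand-off, which matches the description above.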

The only difference between the two hosts is that one is running Ubuntu 10.04 LTS with kernel 2.6.32-28 (the fast context switch box) and the other is running the latest Ubuntu 12.04 LTS with kernel 3.2.0-23 (the slow context switch box). All BIOS settings and hardware are identical.

Have there been any changes in the kernel that could account for this ridiculous slowdown in how long it takes for a thread to be scheduled to run?

Update: If you would like to run the test on your host and Linux build, I have posted the code to pastebin for your perusal. Compile with:

g++ -O3 -o test_latency test_latency.cpp -lpthread

Run with (assuming you have at least a dual-core box):

./test_latency 0 1 # Thread 1 on Core 0 and Thread 2 on Core 1

Update 2: After much searching through kernel parameters, posts on kernel changes and personal research, I have figured out what the problem is and have posted the solution as an answer to this question.

The solution to the bad thread wake up performance problem in recent kernels has to do with the switch to the intel_idle cpuidle driver from acpi_idle, the driver used in older kernels. Sadly, the intel_idle driver ignores the user's BIOS configuration for the C-states and dances to its own tune. In other words, even if you completely disable all C states in your PC's (or server's) BIOS, this driver will still force them on during periods of brief inactivity, which are almost always happening unless an all core consuming synthetic benchmark (e.g., stress) is running. You can monitor C state transitions, along with other useful information related to processor frequencies, using the wonderful Google i7z tool on most compatible hardware.

To see which cpuidle driver is currently active in your setup, just cat the current_driver file in the cpuidle section of /sys/devices/system/cpu as follows:

cat /sys/devices/system/cpu/cpuidle/current_driver
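If i7z is not handy, the cpuidle entries in sysfs also give a rough picture of which C states the active driver exposes and how often each one has been entered (assuming the usual sysfs layout):

cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/usage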

If you want your modern Linux OS to have the lowest context switch latency possible, add the following kernel boot parameters to disable all of these power saving features:

On Ubuntu 12.04, you can do this by adding them to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub and then running update-grub. The boot parameters to add are:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
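For example, after the edit the relevant line in /etc/default/grub might look something like this (the quiet splash part is simply whatever defaults your installation already had):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll"

Then run update-grub and reboot for the parameters to take effect.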

Here are the gory details about what the three boot options do:

Setting intel_idle.max_cstate to zero will either revert your cpuidle driver to acpi_idle (at least per the documentation of the option), or disable it completely. On my box it is completely disabled (i.e., displaying the current_driver file in /sys/devices/system/cpu/cpuidle produces an output of none). In this case the second boot option, processor.max_cstate=0, is unnecessary. However, the documentation states that setting max_cstate to zero for the intel_idle driver should revert the OS to the acpi_idle driver. Therefore, I put in the second boot option just in case.

The processor.max_cstate option sets the maximum C state for the acpi_idle driver to zero, hopefully disabling it as well. I do not have a system that I can test this on, because intel_idle.max_cstate=0 completely knocks out the cpuidle driver on all of the hardware available to me. However, if your installation does revert you from intel_idle to acpi_idle with just the first boot option, please let me know in the comments whether the second option, processor.max_cstate, did what it was documented to do, so that I can update this answer.

Finally, the last of the three parameters, idle=poll, is a real power hog. It will disable C1/C1E, which will remove the final remaining bit of latency at the expense of a lot more power consumption, so use that one only when it's really necessary. For most this will be overkill, since the C1* latency is not all that large. Using my test application running on the hardware I described in the original question, the latency went from 9 us to 3 us. This is certainly a significant reduction for highly latency sensitive applications (e.g., financial trading, high precision telemetry/tracking, high freq. data acquisition, etc.), but may not be worth the incurred electrical power hit for the vast majority of desktop apps. The only way to know for sure is to profile your application's improvement in performance vs. the actual increase in power consumption/heat of your hardware and weigh the tradeoffs.

Update:

After additional testing with various idle=* parameters, I have discovered that setting idle to mwait, if supported by your hardware, is a much better idea. It seems that the use of the MWAIT/MONITOR instructions allows the CPU to enter C1E without any noticeable latency being added to the thread wake up time. With idle=mwait, you will get cooler CPU temperatures (as compared to idle=poll), less power use and still retain the excellent low latencies of a polling idle loop. Therefore, my updated recommended set of boot parameters for low CPU thread wake up latency based on these findings is:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait

The use of idle=mwait instead of idle=poll may also help with the initiation of Turbo Boost (by helping the CPU stay below its TDP [Thermal Design Power]) and hyperthreading (for which MWAIT is the ideal mechanism for not consuming an entire physical core while at the same time avoiding the higher C states). This has yet to be proven in testing, however, which I will continue to do.

Update 2:

The mwait idle option has been removed from newer 3.x kernels (thanks to user ck_ for the update). That leaves us with two options:

idle=halt - Should work as well as mwait, but test to be sure that this is the case with your hardware. The HLT instruction is almost equivalent to an MWAIT with state hint 0. The problem lies in the fact that an interrupt is required to get out of a HLT state, while a memory write (or interrupt) can be used to get out of the MWAIT state. Depending on what the Linux kernel uses in its idle loop, this can make MWAIT potentially more efficient. So, as I said, test/profile and see if it meets your latency needs...

and

idle=poll - The highest performance option, at the expense of power and heat.

Perhaps what got slower is futex, the building block for condition variables. This will shed some light:

strace -r ./test_latency 0 1 &> test_latency_strace & sleep 8 && killall test_latency

then

for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done

which will show the microseconds taken for the interesting system calls, sorted by time.

On kernel 2.6.32

$ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
futex
 1.000140 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000129 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000124 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000119 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000106 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000103 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000102 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 0.000125 futex(0x7f98ce4c0b88, FUTEX_WAKE_PRIVATE, 2147483647) = 0
 0.000042 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000038 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000037 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000030 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000029 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 0
 0.000028 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000027 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000018 futex(0x7fff82f0ec3c, FUTEX_WAKE_PRIVATE, 1) = 0
nanosleep
 0.000027 nanosleep({1, 0}, {1, 0}) = 0
 0.000019 nanosleep({1, 0}, {1, 0}) = 0
 0.000019 nanosleep({1, 0}, {1, 0}) = 0
 0.000018 nanosleep({1, 0}, {1, 0}) = 0
 0.000018 nanosleep({1, 0}, {1, 0}) = 0
 0.000018 nanosleep({1, 0}, {1, 0}) = 0
 0.000018 nanosleep({1, 0}, 0x7fff82f0eb40) = ? ERESTART_RESTARTBLOCK (To be restarted)
 0.000017 nanosleep({1, 0}, {1, 0}) = 0
rt_sig
 0.000045 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000040 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000038 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000034 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000033 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000032 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000032 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000028 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000028 rt_sigaction(SIGRT_1, {0x37f8c052b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
 0.000027 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000027 rt_sigaction(SIGRTMIN, {0x37f8c05370, [], SA_RESTORER|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
 0.000027 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000023 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000022 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
 0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000019 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0

On kernel 3.1.9

$ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
futex
 1.000129 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000126 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000122 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000115 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000114 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000112 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 1.000109 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 0.000139 futex(0x3f8b8f2fb0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
 0.000043 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000041 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000037 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000036 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
nanosleep
 0.000025 nanosleep({1, 0}, 0x7fff70091d00) = 0
 0.000022 nanosleep({1, 0}, {0, 3925413}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
 0.000021 nanosleep({1, 0}, 0x7fff70091d00) = 0
 0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
 0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
 0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
 0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
 0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
rt_sig
 0.000045 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000044 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000043 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000040 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000038 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000037 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000035 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000034 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000027 rt_sigaction(SIGRT_1, {0x3f892067b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0
 0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000024 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000023 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
 0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
 0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
 0.000019 rt_sigaction(SIGRTMIN, {0x3f89206720, [], SA_RESTORER|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0

I found this 5 year old bug report that contains a "ping pong" performance test that compares:

  1. single-threaded libpthread mutex
  2. libpthread condition variable
  3. plain old Unix signals

I had to add

#include <stdint.h>

in order to compile, which I did with this command:

g++ -O3 -o condvar-perf condvar-perf.cpp -lpthread -lrt

On kernel 2.6.32

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    29085 us; per iteration:   29 ns / 9.4e-05 context switches.
c.v. ping-pong test   elapsed:  4771993 us; per iteration: 4771 ns / 4.03 context switches.
signal ping-pong test elapsed:  8685423 us; per iteration: 8685 ns / 4.05 context switches.

On kernel 3.1.9

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    26811 us; per iteration:   26 ns / 8e-06 context switches.
c.v. ping-pong test   elapsed: 10930794 us; per iteration: 10930 ns / 4.01 context switches.
signal ping-pong test elapsed: 10949670 us; per iteration: 10949 ns / 4.01 context switches.

I conclude that between kernel 2.6.32 and 3.1.9 the context switch has indeed slowed down, though not as much as what you observe in kernel 3.2. I realize this doesn't yet answer your question; I'll keep digging.

Edit: I've found that changing the real-time priority of the process (both threads) improves the performance on 3.1.9 to match 2.6.32. However, setting the same priority on 2.6.32 makes it slow down... go figure - I'll look into it more.

Here are my results now:

On kernel 2.6.32

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    29629 us; per iteration:   29 ns / 0.000418 context switches.
c.v. ping-pong test   elapsed:  6225637 us; per iteration: 6225 ns / 4.1 context switches.
signal ping-pong test elapsed:  5602248 us; per iteration: 5602 ns / 4.09 context switches.
$ chrt -f 1 ./condvar-perf 1000000
NPTL
mutex                 elapsed:    29049 us; per iteration:   29 ns / 0.000407 context switches.
c.v. ping-pong test   elapsed: 16131360 us; per iteration: 16131 ns / 4.29 context switches.
signal ping-pong test elapsed: 11817819 us; per iteration: 11817 ns / 4.16 context switches.
$ 

On kernel 3.1.9

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    26830 us; per iteration:   26 ns / 5.7e-05 context switches.
c.v. ping-pong test   elapsed: 12812788 us; per iteration: 12812 ns / 4.01 context switches.
signal ping-pong test elapsed: 13126865 us; per iteration: 13126 ns / 4.01 context switches.
$ chrt -f 1 ./condvar-perf 1000000
NPTL
mutex                 elapsed:    27025 us; per iteration:   27 ns / 3.7e-05 context switches.
c.v. ping-pong test   elapsed:  5099885 us; per iteration: 5099 ns / 4 context switches.
signal ping-pong test elapsed:  5508227 us; per iteration: 5508 ns / 4 context switches.
$ 

You might also see processors clocking down on more recent processors and Linux kernels due to the pstate driver, which is separate from C-states. So in addition, to disable this, use the following kernel parameter:

intel_pstate=disable
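To check which frequency scaling driver is currently in use (again assuming the standard cpufreq sysfs layout):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver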
