
Looking for cause of unexpected preemption in linux kernel module

I have a small linux kernel module that is a prototype for a device driver for hardware that doesn't exist yet. The code needs to do a short bit of computation as fast as possible from beginning to end with a duration that is a few microseconds. I am trying to measure whether this is possible with the intel rdtscp instruction using an ndelay() call to simulate the computation. I find that 99.9% of the time it runs as expected, but 0.1% of the time it has a very large delay that appears as if something else is preempting the code despite running inside a spinlock which should be disabling interrupts. This is run using a stock Ubuntu 64 bit kernel (4.4.0-112) with no extra realtime or low latency patches.

Here is some example code that replicates this behavior. This is written as a handler for a /proc filesystem entry for easy testing, but I have only shown the function that actually computes the delays:

#define ITERATIONS 50000
#define SKIPITER 10
DEFINE_SPINLOCK(timer_lock);
static int timing_test_show(struct seq_file *m, void *v) 
{
  uint64_t i;
  uint64_t first, start, stop, delta, max=0, min=1000000;
  uint64_t avg_ticks;
  uint32_t a, d, c;
  unsigned long flags;
  int above30k=0;

  __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
  first = a | (((uint64_t)d)<<32);
  for (i=0; i<ITERATIONS; i++) {
    spin_lock_irqsave(&timer_lock, flags);
    __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
    start = a | (((uint64_t)d)<<32);
    ndelay(1000);
    __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
    stop = a | (((uint64_t)d)<<32);
    spin_unlock_irqrestore(&timer_lock, flags);
    if (i < SKIPITER) continue; /* skip the first iterations so the caches are warm */
    delta = stop-start;
    if (delta < min) min = delta;
    if (delta > max) max = delta;
    if (delta > 30000) above30k++;
  }
  seq_printf(m, "min: %llu max: %llu above30k: %d\n", min, max, above30k);
  avg_ticks = (stop - first) / ITERATIONS;
  seq_printf(m, "Average total ticks/iteration: %llu\n", avg_ticks);
  return 0;
}
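
For completeness, a minimal sketch of the /proc wiring that the question omits might look like the following. The entry name and module boilerplate are my own assumptions, and it targets a 4.x kernel where proc_create() still takes a struct file_operations (5.6+ kernels use struct proc_ops instead):

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

static int timing_test_open(struct inode *inode, struct file *file)
{
  /* single_open() wires timing_test_show() into the seq_file machinery */
  return single_open(file, timing_test_show, NULL);
}

static const struct file_operations timing_test_fops = {
  .owner   = THIS_MODULE,
  .open    = timing_test_open,
  .read    = seq_read,
  .llseek  = seq_lseek,
  .release = single_release,
};

static int __init timing_test_init(void)
{
  /* creates /proc/timing_test, world-readable */
  return proc_create("timing_test", 0444, NULL, &timing_test_fops) ? 0 : -ENOMEM;
}

static void __exit timing_test_exit(void)
{
  remove_proc_entry("timing_test", NULL);
}

module_init(timing_test_init);
module_exit(timing_test_exit);
MODULE_LICENSE("GPL");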

Then if I run:

# cat /proc/timing_test
min: 4176 max: 58248 above30k: 56
Average total ticks/iteration: 4365

This is on a 3.4 GHz Sandy Bridge generation Core i7. The ~4200 TSC ticks is about right for a delay of a little over 1 microsecond. About 0.1% of the time I see delays about 10x longer than expected, and in some cases I have seen times as long as 120,000 ticks.

These delays appear too long to be a single cache miss, even to DRAM. So I think it either has to be several cache misses, or another task preempting the CPU in the middle of my critical section. I would like to understand the possible causes of this to see if they are something we can eliminate or if we have to move to a custom processor/FPGA solution.

Things I have tried:

  • I considered if this could be caused by cache misses. I don't think that could be the case since I ignore the first few iterations, which should load the cache. I have verified by examining the disassembly that there are no memory operations between the two calls to rdtscp, so I think the only possible cache misses are for the instruction cache.
  • Just in case, I moved the spin_lock calls around the outer loop. Then it shouldn't be possible to have any cache misses after the first iteration. However, this made the problem worse.
  • I had heard that the SMM interrupt is unmaskable and mostly transparent and could cause unwanted preemption. However, you can read the SMI interrupt count with rdmsr on MSR_SMI_COUNT. I tried adding that before and after, and there are no SMM interrupts happening while my code is executing.
  • I understand there are also inter-processor interrupts in SMP systems that may interrupt, but I looked at /proc/interrupts before and after and don't see enough of them to explain this behavior.
  • I don't know if ndelay() takes variable clock speed into account, but I think the CPU clock only varies by a factor of 2, so this should not cause a >10x change.
  • I booted with nopti to disable page table isolation in case that was causing problems.

Another thing that I have just noticed is that it is unclear what ndelay() does. Maybe you should show it, since non-trivial problems may be lurking inside it.

For example, I once observed that a piece of my kernel driver code was still preempted when it had a memory leak inside it: as soon as it hit some watermark limit, it was put aside even though it had disabled interrupts.

The 120,000 ticks that you observed in extreme cases sounds a lot like an SMM handler. Smaller values might have been caused by an assortment of microarchitectural events (by the way, have you checked all the performance counters available to you?), but something that long must be caused by a subroutine written by someone who was not trying to achieve minimal latency.

However, you stated that you've checked that no SMIs are observed. This leads me to think that either something is wrong with the kernel facilities used to count/report them, or with your method of watching for them. Hunting for SMIs without a hardware debugger can be a frustrating endeavor.

  • Was SMI_COUNT not changing during the course of your experiment, or was it exactly zero all the time? The latter might indicate that it does not count anything, unless your system is completely free of SMIs, which I doubt in the case of a regular Sandy Bridge.
  • It may be that SMIs are delivered to another core in your system, and an SMM handler is synchronizing the other cores through some sort of mechanism that does not show up in SMI_COUNT. Have you checked the other cores (see the sketch after this list)?
  • In general I would recommend downsizing your system under test to exclude as much as possible. Have you tried booting it with a single core and hyperthreading disabled in the BIOS? Have you tried running the same code on a system that is known not to have SMIs? The same goes for disabling Turbo Boost and frequency scaling in the BIOS: as much timing-related machinery as possible must go.
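
A minimal sketch of how the SMI count could be checked on every online core from inside a module, assuming the 4.x-era rdmsr_on_cpu() helper; the function and variable names here are illustrative, not part of the original code:

#include <linux/smp.h>
#include <linux/seq_file.h>
#include <asm/msr.h>        /* rdmsr_on_cpu() */
#include <asm/msr-index.h>  /* MSR_SMI_COUNT (0x34) */

static void dump_smi_counts(struct seq_file *m)
{
  unsigned int cpu;
  u32 lo, hi;

  /* rdmsr_on_cpu() sends an IPI so the MSR is read on that specific core */
  for_each_online_cpu(cpu) {
    if (rdmsr_on_cpu(cpu, MSR_SMI_COUNT, &lo, &hi) == 0)
      seq_printf(m, "cpu%u SMI count: %u\n", cpu, lo);
  }
}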

FYI, in my system:

timingtest % uname -a
Linux xxxxxx 4.15.0-42-generic #45-Ubuntu SMP Thu Nov 15 19:32:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Replicating your example (with ndelay(1000);) I get:

timingtest % sudo cat /proc/timing_test
min: 3783 max: 66883 above30k: 20
Average total ticks/iteration: 4005

timingtest % sudo cat /proc/timing_test
min: 3783 max: 64282 above30k: 19
Average total ticks/iteration: 4010

Replicating your example (with udelay(1);) I get:

timingtest % sudo cat /proc/timing_test
min: 3308 max: 43301 above30k: 2
Average total ticks/iteration: 3611

timingtest % sudo cat /proc/timing_test
min: 3303 max: 44244 above30k: 2
Average total ticks/iteration: 3600

ndelay(), udelay() and mdelay() are for use in atomic context, as stated here: https://www.kernel.org/doc/Documentation/timers/timers-howto.txt. They all rely on the __const_udelay() function, which is a vmlinux exported symbol (implemented using LFENCE/RDTSC instructions).
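
For reference, this is roughly what the ndelay() macro looks like on x86 (paraphrased from include/asm-generic/delay.h in 4.x kernels; treat the exact constants as approximate): a compile-time-constant argument is scaled from nanoseconds into the fixed-point unit that __const_udelay() busy-waits on, otherwise __ndelay() is called.

/* paraphrase of include/asm-generic/delay.h (4.x kernels):
 * 5 is roughly 2^32 / 1000000000, so (n) * 5 converts nanoseconds into
 * the fixed-point "loops" unit that __const_udelay() expects.
 */
#define ndelay(n)                                  \
  ({                                               \
    if (__builtin_constant_p(n)) {                 \
      if ((n) / 20000 >= 1)                        \
        __bad_ndelay();  /* link error for constant delays >= 20us */ \
      else                                         \
        __const_udelay((n) * 5ul);                 \
    } else {                                       \
      __ndelay(n);                                 \
    }                                              \
  })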

Anyway, I replaced the delay with:

for (delta=0,c=0; delta<500; delta++) {c++; c|=(c<<24); c&=~(c<<16);}

as a trivial busy loop, with the same results.

I also tried _cli()/_sti(), local_bh_disable()/local_bh_enable() and preempt_disable()/preempt_enable(), without success.

Examining SMM interrupts (before and after the delay) with:

__asm__ volatile ("rdmsr" : "=a" (a), "=d" (d) : "c"(0x34) : );
smi_after = (a | (((uint64_t)d)<<32));

I always obtain the same number (either no SMIs occur, or the register is not being updated).

Executing the cat command under trace-cmd to explore what's happening, I surprisingly get results that are not so scattered in time. (!?)

timingtest % sudo trace-cmd record -o trace.dat -p function_graph cat /proc/timing_test 
  plugin 'function_graph'
min: 3559 max: 4161 above30k: 0
Average total ticks/iteration: 5863
...

In my system, the problem can be solved by making use of Power Management Quality of Service; see https://access.redhat.com/articles/65410 and the sketch below. Hope this helps.
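
One way to apply that from the module itself is a CPU-latency PM QoS request, which keeps the cores out of deep C-states while it is active. This is a minimal sketch assuming the 4.x-era in-kernel API (pm_qos_add_request() and friends; 5.7+ kernels replaced it with cpu_latency_qos_add_request()):

#include <linux/pm_qos.h>

static struct pm_qos_request latency_req;

/* request zero CPU wakeup latency for as long as the request exists */
pm_qos_add_request(&latency_req, PM_QOS_CPU_DMA_LATENCY, 0);

/* ... timing-critical work ... */

pm_qos_remove_request(&latency_req);

The Red Hat article linked above covers the userspace side of the same mechanism: keeping /dev/cpu_dma_latency open with a latency value written to it.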


 