Inconsistent execution times for a high-priority program under the RT kernel scheduler

Question

Problem

We are trying to implement a program that sends commands to a robot at a fixed cycle time, so it has to be a real-time application. We set up a PC with a PREEMPT_RT Linux kernel and launch our programs with chrt -f 98 or chrt -rr 99 to set the scheduling policy and priority. Loading the kernel and launching the program both seem to work (see details below).
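For context, a fixed-cycle loop of the kind described above is often built on absolute-deadline sleeps. The following is only a minimal sketch under assumptions: the names `addNanoseconds` and `runCyclic` are hypothetical, the use of `CLOCK_MONOTONIC` with `clock_nanosleep` is our choice and not something stated in the question, and the robot command is replaced by a placeholder callback.

```cpp
#include <cerrno>
#include <time.h>

// Advances `t` by `ns` nanoseconds and keeps tv_nsec below one second.
void addNanoseconds(timespec& t, long ns) {
    t.tv_nsec += ns;
    while (t.tv_nsec >= 1000000000L) {
        t.tv_nsec -= 1000000000L;
        ++t.tv_sec;
    }
}

// Calls step(i) once per period for `cycles` cycles.
// `step` stands in for "send one command to the robot" (hypothetical).
template <typename Step>
void runCyclic(long periodNs, int cycles, Step&& step) {
    timespec next{};
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < cycles; ++i) {
        step(i);
        addNanoseconds(next, periodNs);
        // Sleep until an absolute deadline so that wake-up jitter in one
        // cycle does not shift the deadlines of later cycles.
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, nullptr) == EINTR) {
        }
    }
}
```

A call such as `runCyclic(1000000L, 1000, sendCommand)` would then request a 1 ms period for 1000 cycles, where `sendCommand` is whatever callable issues the command.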

We then measured the time (CPU ticks) our program takes to run. We expected this time to be constant with very little variation. What we measured, though, were quite significant differences in computation time. Of course, we thought this could be caused by undefined behavior in our rather complex program, so we created a very basic program and measured its time as well. The behavior was similarly bad.

Question

Why are we not measuring a (close to) constant computation time even for our basic program?
How can we solve this problem?

Environment Description

First of all, we installed an RT Linux kernel on the PC using this tutorial. The main characteristics of the PC are:

PC Characteristics	Details
CPU	Intel(R) Atom(TM) Processor E3950 @ 1.60GHz with 4 cores
Memory (RAM)	8 GB
Operating System	Ubuntu 20.04.1 LTS
Kernel	Linux 5.9.1-rt20 SMP PREEMPT_RT
Architecture	x86-64

Tests

The first time we detected this problem was when we were measuring the time it takes to execute this "complex" program with a single thread. We ran a few tests with this program and with a simpler one, measuring:

The CPU execution times
The wall time (real-world elapsed time)
The difference (Wall time - CPU time) between them and the ratio (CPU time / Wall time).
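The three quantities above can be computed with a small helper. This is a sketch under assumptions, not the code used for the reported results: `TimingSample`, `secondsBetween`, `measure`, `difference` and `ratio` are hypothetical names, and it uses `CLOCK_MONOTONIC` and `CLOCK_THREAD_CPUTIME_ID` instead of the clocks in the test program. The per-thread CPU clock counts only the calling thread, whereas `CLOCK_PROCESS_CPUTIME_ID` sums the CPU time of every thread in the process. The CPU interval is also nested inside the wall interval, so the order of the clock reads alone should not produce a negative difference.

```cpp
#include <time.h>

struct TimingSample {
    double wallSeconds;  // elapsed real time
    double cpuSeconds;   // CPU time consumed by the calling thread
};

// Elapsed seconds between two readings, subtracting field by field to
// avoid losing nanosecond precision on large tv_sec values.
double secondsBetween(const timespec& begin, const timespec& end) {
    return static_cast<double>(end.tv_sec - begin.tv_sec) +
           static_cast<double>(end.tv_nsec - begin.tv_nsec) * 1e-9;
}

// Runs work() once and returns its wall time and thread CPU time.
template <typename Work>
TimingSample measure(Work&& work) {
    timespec wallBegin{}, wallEnd{}, cpuBegin{}, cpuEnd{};
    clock_gettime(CLOCK_MONOTONIC, &wallBegin);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpuBegin);
    work();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpuEnd);
    clock_gettime(CLOCK_MONOTONIC, &wallEnd);
    return {secondsBetween(wallBegin, wallEnd), secondsBetween(cpuBegin, cpuEnd)};
}

// Wall time - CPU time: time spent waiting rather than computing.
double difference(const TimingSample& s) { return s.wallSeconds - s.cpuSeconds; }

// CPU time / Wall time: close to 1 when one thread is busy the whole time.
double ratio(const TimingSample& s) { return s.cpuSeconds / s.wallSeconds; }
```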

We also did a latency test on the PC.

Latency Test

For this one, we followed this tutorial, and these are the results:

Latency Test Generic Kernel

Latency Test RT Kernel

The processes are shown in htop with a priority of RT.

Test Program - Complex

We called the function multiple times in the program and measured the time each call takes. The results of the 2 tests are:

From this we observed that:

The first execution (around 0.28 ms) always takes longer than the second one (around 0.18 ms), but most of the time it is not the longest iteration.
The mode is around 0.17 ms.
For the iterations that take 0.17 ms, the difference is usually 0 and the ratio 1, although this is not exclusive to that duration. For these, it seems that only 1 CPU is being used and it is saturated (there is no waiting time).
When the difference is not 0, it is usually negative. From what we have read here and here, this is because more than 1 CPU is being used.

Test Program - Simple

We did the same test but this time with a simpler program:

#include <vector>
#include <iostream>
#include <time.h>

int main(int argc, char** argv) {
    int iterations = 5000;
    double a = 5.5;
    double b = 5.5;
    double c = 4.5;
    std::vector<double> wallTime(iterations, 0);
    std::vector<double> cpuTime(iterations, 0);
    struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;

    std::cout << "Iteration | WallTime | cpuTime" << std::endl;

    for (unsigned int i = 0; i < iterations; i++) {
        // Start measuring time
        clock_gettime(CLOCK_REALTIME, &beginWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);

        // Function
        a = b + c + i;

        // Stop measuring time and calculate the elapsed time
        clock_gettime(CLOCK_REALTIME, &endWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);

        wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec)*1e-9;
        cpuTime[i] = (endCPUTime.tv_sec - beginCPUTime.tv_sec) + (endCPUTime.tv_nsec - beginCPUTime.tv_nsec)*1e-9;

        std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
    }
    return 0;
}

Final Thoughts

We understand that:

If the ratio == number of CPUs used, they are saturated and there is no waiting time.
If the ratio < number of CPUs used, there is some waiting time (theoretically we should only be using 1 CPU, although in practice we use more).

Of course, we can give more details.

Thanks a lot for your help!

Answer 1

Your function will almost certainly be optimized away, so you are just measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, with some exceptions:
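One way to check this is to force the compiler to keep the computation. A minimal sketch, assuming a single timed statement like the one in the question: the result is written to a `volatile` variable, which the compiler must not elide. The names `elapsedSeconds` and `timeOneIteration` are hypothetical, and `CLOCK_MONOTONIC` replaces `CLOCK_REALTIME` here because it is not affected by adjustments to the system clock.

```cpp
#include <time.h>

// Elapsed seconds between two timespec readings.
double elapsedSeconds(const timespec& begin, const timespec& end) {
    return static_cast<double>(end.tv_sec - begin.tv_sec) +
           static_cast<double>(end.tv_nsec - begin.tv_nsec) * 1e-9;
}

// Times one evaluation of `b + c + i`. Storing through a volatile
// variable is an observable side effect, so the addition and the store
// stay between the two clock reads even with optimizations enabled.
double timeOneIteration(double b, double c, unsigned int i) {
    volatile double sink = 0.0;
    timespec begin{}, end{};
    clock_gettime(CLOCK_MONOTONIC, &begin);
    sink = b + c + i;
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsedSeconds(begin, end);
}
```

Even with the computation kept, a single addition is far shorter than the cost of reading the clocks, so the measured value would still be dominated by `clock_gettime` itself.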

The very first time you run the code (unless you just compiled it), the pages need to be loaded from disk. If you are unlucky, the code spans pages and the loading of the next page is included in the measured time. That is quite unlikely given the code size.

In the first loop iteration, the code and any data need to be loaded into cache, so that iteration takes longer to execute. The branch predictor might also need a few iterations to predict the loop correctly, so the second and third iterations might be slightly longer too.
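Because the first few iterations are affected by these warm-up effects, it can help to drop them before looking at the spread. A minimal sketch, where `Summary` and `summarise` are hypothetical names and the number of warm-up iterations to discard is left to the caller:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Summary {
    double min;
    double median;  // upper median when the sample count is even
    double max;
};

// Drops the first `warmup` samples, then reports min, median and max.
// Returns all zeros if nothing is left after dropping.
Summary summarise(std::vector<double> samples, std::size_t warmup) {
    const std::size_t drop = std::min(warmup, samples.size());
    samples.erase(samples.begin(),
                  samples.begin() + static_cast<std::ptrdiff_t>(drop));
    if (samples.empty()) {
        return {0.0, 0.0, 0.0};
    }
    std::sort(samples.begin(), samples.end());
    return {samples.front(), samples[samples.size() / 2], samples.back()};
}
```

Whatever large values remain after the warm-up samples are discarded are the ones that scheduling would have to explain.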

For everything else I think you can blame scheduling:

an IRQ happens but nothing gets rescheduled
the process gets paused while another process runs
the process gets moved to another CPU thread, leaving the caches hot
the process gets moved to another CPU core, making the L1 cache cold but leaving the L2/L3 caches hot (if your L2 is shared)
the process gets moved to a CPU on another socket, making the L1/L2 caches cold but leaving the L3 cache hot (if L3 is shared)

You can do little about IRQs. Some you can pin to specific cores, but others are essential (like the timer interrupt for the scheduler itself). You just have to live with that.

But you can pin your program to a specific CPU and pin everything else to the other cores, basically reserving the core for the real-time code. I guess you would have to use cgroups for this, to keep everything else off the chosen core. You might still get some kernel threads running on the reserved core; there is nothing you can do about that. But that should eliminate most of the large execution times.
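Pinning the program itself can also be done from inside the process. The following is a sketch under assumptions, not a complete setup: `pinToCpu`, `makeFifo` and `lockMemory` are hypothetical helper names, and the choice of core is up to you. It uses the Linux calls `sched_setaffinity`, `sched_setscheduler` and `mlockall`. Locking memory is an addition that the answer does not mention; it is included because page loading was named above as one source of long iterations. Keeping other processes off the chosen core still has to be arranged separately.

```cpp
#include <sched.h>     // sched_setaffinity is a GNU extension; g++ defines _GNU_SOURCE by default
#include <sys/mman.h>

// Restricts the calling thread to a single CPU. Returns true on success.
bool pinToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Requests SCHED_FIFO at `priority` for the calling thread, similar in
// effect to launching with chrt -f. Usually needs root or CAP_SYS_NICE.
bool makeFifo(int priority) {
    sched_param param{};
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0;
}

// Locks current and future pages in RAM so they are not paged in
// during the timed section. May fail if RLIMIT_MEMLOCK is too low.
bool lockMemory() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}
```

At the start of `main` one might call `pinToCpu(3)`, `makeFifo(98)` and `lockMemory()` and check each return value, where core 3 is only an example.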
