
Better synchronization for benchmarking

I'm trying to benchmark some interrupt functionality that I added to a kernel. For the time being, I just want to measure how long it takes for an interrupt to be sent from one core and received on another. I'm roughly doing the following:

volatile bool wait = true;

...

//Sending core:
void run_benchmark() {
    //clear pipeline and record time A with rdtsc
    for (int i = 0; i < 10000; i++) {
        send_interrupt();
        while (wait);
        wait = true;
    }
    //record time B with rdtsc
    //benchmark = (B - A) / 10000
}

...

//Receiving core:
void handle_interrupt(...) {
    wait = false;
    ...
}

I also subtract other overheads out of the benchmark, such as the cost of recording a time, etc. I send the interrupt 10,000 times in order to get a stable value.
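For example, the cost of recording a time can be estimated by timing a loop of back-to-back reads (a minimal sketch; rdtsc_read() is a hypothetical RDTSC wrapper, not part of the original code):

#include <stdint.h>

// Hypothetical wrapper around the RDTSC instruction.
static inline uint64_t rdtsc_read(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) lo) | (((uint64_t) hi) << 32);
}

// Estimate the overhead of one timestamp read from back-to-back reads.
uint64_t timer_overhead(void)
{
    uint64_t start = rdtsc_read();
    for (int i = 0; i < 10000; i++)
        (void) rdtsc_read();
    uint64_t end = rdtsc_read();
    return (end - start) / 10000;
}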

My main concern with this approach is that there will be a cache miss on both the receiving core and the sending core, since they each set wait to a different value. Given how fast interrupt delivery already is, these cache misses are likely having a significant effect on my recorded benchmark.

Is there a better way to do this?

On newer Intel platforms the TSCs of all cores should be synchronized under Linux, so I don't think you need this kind of synchronization (see the corresponding thread in the Intel developer zone).

Why don't you simply take the TSC value on the receiving CPU? Then you can wait on the sending CPU until the variable for time B contains a plausible value.
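A minimal sketch of that idea, assuming the two cores' TSCs are synchronized (time_a, time_b, and rdtsc_read() are illustrative names, not from the original code):

volatile uint64_t time_b = 0;   // written by the receiving core

//Receiving core: record the arrival time inside the handler.
void handle_interrupt(...) {
    time_b = rdtsc_read();      // hypothetical plain RDTSC wrapper, as above
    ...
}

//Sending core: record A, send, then wait until B looks plausible.
uint64_t measure_once() {
    time_b = 0;
    uint64_t time_a = rdtsc_read();
    send_interrupt();
    while (time_b < time_a);        // B is plausible once it lies after A
    return time_b - time_a;         // one-way delivery time in clock ticks
}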

Yes, the L1 caches on both cores will be flapping, because each core keeps invalidating the other's cache line.

I do not know how you implemented send_interrupt(), but if it works through some shared variable, then you also need to account for the cache evictions caused by that communication, which add some delay to the measured values.

As for measuring the execution time, I was working on this recently; here is the code I used to measure clock ticks:

#include <stdint.h>

// Read the TSC with RDTSCP. The IA32_TSC_AUX value returned in ECX
// encodes the chip (node) and core id, as set up by the OS.
uint64_t rdtscp(uint64_t *chip, uint64_t *core)
{
    uint32_t a, d, c;

    __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
    *chip = (c & 0xFFF000) >> 12;
    *core = c & 0xFFF;

    return ((uint64_t) a) | (((uint64_t) d) << 32);
}

Basically, RDTSCP is synchronized across all the cores that share the same chipset in a NUMA architecture, and the rdtscp instruction uses three different CPU registers to form the 64-bit timestamp, which is why the function needs those three variables.

Nonetheless, you can also get other useful information from this instruction, such as the core id and the chip id.

So, you can use this timestamp to measure the time your code takes.
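For example, wrapping the code to be measured between two calls (do_work() is just a placeholder, not from the original code):

uint64_t chip, core;
uint64_t start = rdtscp(&chip, &core);

do_work();   // placeholder for the code being measured

uint64_t end = rdtscp(&chip, &core);
uint64_t ticks = end - start;   // elapsed time in CPU clock ticks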

Another thing that may be useful is to pin your process to a specific CPU core; otherwise it may be scheduled onto a different core, which adds task-migration overhead and so on.
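On Linux, for example, this could be done with sched_setaffinity (a minimal sketch for a user-space process; error handling omitted):

#define _GNU_SOURCE
#include <sched.h>

// Pin the calling thread to a single CPU so it cannot migrate.
void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   // pid 0 = calling thread
}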

PS: its resolution is in CPU clock ticks.
