Linux perf tool showing weird cache miss results

I'm using the Linux perf tools to profile one of the CRONO benchmarks. I'm specifically interested in L1 DCache misses, so I run the program like this:

perf record -e L1-dcache-read-misses -o perf/apsp.cycles apps/apsp/apsp 4 16384 16

It runs fine but generates these warnings:

WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Cannot read kernel map
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.

Threads Returned!
Threads Joined!
Time: 2.932636 seconds
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.709 MB perf/apsp.cycles (44765 samples) ]

I then annotate the output file like this:

perf annotate --stdio -i perf/apsp.cycles --dsos=apsp

But in one of the code sections, I see some weird results:

Percent |      Source code & Disassembly of apsp for L1-dcache-read-misses
---------------------------------------------------------------------------
         :               {
         :                  if((D[W_index[v][i]] > (D[v] + W[v][i])))
   19.36 :        401140:       movslq (%r10,%rcx,4),%rsi
   14.50 :        401144:       lea    (%rax,%rsi,4),%rdi
    1.22 :        401148:       mov    (%r9,%rcx,4),%esi
    5.82 :        40114c:       add    (%rax,%r8,4),%esi
   20.02 :        401150:       cmp    %esi,(%rdi)
    0.00 :        401152:       jle    401156 <do_work(void*)+0x226>
         :                     D[W_index[v][i]] = D[v] + W[v][i];
    9.72 :        401154:       mov    %esi,(%rdi)
   19.93 :        401156:       add    $0x1,%rcx
         :

Now, in those results, how come some arithmetic instructions have L1 read misses? Also, how come the instructions of the second statement cause so many cache misses, even though their data should have been brought into cache by the preceding if statement? Am I doing something wrong here? I tried the same on a different machine with root access and got similar results, so I think the warnings mentioned above are not the cause. But what exactly is going on?

So we have this code:

for(v=0;v<N;v++)
{
    for(int i = 0; i < DEG; i++)
    {
        if((/* (V2) 1000000000 * */ D[W_index[v][i]] > (D[v] + W[v][i])))
            D[W_index[v][i]] = D[v] + W[v][i];

        Q[v]=0; //Current vertex checked
    }
}

Note that I added (V2) as a comment in the code. We come back to it below.

First approximation

Remember that W_index is initialized as W_index[i][j] = i + j (A).

Let's focus on one run of the inner loop (one value of v), and first let's assume that DEG is large. Further, we assume that the cache is large enough to hold all data for at least two iterations.

D[W_index[v][i]]

The lookup W_index[v] is loaded into a register. For W_index[v][i] we assume one cache miss (64-byte cache line, 4 bytes per int, and we call the program with DEG = 16, so one row fills exactly one line). The lookup in D always starts at v, so most of the required part of the array is already in cache. With the assumption that DEG is large, this lookup is free.

D[v] + W[v][i]

The lookup D[v] is free, as it depends only on v. The second lookup is the same as above: one cache miss for the second dimension.

The whole inner statement (the assignment) has no influence: it touches only data that the comparison has already loaded.

Q[v]=0;

As this depends only on v, it can be ignored.

Summing up, we get two cache misses.
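The line counts assumed in this first approximation can be checked with a small helper. This is only a sketch: the function name lines_touched is made up for illustration, and it assumes that the array base is aligned to a cache-line boundary, which the original program does not guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of distinct cache lines covered by `count`
// contiguous elements of `elem_size` bytes, starting at element `first`.
// Assumption: the array itself starts on a cache-line boundary.
inline std::size_t lines_touched(std::size_t first, std::size_t count,
                                 std::size_t elem_size,
                                 std::size_t line_size = 64) {
    assert(elem_size > 0 && line_size > 0);
    if (count == 0) return 0;
    const std::size_t first_byte = first * elem_size;
    const std::size_t last_byte = (first + count) * elem_size - 1;
    return last_byte / line_size - first_byte / line_size + 1;
}
```

With 4-byte ints and DEG = 16, lines_touched(0, 16, 4) gives 1 for one row of W_index or W. For D, the window v .. v + 15 straddles two lines whenever v is not a multiple of 16, for example lines_touched(8, 16, 4) gives 2, but one of those lines was already loaded for the previous v.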

Second approximation

Now we revisit the assumption that DEG is large. In fact it is wrong, because DEG = 16. So there are fractions of cache misses we also need to consider.

D[W_index[v][i]]

The lookup W_index[v] costs 1/8 of a cache miss (it has a size of 8 bytes and a cache line is 64 bytes, so we get a cache miss every eighth iteration).

The same is true for D[W_index[v][i]], except that D holds integers. On average, all but one integer are in cache, so this costs 1/16 of a cache miss.

D[v] + W[v][i]

D[v] is already in cache (W_index[v][0] equals v, so it is the same element). But we get another 1/8 of a cache miss for W[v], by the same reasoning as above.

Q[v]=0;

This is another 1/16 of a cache miss.

And, surprise: if we now use the code (V2), where the if clause never evaluates to true, I get 2.395 cache misses per iteration (note that you really need to configure your CPU well, i.e., no hyperthreading, no turbo boost, and the performance governor if possible). The calculation above would lead to 2.375. So we are pretty close.
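The estimate of 2.375 is just the sum of the terms collected above. A small sketch of that arithmetic follows. The function name and parameters are illustrative, and the element sizes are assumptions: 8-byte row pointers for W_index[v] and W[v], and 4-byte elements for D and Q, as the 1/8 and 1/16 figures imply.

```cpp
#include <cassert>

// Hypothetical model of the expected cache misses per outer iteration (per v).
// Assumptions: 2 full-row misses (W_index[v][*] and W[v][*]), plus the
// fractional misses for the row pointers and for the new D element and Q[v].
inline double expected_misses_per_vertex(int line_bytes, int ptr_bytes,
                                         int elem_bytes) {
    assert(line_bytes > 0);
    const double row_misses = 2.0;                            // two rows, one line each
    const double row_pointers = 2.0 * ptr_bytes / line_bytes; // W_index[v] and W[v]
    const double scalars = 2.0 * elem_bytes / line_bytes;     // D[...] and Q[v]
    return row_misses + row_pointers + scalars;
}
```

For a 64-byte line this evaluates to 2 + 1/8 + 1/8 + 1/16 + 1/16 = 2.375, which is the figure compared against the measured 2.395.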

Third approximation

Now there is this unfortunate if clause. How often does this comparison evaluate to true? We can't say; in the beginning it will be quite often, and in the end it will never evaluate to true.

So let's focus on the very first execution of the complete loop. In this case, D[v] is infinity and W[v][i] is a number between 1 and 101. So the condition evaluates to true in each iteration.

And then it gets hard: we get 2.9 cache misses in this iteration. Where are they coming from? All data should already be in cache.

But: this is the "mystery of compilers". You never know what they produce in the end. I compiled with GCC and Clang and got the same measurements. When I activate -funroll-loops, I suddenly get 2.5 cache misses. Of course this may be different on your system. When I inspected the assembly, I observed that it is really exactly the same, except that the loop has been unrolled four times.
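For readers unfamiliar with unrolling, here is a source-level illustration of what "unrolled four times" means for this loop body. It is only a sketch: the compiler does this on the generated assembly, the function names are made up, and idx and w stand in for the rows W_index[v] and W[v].

```cpp
#include <cassert>
#include <vector>

// Plain relaxation for one vertex v, mirroring the loop body in the question.
inline void relax_row(std::vector<int>& D, const int* idx, const int* w,
                      int v, int deg) {
    for (int i = 0; i < deg; i++)
        if (D[idx[i]] > D[v] + w[i])
            D[idx[i]] = D[v] + w[i];
}

// Same work, with the body repeated four times per loop trip.
// Assumption for this sketch: deg is a multiple of 4 (true for DEG = 16).
inline void relax_row_unrolled4(std::vector<int>& D, const int* idx,
                                const int* w, int v, int deg) {
    assert(deg % 4 == 0);
    for (int i = 0; i < deg; i += 4) {
        if (D[idx[i]] > D[v] + w[i])         D[idx[i]] = D[v] + w[i];
        if (D[idx[i + 1]] > D[v] + w[i + 1]) D[idx[i + 1]] = D[v] + w[i + 1];
        if (D[idx[i + 2]] > D[v] + w[i + 2]) D[idx[i + 2]] = D[v] + w[i + 2];
        if (D[idx[i + 3]] > D[v] + w[i + 3]) D[idx[i + 3]] = D[v] + w[i + 3];
    }
}
```

Both versions perform the same loads and stores in the same order; only the number of loop-control instructions and branches changes, which is why a different miss count is surprising.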

So what does this tell us? You never know what your compiler does unless you check it. And even then, you can't be sure.

I guess hardware prefetching or execution order could have an influence here. But this is a mystery.

Regarding perf and your problems with it

I think the measurements you did have two problems:

  • They are relative; the attribution to an exact line is not that accurate.
  • You are multithreaded, which may be harder to track.

My experience is that when you want good measurements for a specific part of your code, you really need to check it manually. Sometimes, though not always, that can explain things pretty well.
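One way to check a specific region manually is to count the event around it with the Linux perf_event_open system call instead of sampling. The following is a minimal sketch, not necessarily what was used for the numbers above. The helper names are made up, it counts only the calling thread, and it returns -1 when the counter cannot be opened (for example because of perf_event_paranoid settings or missing hardware support).

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Event encoding for "L1 data cache, read accesses, misses".
inline std::uint64_t l1d_read_miss_config() {
    return static_cast<std::uint64_t>(PERF_COUNT_HW_CACHE_L1D) |
           (static_cast<std::uint64_t>(PERF_COUNT_HW_CACHE_OP_READ) << 8) |
           (static_cast<std::uint64_t>(PERF_COUNT_HW_CACHE_RESULT_MISS) << 16);
}

// Opens a counter for the calling thread; returns a file descriptor or -1.
inline int open_l1d_read_miss_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = l1d_read_miss_config();
    attr.disabled = 1;        // start stopped, enable explicitly below
    attr.exclude_kernel = 1;  // count user space only
    attr.exclude_hv = 1;
    // pid = 0, cpu = -1: this thread on any CPU; no group leader, no flags.
    return static_cast<int>(
        syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL));
}

// Runs `region` and returns the number of misses counted, or -1 on failure.
template <typename F>
long long count_l1d_read_misses(F&& region) {
    const int fd = open_l1d_read_miss_counter();
    if (fd < 0) return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    region();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t count = 0;
    const bool ok = read(fd, &count, sizeof(count)) == sizeof(count);
    close(fd);
    return ok ? static_cast<long long>(count) : -1;
}
```

Wrapping one complete pass of the outer loop in count_l1d_read_misses and dividing the result by N would give a misses-per-iteration figure comparable to the estimates above. For the multithreaded benchmark, each worker thread would need its own counter.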
