
Linux perf tool showing weird cache miss results

I'm using the Linux perf tools to profile one of the CRONO benchmarks. I'm specifically interested in L1 DCache misses, so I run the program like this:

perf record -e L1-dcache-read-misses -o perf/apsp.cycles apps/apsp/apsp 4 16384 16

It runs fine but prints these warnings:

WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Cannot read kernel map
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.

Threads Returned!
Threads Joined!
Time: 2.932636 seconds
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.709 MB perf/apsp.cycles (44765 samples) ]

I then annotate the output file like this:

perf annotate --stdio -i perf/apsp.cycles --dsos=apsp

But in one of the code sections, I see some weird results:

Percent |      Source code & Disassembly of apsp for L1-dcache-read-misses
---------------------------------------------------------------------------
         :               {
         :                  if((D[W_index[v][i]] > (D[v] + W[v][i])))
   19.36 :        401140:       movslq (%r10,%rcx,4),%rsi
   14.50 :        401144:       lea    (%rax,%rsi,4),%rdi
    1.22 :        401148:       mov    (%r9,%rcx,4),%esi
    5.82 :        40114c:       add    (%rax,%r8,4),%esi
   20.02 :        401150:       cmp    %esi,(%rdi)
    0.00 :        401152:       jle    401156 <do_work(void*)+0x226>
         :                     D[W_index[v][i]] = D[v] + W[v][i];
    9.72 :        401154:       mov    %esi,(%rdi)
   19.93 :        401156:       add    $0x1,%rcx
         :

Now, in those results, how come some arithmetic instructions have L1 read misses? Also, how come the instructions of the second statement cause so many cache misses, even though the data should already have been brought into the cache by the preceding if statement? Am I doing something wrong here? I tried the same on a different machine with root access and got similar results, so I don't think the warnings mentioned above are causing this. But what exactly is going on?

So we have this code:

for(v=0;v<N;v++)
{
    for(int i = 0; i < DEG; i++)
    {
        if((/* (V2) 1000000000 * */ D[W_index[v][i]] > (D[v] + W[v][i])))
            D[W_index[v][i]] = D[v] + W[v][i];

        Q[v]=0; //Current vertex checked
    }
}

Note that I added (V2) as a comment in the code. We come back to it below.

First approximation

Remember that W_index is initialized as W_index[i][j] = i + j (A) .

Let's focus on one iteration of the outer loop (one value of v, i.e. one full run of the inner loop), and first assume that DEG is large. We further assume that the cache is large enough to hold all data for at least two such iterations.

D[W_index[v][i]]

The pointer W_index[v] is loaded into a register. For W_index[v][i] we assume one cache miss: a cache line is 64 bytes, an int is 4 bytes, and we call the program with DEG = 16, so one row spans 64 bytes, i.e. one cache line. The lookup in D always starts at v, so most of the required part of the array is already in cache. Under the assumption that DEG is large, this lookup is free.

D[v] + W[v][i]

The lookup D[v] is free, as it depends only on v. The second lookup is the same as above: one cache miss for the second dimension.

The assignment inside the if touches only data that the condition has already loaded, so it has no influence.

Q[v]=0;

As this depends only on v, it can be ignored.

When we sum up, we get two cache misses.

Second approximation

Now we revisit the assumption that DEG is large. In fact it is wrong, because DEG = 16. So there are fractions of cache misses that we also need to consider.

D[W_index[v][i]]

The lookup W_index[v] costs 1/8 of a cache miss (the pointer is 8 bytes and a cache line is 64 bytes, so we get a cache miss every eighth iteration).

The same is true for D[W_index[v][i]], except that D holds 4-byte integers. On average all but one integer are already in cache, so this costs 1/16 of a cache miss.

D[v] + W[v][i]

D[v] is already in cache (v equals W_index[v][0], so it is the same element as D[W_index[v][0]]). But we get another 1/8 of a cache miss for W[v], by the same reasoning as above.

Q[v]=0;

This is another 1/16 of a cache miss.

And, surprise: if we now use the code variant (V2), where the if clause never evaluates to true, I get 2.395 cache misses per iteration (note that you really need to configure your CPU well, i.e. no hyperthreading, no turbo boost, and the performance governor if possible). The calculation above leads to 2 + 1/8 + 1/16 + 1/8 + 1/16 = 2.375, so we are pretty close.

Third approximation

Now there is this unfortunate if clause. How often does the comparison evaluate to true? We can't say: in the beginning it will be quite often, and in the end it will never evaluate to true.

So let's focus on the very first execution of the complete loop. In this case, D[v] is infinity and W[v][i] is a number between 1 and 101, so the condition evaluates to true in each iteration.

And then it gets hard: we get 2.9 cache misses in this case. Where are they coming from? All data should already be in cache.

But: this is the "mystery of compilers". You never know what they produce in the end. I compiled with GCC and Clang and got the same measurements. When I activated -funroll-loops, I suddenly got 2.5 cache misses. Of course this may be different on your system. When I inspected the assembly, I observed that it is really exactly the same, just with the loop unrolled four times.

So what does this tell us? You never know what your compiler does unless you check it. And even then, you can't be sure.

I guess hardware prefetching or execution order could have an influence here. But this is a mystery.

Regarding perf and your problems with it

I think the measurements you did have two problems:

  • They are relative, and the attribution to an exact line is not that accurate.
  • Your program is multithreaded, which may make this harder to track.

My experience is that when you want good measurements for a specific part of your code, you really need to check it manually. Sometimes, though not always, this can explain things pretty well.


 