
Inconsistent LLC-loads value with perf stat

I'm trying to use perf stat to collect hardware counter information for a benchmark on an Intel Xeon processor (Skylake-based). When I pass the -e LLC-loads and -d -d -d flags, perf stat prints LLC-loads twice: once because of -e LLC-loads and once because the detailed flags are turned on. However, the two results are inconsistent:

$ perf stat -e LLC-loads,LLC-stores,L1-dcache-loads,L1-dcache-stores -d -d -d <my benchmark executable>

Performance counter stats for '<my benchmark executable>':

        5145246847      LLC-loads                                                     (30.78%)
        8167130238      LLC-stores                                                    (30.80%)
      198057619358      L1-dcache-loads                                               (30.80%)
       83142567530      L1-dcache-stores                                              (30.80%)
      197792116698      L1-dcache-loads                                               (30.79%)
       27391515211      L1-dcache-load-misses     #   13.84% of all L1-dcache hits    (30.78%)
        5114059688      LLC-loads                                                     (30.78%)
        3025020254      LLC-load-misses           #   58.97% of all LL-cache hits     (30.76%)
   <not supported>      L1-icache-loads                                             
          58697135      L1-icache-load-misses                                         (30.75%)
      198322967573      dTLB-loads                                                    (30.74%)
         209105723      dTLB-load-misses          #    0.11% of all dTLB cache hits   (30.72%)
           2639992      iTLB-loads                                                    (30.74%)
           1368656      iTLB-load-misses          #   51.84% of all iTLB cache hits   (30.76%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      25.301480157 seconds time elapsed

     585.222699000 seconds user
       1.070800000 seconds sys

As can be seen in the output, there are two LLC-loads in the output with different values. What am I getting wrong?

I've tried multiple different benchmarks assuming that it could be benchmark specific but this behavior is observed everywhere.

Note the multiplexing: you specified so many events that they could not all be counted at the same time. Each one was only counted for part of the total time (the percentage in parentheses, e.g. 30.78%), and the number printed is extrapolated from that. Skylake only has 4 programmable counters per logical core that can be counting different hardware events at once.
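As a rough sketch of that extrapolation, assuming perf scales the raw count by time enabled divided by time running (the percentage it prints is running / enabled). The function name and the numbers below are hypothetical, chosen only to illustrate the arithmetic:

```python
def scale_count(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> int:
    """Extrapolate a multiplexed counter value to the full enabled time."""
    if time_running_ns == 0:
        raise ValueError("event was never scheduled on a hardware counter")
    return round(raw_count * time_enabled_ns / time_running_ns)


# Hypothetical event that owned a counter for 30.78% of the enabled time.
enabled = 25_000_000_000            # ns the event was enabled (made up)
running = int(enabled * 0.3078)     # ns it was actually counting
raw = 1_000_000                     # events seen while counting (made up)
print(scale_count(raw, enabled, running))
```

The printed value is only an estimate: it is exact only if the event rate during the uncounted time matched the rate during the counted time.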

Your program isn't 100% uniform over time, and there's sampling / extrapolation noise, so the two numbers are close but not equal: 5,145,246,847 vs. 5,114,059,688, a difference of about 0.6%. (The multiplexing code didn't combine an event that was specified twice; it just put two instances of it into the rotation.)
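A toy simulation of why two instances of the same event can disagree when they are counted during different time slices. This is only a sketch with a made-up workload and a simplified round-robin rotation, not how the kernel actually schedules events:

```python
def estimate(rates, slices_owned):
    """Extrapolate a total from only the time slices where the event was counted."""
    counted = sum(rates[i] for i in slices_owned)
    return counted * len(rates) / len(slices_owned)


# Hypothetical workload: events per time slice, not uniform over time.
rates = [100, 120, 90, 300, 110, 95, 105, 280, 100]
n_groups = 3  # pretend the rotation cycles through three groups of events

# The two instances of the event sit in different groups,
# so they are counted during different slices.
first = estimate(rates, range(0, len(rates), n_groups))   # slices 0, 3, 6
second = estimate(rates, range(1, len(rates), n_groups))  # slices 1, 4, 7

print("true total:", sum(rates))
print("first instance estimate:", first)
print("second instance estimate:", second)
```

Both estimates are extrapolated from a third of the slices, so each depends on which bursts happened to fall inside its own slices.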

If you counted just two instances of the event, without many other events, you'd expect exactly equal counts, since both would be active at the same time on different HW counters. (Unless the first counter could count some events after being programmed while the kernel was still programming the second; --all-user would avoid that by telling the HW to count only while the logical core is in user space.) For example:

$ perf stat -e LLC-loads,LLC-loads cmp /dev/zero /dev/full
^Ccmp: Interrupt

 Performance counter stats for 'cmp /dev/zero /dev/full':

            31,425      LLC-loads                                                          
            31,425      LLC-loads                                                          

       2.748813842 seconds time elapsed

       1.113722000 seconds user
       1.633880000 seconds sys

(The count is small; I guess cmp uses buffers small enough to fit in L3 cache. I used two different files that both read as all zeros so that cmp couldn't just detect that they were the same file.)

Related:

  • Perf tool stat output: multiplex and scaling of "cycles": instructions:D and cycles:D tell perf to always count those events. Intel CPUs have dedicated non-programmable counters for them, but the multiplexing code doesn't know that. You could use :D on other events too, but that would take counter slots away from the events where you didn't specify :D.


 