I don't understand the cache miss counts reported by cachegrind versus the perf tool

Question

I am studying cache effects using a simple micro-benchmark.

I expect that if N is larger than the cache size, the first read of each cache line will miss.

On my machine the cache line size is 64 bytes and each double is 8 bytes, so I expect N/8 misses in total, and cachegrind shows exactly that.

However, the perf tool reports a different result: only 34,256 cache misses (L1-dcache-load-misses).

I suspected hardware prefetching, so I turned it off in the BIOS. The result is the same.

I really don't know why perf reports far fewer cache misses than cachegrind. Could someone give me a reasonable explanation?

1. Here is the simple micro-benchmark program.

    #include <stdio.h>
    #define N 10000000

    double A[N];

    int main(){
        int i;
        double temp=0.0;

        for (i=0 ; i<N ; i++){
            temp = A[i]*A[i];
        }

        return 0;
    }

2. The following is cachegrind's output:

#> sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test

    ==27612== Cachegrind, a cache and branch-prediction profiler
    ==27612== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
    ==27612== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
    ==27612== Command: ./test
    ==27612== 
    --27612-- warning: L3 cache found, using its data for the LL simulation.
    ==27612== 
    ==27612== I   refs:      110,102,998
    ==27612== I1  misses:            728
    ==27612== LLi misses:            720
    ==27612== I1  miss rate:        0.00%
    ==27612== LLi miss rate:        0.00%
    ==27612== 
    ==27612== D   refs:       70,038,455  (60,026,965 rd   + 10,011,490 wr)
    ==27612== D1  misses:      1,251,802  ( 1,251,288 rd   +        514 wr)
    ==27612== LLd misses:      1,251,624  ( 1,251,137 rd   +        487 wr)
    ==27612== D1  miss rate:         1.7% (       2.0%     +        0.0%  )
    ==27612== LLd miss rate:         1.7% (       2.0%     +        0.0%  )
    ==27612== 
    ==27612== LL refs:         1,252,530  ( 1,252,016 rd   +        514 wr)
    ==27612== LL misses:       1,252,344  ( 1,251,857 rd   +        487 wr)
    ==27612== LL miss rate:          0.6% (       0.7%     +        0.0%  )

    Generate a report File
    --------------------------------------------------------------------------------
    I1 cache:         32768 B, 64 B, 4-way associative
    D1 cache:         32768 B, 64 B, 8-way associative
    LL cache:         8388608 B, 64 B, 16-way associative
    Command:          ./test
    Data file:        cache_block
    Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Thresholds:       0.1 100 100 100 100 100 100 100 100
    Include dirs:     
    User annotated:   /home/jin/1_dev/99_test/OI/test.s
    Auto-annotation:  off

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr      D1mr      DLmr         Dw D1mw DLmw 
--------------------------------------------------------------------------------
110,102,998  728  720 60,026,965 1,251,288 1,251,137 10,011,490  514  487  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr      D1mr      DLmr         Dw D1mw DLmw          file:function
--------------------------------------------------------------------------------
110,000,011    1    1 60,000,003 1,250,000 1,250,000 10,000,003    0    0 /home/jin/1_dev/99_test/OI/test.s:main

--------------------------------------------------------------------------------
-- User-annotated source: /home/jin/1_dev/99_test/OI/test.s
--------------------------------------------------------------------------------
        Ir I1mr ILmr         Dr      D1mr      DLmr         Dw D1mw DLmw 

-- line 2 ----------------------------------------
         .    .    .          .         .         .          .    .    .            .comm   A,80000000,32
         .    .    .          .         .         .          .    .    .    .comm   B,80000000,32
         .    .    .          .         .         .          .    .    .    .text
         .    .    .          .         .         .          .    .    .    .globl   main
         .    .    .          .         .         .          .    .    .    .type   main, @function
         .    .    .          .         .         .          .    .    .  main:
         .    .    .          .         .         .          .    .    .  .LFB0:
         .    .    .          .         .         .          .    .    .    .cfi_startproc
         1    0    0          0         0         0          1    0    0    pushq   %rbp
         .    .    .          .         .         .          .    .    .    .cfi_def_cfa_offset 16
         .    .    .          .         .         .          .    .    .    .cfi_offset 6, -16
         1    0    0          0         0         0          0    0    0    movq    %rsp, %rbp
         .    .    .          .         .         .          .    .    .    .cfi_def_cfa_register 6
         1    0    0          0         0         0          0    0    0    movl    $0, %eax
         1    1    1          0         0         0          1    0    0    movq    %rax, -16(%rbp)
         1    0    0          0         0         0          1    0    0    movl    $0, -4(%rbp)
         1    0    0          0         0         0          0    0    0    jmp .L2
         .    .    .          .         .         .          .    .    .  .L3:
10,000,000    0    0 10,000,000         0         0          0    0    0    movl    -4(%rbp), %eax
10,000,000    0    0          0         0         0          0    0    0    cltq
10,000,000    0    0 10,000,000 1,250,000 1,250,000          0    0    0    movsd   A(,%rax,8), %xmm1 
10,000,000    0    0 10,000,000         0         0          0    0    0    movl    -4(%rbp), %eax
10,000,000    0    0          0         0         0          0    0    0    cltq
10,000,000    0    0 10,000,000         0         0          0    0    0    movsd   A(,%rax,8), %xmm0
10,000,000    0    0          0         0         0          0    0    0    mulsd   %xmm1, %xmm0
10,000,000    0    0          0         0         0 10,000,000    0    0    movsd   %xmm0, -16(%rbp)
10,000,000    0    0 10,000,000         0         0          0    0    0    addl    $1, -4(%rbp)
         .    .    .          .         .         .          .    .    .  .L2:
10,000,001    0    0 10,000,001         0         0          0    0    0    cmpl    $9999999, -4(%rbp)
10,000,001    0    0          0         0         0          0    0    0    jle .L3
         1    0    0          0         0         0          0    0    0    movl    $0, %eax
         1    0    0          1         0         0          0    0    0    popq    %rbp
         .    .    .          .         .         .          .    .    .    .cfi_def_cfa 7, 8
         1    0    0          1         0         0          0    0    0    ret
         .    .    .          .         .         .          .    .    .    .cfi_endproc
         .    .    .          .         .         .          .    .    .  .LFE0:
         .    .    .          .         .         .          .    .    .    .size   main, .-main
         .    .    .          .         .         .          .    .    .    .ident  "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
         .    .    .          .         .         .          .    .    .    .section    .note.GNU-stack,"",@progbits

--------------------------------------------------------------------------------
 Ir I1mr ILmr  Dr D1mr DLmr  Dw D1mw DLmw 
--------------------------------------------------------------------------------
100    0    0 100  100  100 100    0    0  percentage of events annotated

3. The following is perf's output:

Performance counter stats for './test' (10 runs):

   113,898,951 instructions              #    0.00  insns per cycle          ( +- 12.73% ) [17.36%]
        53,607 cache-references                                              ( +- 12.92% ) [29.23%]
         1,483 cache-misses              #    2.767 % of all cache refs      ( +- 26.66% ) [39.84%]
    48,612,823 L1-dcache-loads                                               ( +-  4.58% ) [50.45%]
        34,256 L1-dcache-load-misses     #    0.07% of all L1-dcache hits    ( +- 18.94% ) [54.38%]
    14,992,686 L1-dcache-stores                                              ( +-  4.90% ) [52.58%]
         1,980 L1-dcache-store-misses                                        ( +-  6.36% ) [61.83%]
         1,154 LLC-loads                                                     ( +- 61.14% ) [53.22%]
            18 LLC-load-misses           #    1.60% of all LL-cache hits     ( +- 16.26% ) [10.87%]
             0 LLC-prefetches                                               [ 0.00%]

   0.037949840 seconds time elapsed                                          ( +-  3.57% )

More experimental results (2014.05.13):

jin@desktop:~/1_dev/99_test/OI$ sudo perf stat -r 10 -e instructions -e r53024e -e r53014e -e L1-dcache-loads -e L1-dcache-load-misses -e r500f0a -e r500109 ./test

 Performance counter stats for './test' (10 runs):

   116,464,390 instructions              #    0.00  insns per cycle          ( +-  2.67% ) [67.43%]
         5,994 r53024e  <-- L1D hardware prefetch misses                     ( +- 21.74% ) [70.92%]
     1,387,214 r53014e  <-- L1D hardware prefetch requests                   ( +-  2.37% ) [75.61%]
    61,667,802 L1-dcache-loads                                               ( +-  1.27% ) [78.12%]
        26,297 L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +- 48.92% ) [43.24%]
             0 r500f0a  <-- LLC lines allocated                                 [56.71%]
        41,545 r500109  <-- Number of LLC read misses                        ( +-  6.16% ) [50.08%]

   0.037080925 seconds time elapsed

In the result above, the number of "L1D hardware prefetch requests" (1,387,214) is close to the D1 miss count (1,250,000) reported by cachegrind.

My conclusion is that when memory is accessed in a streaming pattern, the L1D prefetcher is active. Also, I can't determine how many bytes are loaded from memory using the LLC miss information.

Is my conclusion correct?

Answer 1

Bottom line: your assumption regarding prefetches is correct, but your workaround isn't.

First, as Carlo pointed out, this loop would usually get optimized out by any compiler. Since both perf and cachegrind show that ~100M instructions do retire, I guess you didn't compile with optimizations, which means the behavior isn't very realistic. For example, your loop variable may be stored in memory instead of in a register, adding pointless memory accesses and skewing the cache counters.
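One way to keep the loads alive when optimizations are enabled is to write each result to a `volatile` variable. The following is only a sketch under that assumption, not the asker's code; the names `sink` and `touch_all` are hypothetical.

```c
#include <stddef.h>

#define N 10000000

/* Same global array as in the question (external linkage, zero-initialized). */
double A[N];

/* Every store to a volatile object must be performed, so the compiler
 * should not be able to discard the loads from A[i] as dead code. */
static volatile double sink;

/* One sequential pass over A; the loop index can now live in a register. */
static void touch_all(void)
{
    for (size_t i = 0; i < N; i++)
        sink = A[i] * A[i];
}
```

Calling `touch_all()` from `main` and building with something like `gcc -O2` should leave one load per element while removing the extra stack traffic for the loop counter.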

Now, the difference between your runs is that cachegrind is just a cache simulator. It doesn't simulate prefetches, so every first access to a line misses, as expected. On the other hand, the real CPU does have HW prefetchers, as you can see, so the first time each line is brought from memory, it's done by a prefetch (thanks to the simple streaming pattern), and not by an actual demand load. This is why perf doesn't count these accesses with the normal counters.
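The cold-miss behavior of a simulator without prefetching can be reproduced with a toy model. The sketch below is not cachegrind's implementation: it assumes a direct-mapped 32 KiB cache with 64-byte lines (a simplification of the 8-way D1 in the report above), 8-byte doubles, and an array that starts at a line-aligned address; `simulate_sequential_misses` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u    /* bytes per cache line */
#define NUM_SETS  512u   /* 512 * 64 B = 32 KiB, direct-mapped (assumption) */

/* Count misses for one sequential read of n doubles starting at byte
 * address 0, with no prefetching modeled. */
static size_t simulate_sequential_misses(size_t n)
{
    uint64_t tags[NUM_SETS];
    unsigned char valid[NUM_SETS] = {0};
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t line = (uint64_t)i * sizeof(double) / LINE_SIZE;
        size_t set = (size_t)(line % NUM_SETS);

        /* Miss if the slot is empty or holds a different line. */
        if (!valid[set] || tags[set] != line) {
            misses++;
            valid[set] = 1;
            tags[set] = line;   /* the full line number serves as the tag */
        }
    }
    return misses;
}
```

For `n = 10000000` this model counts one miss per 64-byte line, i.e. N/8 = 1,250,000, which matches the D1mr figure cachegrind attributes to the first `movsd` in the loop.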

You can see that when you enable the prefetch counter, you get roughly the same N/8 prefetches (plus some additional ones, probably from other types of accesses).

Disabling the prefetcher would seem the right thing to do, however most CPUs don't offer much control over that. You didn't specify what processor type you're using, but if it is Intel, for example, you can see here that only the L2 prefetchers are controlled by the BIOS, while your output shows L1 prefetches: https://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

Search the manuals for your CPU type to see which L1 prefetchers exist, and work out how to get around them. Usually a simple stride (larger than a single cache line) should suffice to trick them, but if that doesn't work, you'll need to make your access pattern more random. You can generate a random permutation of the indices for that.
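A randomized access order could look roughly like the sketch below. It assumes 64-byte lines and 8-byte doubles, touches one element per cache line, and uses a Fisher-Yates shuffle; `shuffle_lines` and `touch_permuted` are hypothetical names, and whether this actually defeats a given prefetcher has to be checked against the counters.

```c
#include <stddef.h>
#include <stdlib.h>

#define LINE_DOUBLES 8   /* assumes 64-byte lines and 8-byte doubles */

/* Fill idx[0..n_lines-1] with a random permutation of cache-line numbers
 * (Fisher-Yates). rand() is used for brevity; if RAND_MAX is smaller than
 * n_lines, a wider random source is needed for a full shuffle. */
static void shuffle_lines(size_t *idx, size_t n_lines, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n_lines; i++)
        idx[i] = i;
    for (size_t i = n_lines; i > 1; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i */
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Read the first double of each cache line in permuted order. The sum is
 * returned so that the loads contribute to an observable result. */
static double touch_permuted(const double *a, const size_t *idx, size_t n_lines)
{
    double sum = 0.0;
    for (size_t i = 0; i < n_lines; i++)
        sum += a[idx[i] * LINE_DOUBLES];
    return sum;
}
```

Note that the index array is itself read sequentially, so it adds its own (prefetch-friendly) traffic to the counters; building the permutation before the measured region keeps the shuffle cost out of the measurement.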
