Cachegrind: Why so many cache misses?

Question

I'm currently learning about various profiling and performance utilities under Linux, notably valgrind/cachegrind.

I have the following toy program:

#include <iostream>
#include <vector>

int
main() {
    const unsigned int COUNT = 1000000;

    std::vector<double> v;

    for(int i=0;i<COUNT;i++) {
        v.push_back(i);
    }

    double counter = 0;
    for(int i=0;i<COUNT;i+=8) {
        counter += v[i+0];
        counter += v[i+1];
        counter += v[i+2];
        counter += v[i+3];
        counter += v[i+4];
        counter += v[i+5];
        counter += v[i+6];
        counter += v[i+7];
    }

    std::cout << counter << std::endl;
}

Compiling this program with g++ -O2 -g main.cpp, running valgrind --tool=cachegrind ./a.out, and then running cg_annotate cachegrind.out.31694 --auto=yes produces the following result:

    --------------------------------------------------------------------------------
-- Auto-annotated source: /home/andrej/Data/projects/pokusy/dod.cpp
--------------------------------------------------------------------------------
       Ir I1mr ILmr        Dr    D1mr    DLmr        Dw D1mw DLmw 

        .    .    .         .       .       .         .    .    .  #include <iostream>
        .    .    .         .       .       .         .    .    .  #include <vector>
        .    .    .         .       .       .         .    .    .  
        .    .    .         .       .       .         .    .    .  int
        7    1    1         1       0       0         4    0    0  main() {
        .    .    .         .       .       .         .    .    .      const unsigned int COUNT = 1000000;
        .    .    .         .       .       .         .    .    .  
        .    .    .         .       .       .         .    .    .      std::vector<double> v;
        .    .    .         .       .       .         .    .    .  
5,000,000    0    0 1,999,999       0       0         0    0    0      for(int i=0;i<COUNT;i++) {
3,000,000    0    0         0       0       0 1,000,000    0    0          v.push_back(i);
        .    .    .         .       .       .         .    .    .      }
        .    .    .         .       .       .         .    .    .  
        3    0    0         0       0       0         0    0    0      double counter = 0;
  250,000    0    0         0       0       0         0    0    0      for(int i=0;i<COUNT;i+=8) {
  250,000    0    0   125,000       1       1         0    0    0          counter += v[i+0];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+1];
  125,000    1    1   125,000       0       0         0    0    0          counter += v[i+2];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+3];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+4];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+5];
  125,000    0    0   125,000 125,000 125,000         0    0    0          counter += v[i+6];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+7];
        .    .    .         .       .       .         .    .    .      }
        .    .    .         .       .       .         .    .    .  
        .    .    .         .       .       .         .    .    .      std::cout << counter << std::endl;
       11    0    0         6       1       1         0    0    0  }

What I'm worried about is this line:

125,000    0    0   125,000 125,000 125,000         0    0    0          counter += v[i+6];

Why does this line have so many cache misses? The data are in contiguous memory, and each iteration reads 64 bytes of data (assuming the cache line is 64 bytes long).

I'm running this program on Ubuntu Linux 18.04.1, kernel 4.19, g++ 7.3.0. The computer is an AMD 2400G.

Answer 1

It's important to first check the generated assembly code, because that's what cachegrind is going to simulate. The loop that you are interested in gets compiled into the following code:

.L28:
addsd xmm0, QWORD PTR [rax]
add rax, 64
addsd xmm0, QWORD PTR [rax-56]
addsd xmm0, QWORD PTR [rax-48]
addsd xmm0, QWORD PTR [rax-40]
addsd xmm0, QWORD PTR [rax-32]
addsd xmm0, QWORD PTR [rax-24]
addsd xmm0, QWORD PTR [rax-16]
addsd xmm0, QWORD PTR [rax-8]
cmp rdx, rax
jne .L28

There are 8 read accesses per iteration and each is 8 bytes in size. Each element is 8-byte aligned (alignof(double) is 8 on x86-64), but up to two cache lines could be accessed per iteration depending on the address of the vector's underlying array. cachegrind uses dynamic binary instrumentation to obtain the address of each memory access and applies its cache hierarchy model to determine whether an access is a hit or a miss at each level of the hierarchy (it supports only L1 and LLC, though). In this particular instance, it happens that a new cache line is accessed at counter += v[i+6];. The next 7 accesses then go to that same 64-byte cache line. The source code line at which a new cache line is accessed doesn't affect the total number of misses reported by cachegrind; it only determines which source code line those misses are attributed to.

Note that cachegrind simulates a very simplified cache hierarchy based on the machine it's running on. In this case, that is the AMD 2400G, which has a 64-byte line size at all cache levels. In addition, the size of the L3 is 4MB. But since the total array size is 8MB, the following loop:

for(int i=0;i<COUNT;i++) {
    v.push_back(i);
}

will leave only the second half of the array in the LLC. Now, in the very first iteration of the second loop (the one that calculates counter), the first line accessed will not be in the L1 or the LLC. This explains the 1 in the D1mr and DLmr columns. Then at counter += v[i+6];, another line is accessed, which is also a miss in both levels of the cache. In this case, however, the next 7 accesses will all be hits. From this point on, only the access from counter += v[i+6]; will miss, and there are 125,000 such accesses (1 million / 8).

Note that cachegrind is just a simulator, and what actually happens on a real processor can be, and most probably is, very different. For example, on my Haswell processor, measuring with perf, the total number of L1D misses from all of the code (both loops) is only 65,796. So cachegrind may either significantly overestimate or underestimate the miss and hit counts.

Answer 2

I suspect that this happens because the vector's buffer is not aligned on a cache line boundary. That sudden jump in cache misses marks the point where we move on to the next line. So I suggest checking the value of v.data().

Answer 3

In my view this looks absolutely okay if we forget about the first 1M push_backs (8 MB... well, maybe you do not have enough room in L2 for that). If we assume that our data is not in any level of cache, then every time you read 8 doubles you have to ask RAM for the next L1 line. So overall your stats look fine: you are issuing QWORD reads 1M times and generating 125k requests to RAM because of the simple sequential access pattern.

Answer 1
4 2018-11-09 20:19:38

Answer 2
2 2018-11-09 18:55:08

Answer 3
1 2019-09-05 22:46:32
