Cachegrind: Why so many cache misses?

Question

I'm currently learning about various profiling and performance utilities under Linux, notably valgrind/cachegrind.

I have following toy program:

#include <iostream>
#include <vector>

int
main() {
    const unsigned int COUNT = 1000000;

    std::vector<double> v;

    for(int i=0;i<COUNT;i++) {
        v.push_back(i);
    }

    double counter = 0;
    for(int i=0;i<COUNT;i+=8) {
        counter += v[i+0];
        counter += v[i+1];
        counter += v[i+2];
        counter += v[i+3];
        counter += v[i+4];
        counter += v[i+5];
        counter += v[i+6];
        counter += v[i+7];
    }

    std::cout << counter << std::endl;
}

Compiling this program with g++ -O2 -g main.cpp and running valgrind --tool=cachegrind ./a.out , then cg_annotate cachegrind.out.31694 --auto=yes produces following result:

    --------------------------------------------------------------------------------
-- Auto-annotated source: /home/andrej/Data/projects/pokusy/dod.cpp
--------------------------------------------------------------------------------
       Ir I1mr ILmr        Dr    D1mr    DLmr        Dw D1mw DLmw 

        .    .    .         .       .       .         .    .    .  #include <iostream>
        .    .    .         .       .       .         .    .    .  #include <vector>
        .    .    .         .       .       .         .    .    .  
        .    .    .         .       .       .         .    .    .  int
        7    1    1         1       0       0         4    0    0  main() {
        .    .    .         .       .       .         .    .    .      const unsigned int COUNT = 1000000;
        .    .    .         .       .       .         .    .    .  
        .    .    .         .       .       .         .    .    .      std::vector<double> v;
        .    .    .         .       .       .         .    .    .  
5,000,000    0    0 1,999,999       0       0         0    0    0      for(int i=0;i<COUNT;i++) {
3,000,000    0    0         0       0       0 1,000,000    0    0          v.push_back(i);
        .    .    .         .       .       .         .    .    .      }
        .    .    .         .       .       .         .    .    .  
        3    0    0         0       0       0         0    0    0      double counter = 0;
  250,000    0    0         0       0       0         0    0    0      for(int i=0;i<COUNT;i+=8) {
  250,000    0    0   125,000       1       1         0    0    0          counter += v[i+0];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+1];
  125,000    1    1   125,000       0       0         0    0    0          counter += v[i+2];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+3];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+4];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+5];
  125,000    0    0   125,000 125,000 125,000         0    0    0          counter += v[i+6];
  125,000    0    0   125,000       0       0         0    0    0          counter += v[i+7];
        .    .    .         .       .       .         .    .    .      }
        .    .    .         .       .       .         .    .    .  
        .    .    .         .       .       .         .    .    .      std::cout << counter << std::endl;
       11    0    0         6       1       1         0    0    0  }

What I'm worried about is this line:

125,000    0    0   125,000 125,000 125,000         0    0    0          counter += v[i+6];

Why this line has so many cache-misses? The data are in contiguous memory, each iteration I'm reading 64-bytes of data (assuming the cache line is 64 bytes long).

I'm running this program on Ubuntu Linux 18.04.1, kernel 4.19, g++ 7.3.0. Computer is AMD 2400G.

Answer 1

It's important to first check the generated assembly code because that's what cachegrind is going to simulate. The loop that you are interested in gets compiled into the following code:

.L28:
addsd xmm0, QWORD PTR [rax]
add rax, 64
addsd xmm0, QWORD PTR [rax-56]
addsd xmm0, QWORD PTR [rax-48]
addsd xmm0, QWORD PTR [rax-40]
addsd xmm0, QWORD PTR [rax-32]
addsd xmm0, QWORD PTR [rax-24]
addsd xmm0, QWORD PTR [rax-16]
addsd xmm0, QWORD PTR [rax-8]
cmp rdx, rax
jne .L28

There are 8 read accesses per iteration and each is 8-byte in size. In C++, it's guaranteed that each element is 8-byte aligned, but up to two cache lines could be accessed per iteration depending on the address of the array of the v vector. cachegrind uses dynamic binary instrumentation to obtain the address of each memory access and apply its cache hierarchy model to determine whether an access is a hit or miss at each level in the hierarchy (it supports only L1 and LLC though). In this particular instance, it happens that a new cache line is accessed at counter += v[i+6]; . Then, the next 7 accesses would be to the same 64-byte cache line. The source code line at which a new cache line is accessed doesn't impact the total number of misses reported by cachegrind. It will just tell you that a different source code line incurs that many misses.

Note that cachegrind simulates a very simplified cache hierarchy based on the machine it's running on. In this case, it is AMD 2400G, which has a 64-byte line size at all cache levels. In addition, the size of the L3 is 4MB. But since the total array size is 8MB, then the following loop:

for(int i=0;i<COUNT;i++) {
    v.push_back(i);
}

will leave only the second half of the array in the LLC. Now in the very first iteration of the second loop in which counter is calculated, the first line accessed will not be in the L1 or LLC. This explains the 1 in D1mr and DLmr columns. Then at counter += v[i+6]; , another line is accessed, which is also a miss in both levels of the cache. However, in this case, the next 7 accesses will all be hits. At this point, only the access from counter += v[i+6]; will miss and there are 125,000 such accesses (1 million / 8).

Note that cachegrind is just a simulator and what actually happens on a real processor can be and most probably is very different. For example, on my Haswell processor, by using perf , the total number of L1D misses from all of the code (both loops) is only 65,796. So cachegrind may either significantly overestimate or underestimate the miss and hit counts.

Answer 2

I suspect that this happens because vector buffer is not aligned on cache line boundary. That is sudden jump in cache misses marks a point when we proceed to a next line. So I suggest to check v.data() value.

Answer 3

In my vision this looks absolutely okay if we forget about first 1M push-backs (8Mb... well maybe you do not have enough room in L2 for that). So if we will assume that our data is not in any level of cache then every time you read 8 doubles you have to ask RAM for next L1 line. So overall your stats looks fine. you are invoking QWORD reads 1M times and generate 125k requests to RAM due to simplet sequential access pattern.

Cachegrind: Why so many cache misses?

Question

3 answers

solution1
4 2018-11-09 20:19:38

solution2
2 2018-11-09 18:55:08

solution3
1 2019-09-05 22:46:32

Cachegrind: Why so many cache misses?

Question

3 answers

solution1 4 2018-11-09 20:19:38

solution2 2 2018-11-09 18:55:08

solution3 1 2019-09-05 22:46:32

solution1
4 2018-11-09 20:19:38

solution2
2 2018-11-09 18:55:08

solution3
1 2019-09-05 22:46:32