

How do you interpret cachegrind output for cache misses?

Out of curiosity I coded up several different versions of matrix multiplication and ran cachegrind against them. In my results below, I was wondering which parts are L1, L2, and L3 misses and references, and what it all really means. Below is my code for the matrix multiplications as well, in case anyone needs it.
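(For reference, results like the ones below come from running the binary under Valgrind's cachegrind tool, presumably along the lines of:

    valgrind --tool=cachegrind ./a.out 500

where 500 is the matrix dimension passed to the program.)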

#define SLOWEST
==6933== Cachegrind, a cache and branch-prediction profiler
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==6933== Command: ./a.out 500
==6933== 
--6933-- warning: L3 cache found, using its data for the LL simulation.
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 60.7487 seconds.
==6933== 
==6933== I   refs:      6,039,791,314
==6933== I1  misses:            1,611
==6933== LLi misses:            1,519
==6933== I1  miss rate:          0.00%
==6933== LLi miss rate:          0.00%
==6933== 
==6933== D   refs:      2,892,704,678  (2,763,005,485 rd   + 129,699,193 wr)
==6933== D1  misses:      136,223,560  (  136,174,705 rd   +      48,855 wr)
==6933== LLd misses:           53,675  (        5,247 rd   +      48,428 wr)
==6933== D1  miss rate:           4.7% (          4.9%     +         0.0%  )
==6933== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==6933== 
==6933== LL refs:         136,225,171  (  136,176,316 rd   +      48,855 wr)
==6933== LL misses:            55,194  (        6,766 rd   +      48,428 wr)
==6933== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define SLOWER
==8463== Cachegrind, a cache and branch-prediction profiler
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8463== Command: ./a.out 500
==8463== 
--8463-- warning: L3 cache found, using its data for the LL simulation.
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 49.7397 seconds.
==8463== 
==8463== I   refs:      4,537,213,120
==8463== I1  misses:            1,571
==8463== LLi misses:            1,487
==8463== I1  miss rate:          0.00%
==8463== LLi miss rate:          0.00%
==8463== 
==8463== D   refs:      2,891,485,608  (2,761,862,312 rd   + 129,623,296 wr)
==8463== D1  misses:       59,961,522  (   59,913,256 rd   +      48,266 wr)
==8463== LLd misses:           53,113  (        5,246 rd   +      47,867 wr)
==8463== D1  miss rate:           2.0% (          2.1%     +         0.0%  )
==8463== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==8463== 
==8463== LL refs:          59,963,093  (   59,914,827 rd   +      48,266 wr)
==8463== LL misses:            54,600  (        6,733 rd   +      47,867 wr)
==8463== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define SLOW
==9174== Cachegrind, a cache and branch-prediction profiler
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9174== Command: ./a.out 500
==9174== 
--9174-- warning: L3 cache found, using its data for the LL simulation.
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 35.8901 seconds.
==9174== 
==9174== I   refs:      3,039,713,059
==9174== I1  misses:            1,570
==9174== LLi misses:            1,486
==9174== I1  miss rate:          0.00%
==9174== LLi miss rate:          0.00%
==9174== 
==9174== D   refs:      1,893,235,586  (1,763,112,301 rd   + 130,123,285 wr)
==9174== D1  misses:       63,285,950  (   62,987,684 rd   +     298,266 wr)
==9174== LLd misses:           53,113  (        5,246 rd   +      47,867 wr)
==9174== D1  miss rate:           3.3% (          3.5%     +         0.2%  )
==9174== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==9174== 
==9174== LL refs:          63,287,520  (   62,989,254 rd   +     298,266 wr)
==9174== LL misses:            54,599  (        6,732 rd   +      47,867 wr)
==9174== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define MEDIUM
==7838== Cachegrind, a cache and branch-prediction profiler
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==7838== Command: ./a.out 500
==7838== 
--7838-- warning: L3 cache found, using its data for the LL simulation.
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 23.4097 seconds.
==7838== 
==7838== I   refs:      2,548,967,151
==7838== I1  misses:            1,610
==7838== LLi misses:            1,522
==7838== I1  miss rate:          0.00%
==7838== LLi miss rate:          0.00%
==7838== 
==7838== D   refs:      1,399,237,303  (1,267,363,440 rd   + 131,873,863 wr)
==7838== D1  misses:          592,807  (      293,091 rd   +     299,716 wr)
==7838== LLd misses:           53,147  (        5,248 rd   +      47,899 wr)
==7838== D1  miss rate:           0.0% (          0.0%     +         0.2%  )
==7838== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==7838== 
==7838== LL refs:             594,417  (      294,701 rd   +     299,716 wr)
==7838== LL misses:            54,669  (        6,770 rd   +      47,899 wr)
==7838== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define MEDIUMISH
==8438== Cachegrind, a cache and branch-prediction profiler
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8438== Command: ./a.out 500
==8438== 
--8438-- warning: L3 cache found, using its data for the LL simulation.
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 24.0327 seconds.
==8438== 
==8438== I   refs:      2,550,211,553
==8438== I1  misses:            1,576
==8438== LLi misses:            1,488
==8438== I1  miss rate:          0.00%
==8438== LLi miss rate:          0.00%
==8438== 
==8438== D   refs:      1,400,107,343  (1,267,610,303 rd   + 132,497,040 wr)
==8438== D1  misses:          339,977  (       42,583 rd   +     297,394 wr)
==8438== LLd misses:           53,114  (        5,248 rd   +      47,866 wr)
==8438== D1  miss rate:           0.0% (          0.0%     +         0.2%  )
==8438== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==8438== 
==8438== LL refs:             341,553  (       44,159 rd   +     297,394 wr)
==8438== LL misses:            54,602  (        6,736 rd   +      47,866 wr)
==8438== LL miss rate:            0.0% (          0.0%     +         0.0%  )

Matrix Multiplication Code.

#include <utility>  // for std::swap, used by the MEDIUMISH version

#if defined(SLOWEST)
    void multiply (float **A, float **B, float **out, int size) {
        for (int row=0;row<size;row++)
            for (int col=0;col<size;col++)
                for (int in=0;in<size;in++)
                    out[row][col] += A[row][in] * B[in][col];
    }
// Takes in 1-D arrays, same as before.
#elif defined(SLOWER)
    void multiply (float *A, float *B, float *out, int size) {
        for (int row=0;row<size;row++)
            for (int col=0;col<size;col++)
                for (int in=0;in<size;in++)
                    out[row * size + col] += A[row * size + in] * B[in * size + col];
    }
// Flips first and second loops
#elif defined(SLOW)
    void multiply (float *A, float *B, float *out, int size) {
        for (int col=0;col<size;col++)
            for (int row=0;row<size;row++) {
                float curr = 0;  // accumulate locally so out[row * size + col] isn't touched on every inner iteration
                for (int in=0;in<size;in++)
                    curr += A[row * size + in] * B[in * size + col];
                out[row * size + col] = curr;
            }
    }
#elif defined(MEDIUM)
    // Helper function, kept separate for reuse in the later versions.
    float dotProduct(float *A, float *B, int size) {
        float curr = 0;

        for (int i=0;i<size;i++)
            curr += A[i] * B[i];

        return curr;
    }
    void multiply (float *A, float *B, float *out, int size) {
        float *temp = new float[size];

        for (int col=0;col<size;col++) {
            for (int i=0;i<size;i++)  // stores column into sequential array
                temp[i] = B[i * size + col];
            for (int row=0;row<size;row++)
                out[row * size + col] = dotProduct(&A[row * size], temp, size);  // row of A dotted with the copied column of B
        }

        delete[] temp;
    }
#elif defined(MEDIUMISH)
    float dotProduct(float *A, float *B, int size) {
        float curr = 0;

        for (int i=0;i<size;i++)
            curr += A[i] * B[i];

        return curr;
    }
    void multiply (float *A, float *B, float *out, int size) {
        for (int i=0;i<size-1;i++)
            for (int j=i+1;j<size;j++)
                std::swap(B[i * size + j], B[j * size + i]);

        for (int col=0;col<size;col++)
            for (int row=0;row<size;row++)
                out[row * size + col] = dotProduct(&A[row * size], &B[col * size], size);  // row of A dotted with row `col` of the transposed B
    }
#elif defined(FAST)

#elif defined(FASTER)

#endif

According to the documentation, cachegrind only simulates the first-level and last-level caches:

Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.

However, some modern machines have three or four levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and last-level caches. The reason for this choice is that the last-level cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (eg. traversing a matrix column-wise with the row length being a power of 2).

What that means is that you can't get L2 information; in your case you only get L1 and L3 (the latter reported as LL).
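Cachegrind auto-detected your cache configuration (hence the warnings at the top of each run), but you can also pass the cache parameters explicitly as size,associativity,line-size triples. The values below are placeholders, not your machine's real configuration:

    valgrind --tool=cachegrind \
             --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 \
             ./a.out 500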

The first part of cachegrind's output reports information about the L1 instruction cache (I1). In all of your examples the number of L1 instruction cache misses is insignificant and the miss rate is always 0.00%, which means your program's code fits comfortably in the L1 instruction cache.

The second part of the output reports information about the L1 (D1) and LL (last-level cache, L3 in your case) data caches. The D1 miss rate is the figure to use when comparing which version of your matrix multiplication algorithm is "the most cache efficient".
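For example, the D1 miss rate is simply D1 misses divided by D refs: for SLOWEST that is 136,223,560 / 2,892,704,678 ≈ 4.7%. Pulling that figure out of each run above, next to the reported wall-clock time:

    Version     D1 miss rate   Time (s)
    SLOWEST         4.7%         60.7
    SLOWER          2.0%         49.7
    SLOW            3.3%         35.9
    MEDIUM          0.0%         23.4
    MEDIUMISH       0.0%         24.0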

The final part of cachegrind's output sums up the LL (last-level cache, L3 in your case) figures for both instructions and data: the number of accesses that reach the LL cache, and the percentage of all memory references that it fails to serve, i.e. that fall through to main memory.
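As a worked example (assuming the LL miss rate is measured against total references, i.e. I refs + D refs): for the SLOWEST run that is 55,194 / (6,039,791,314 + 2,892,704,678) ≈ 0.0006%, which is reported as 0.0%. In other words, essentially nothing in these runs falls through to main memory; the differences between your versions show up almost entirely in the D1 cache.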
