Cache miss stress test: stunning results.. any explanations?

Question

To measure how a modern computer's performance depends on cache misses (that is, on how 'spread out' the data is in memory), I ran a simple test: I allocate 1000 MB of RAM and perform a constant number of reads, repeating the test with increasing byte offsets between reads. When I reach the end of the 1000 MB buffer, I wrap around to the start.

I'm quite surprised by the results. There seems to be a cost barrier around 32 bytes and another around 32 KB. I guess this has something to do with L1/L2/L3 cache loads, or the virtual memory page size? What stunned me most is that only about 16 completely different memory locations seem to be cached. That's very low! Any explanations (OS, hardware)?

Here are the results on 3 different computers, from the fastest to the cheapest, followed by my simple test code (it uses only standard libs).

16 GB RAM fast HP workstation (test in 32-bit Windows):

time=0.364000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.231000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.339000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=0.567000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=1.177000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=1.806000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=2.293000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=2.464000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=2.621000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=2.775000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=2.908000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=3.007000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=3.183000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=3.758000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=4.287000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=6.366000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=6.124000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=5.295000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=5.540000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=5.844000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=5.785000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=5.714000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=5.825000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=5.759000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=2.222000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.471000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.377000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.166000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

4 GB RAM MacBookPro 2010 (test in 32-bit Windows):

time=0.476000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.357000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.634000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=1.173000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=2.360000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=3.469000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=3.990000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=3.549000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=3.758000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=3.867000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=4.275000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=4.310000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=4.584000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=5.144000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=6.100000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=8.111000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=6.256000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=6.311000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=6.416000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=6.635000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=6.530000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=6.544000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=6.545000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=5.272000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=1.524000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.538000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.508000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.817000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

4 GB RAM cheap Acer "family computer":

time=0.732000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.549000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.765000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=1.196000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=2.318000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=2.483000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=2.760000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=3.194000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=3.369000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=3.720000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=4.842000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=5.797000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=9.865000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=19.273000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=19.282000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=19.606000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=20.242000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=20.956000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=22.627000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=24.336000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=24.403000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=23.060000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=20.553000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=14.460000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=1.752000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.963000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.687000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.453000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MEMBLOCSIZE ((2<<20)*500)//1000MB

int readMemory( int* data, int* dataEnd, int numReads, int incrementPerRead ) {
  int accum = 0;
  int* ptr = data;

  while(true) {
    accum += *ptr;
    if( numReads-- == 0)
      return accum;

    ptr += incrementPerRead;

    if( ptr >= dataEnd )
      ptr = data;
  }
}

int main()
{
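  //Note: the buffer is deliberately not initialized before it is read
  int* data = (int*)malloc(MEMBLOCSIZE);
  if( data == NULL ) {
    fprintf(stderr, "malloc of %d bytes failed\n", MEMBLOCSIZE);
    return 1;
  }
  int* dataEnd = data+(MEMBLOCSIZE / sizeof(int));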

  int numReads = (MEMBLOCSIZE / sizeof(int));
  int dummyTotal = 0;
  int increment = 1;
  for( int loop = 0; loop < 28; ++loop ) {
    int startTime = clock();

    dummyTotal += readMemory(data, dataEnd, numReads, increment);

    int endTime = clock();
    double deltaTime = double(endTime-startTime)/double(CLOCKS_PER_SEC);

    //sizeof yields size_t, so cast to int to match the %d conversions
    printf("time=%f byteIncrement=%d numReadLocations=%d numReads=%d\n",
      deltaTime, (int)(increment*sizeof(int)), (int)(MEMBLOCSIZE/(increment*sizeof(int))), numReads);

    increment *= 2;
  }
  //Use dummyTotal: make sure the optimizer is not removing my code...
  return dummyTotal == 666 ? 1: 0;
}
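
One detail worth checking when reading the tables: the numReadLocations column is computed with truncating division, while the read loop itself starts at offset 0 and keeps stepping until it passes the end of the buffer. A minimal sketch of the count the loop actually visits follows; the function name `visited_locations` is made up for this illustration and is not part of the original test.

```cpp
#include <cassert>
#include <cstddef>

// Number of distinct addresses the read loop visits before it wraps.
// The pointer starts at offset 0 and advances by stride_bytes until it
// reaches the end of the buffer, so the count is a ceiling division.
// (Hypothetical helper, not part of the original test.)
std::size_t visited_locations(std::size_t buffer_bytes, std::size_t stride_bytes) {
    assert(stride_bytes > 0);
    return (buffer_bytes + stride_bytes - 1) / stride_bytes;
}
```

For the 1048576000-byte buffer and a stride of 67108864 bytes this gives 16, one more than the 15 printed in the table, so the rows where timings drop sharply correspond to 16 or fewer visited locations.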

Based on some comments, I modified my test to use only 250 MB of RAM and to do 16 consecutive reads for each 'read', in case that activates prefetching. The results are similar, except that the last tests, the ones reading a few distant locations, perform better (2 seconds instead of 5), probably because prefetching was not activated in the initial test.

#define MEMBLOCSIZE 262144000//250MB

int readMemory( int* data, int* dataEnd, int numReads, int incrementPerRead ) {
  int accum = 0;
  int* ptr = data;

  while(true) {
    accum += *ptr;
    if( numReads-- == 0)
      return accum;

    //Do 16 consecutive reads
    for( int i = 1; i < 17; ++i )
      accum += *(ptr+i);

    ptr += incrementPerRead;

    //Wrap early enough that the reads up to ptr+16 stay inside the buffer
    if( ptr >= dataEnd-16 )
      ptr = data;
  }
}

Results of this updated test on the MacBookPro 2010:

time=0.691000 byteIncrement=4 numReadLocations=65536000 numReads=65536000
time=0.620000 byteIncrement=8 numReadLocations=32768000 numReads=65536000
time=0.715000 byteIncrement=16 numReadLocations=16384000 numReads=65536000
time=0.827000 byteIncrement=32 numReadLocations=8192000 numReads=65536000
time=0.917000 byteIncrement=64 numReadLocations=4096000 numReads=65536000
time=1.440000 byteIncrement=128 numReadLocations=2048000 numReads=65536000
time=2.646000 byteIncrement=256 numReadLocations=1024000 numReads=65536000
time=3.720000 byteIncrement=512 numReadLocations=512000 numReads=65536000
time=3.854000 byteIncrement=1024 numReadLocations=256000 numReads=65536000
time=4.673000 byteIncrement=2048 numReadLocations=128000 numReads=65536000
time=4.729000 byteIncrement=4096 numReadLocations=64000 numReads=65536000
time=4.784000 byteIncrement=8192 numReadLocations=32000 numReads=65536000
time=5.021000 byteIncrement=16384 numReadLocations=16000 numReads=65536000
time=5.022000 byteIncrement=32768 numReadLocations=8000 numReads=65536000
time=4.871000 byteIncrement=65536 numReadLocations=4000 numReads=65536000
time=5.163000 byteIncrement=131072 numReadLocations=2000 numReads=65536000
time=5.276000 byteIncrement=262144 numReadLocations=1000 numReads=65536000
time=4.699000 byteIncrement=524288 numReadLocations=500 numReads=65536000
time=1.997000 byteIncrement=1048576 numReadLocations=250 numReads=65536000
time=2.118000 byteIncrement=2097152 numReadLocations=125 numReads=65536000
time=2.071000 byteIncrement=4194304 numReadLocations=62 numReads=65536000
time=2.036000 byteIncrement=8388608 numReadLocations=31 numReads=65536000
time=1.923000 byteIncrement=16777216 numReadLocations=15 numReads=65536000
time=1.084000 byteIncrement=33554432 numReadLocations=7 numReads=65536000
time=0.607000 byteIncrement=67108864 numReadLocations=3 numReads=65536000
time=0.622000 byteIncrement=134217728 numReadLocations=1 numReads=65536000

Answer 1

Note that most of what follows, like any conclusions you drew, is speculative. Memory benchmarking is extremely complex, and a relatively naive benchmark like yours rarely gives much definite information about the performance of a real program.

The primary "cost barrier", as you call it, at 32 kiB is probably more at 64 kiB (or a combination of both). Since you do not initialize the memory, Windows pulls in zero pages as you read them. The allocation granularity is 64 kiB, and pages are always "readied" (and prefetched if you memory map) in that size, even if only one of the pages in the 64 kiB range is moved into your working set. This is something I found out while experimenting with memory mapping.
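
One way to separate this page-fault cost from cache effects is to write to every page once before the timed loop starts, so that each page is already backed by real memory. A minimal sketch, assuming 4 kiB pages; the helper name `touch_pages` and the default page size are assumptions for illustration, not something taken from the original test:

```cpp
#include <cassert>
#include <cstddef>

// Write one byte in every page so the OS has to back each page with real
// memory before any timing begins. page_bytes = 4096 is an assumption;
// the actual page size depends on the OS and hardware.
// Returns the number of pages touched. (Hypothetical helper.)
std::size_t touch_pages(volatile unsigned char* buffer, std::size_t buffer_bytes,
                        std::size_t page_bytes = 4096) {
    assert(page_bytes > 0);
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < buffer_bytes; offset += page_bytes) {
        buffer[offset] = 1;  // the write forces the page into the working set
        ++touched;
    }
    return touched;
}
```

In the question's code this could be called as `touch_pages((unsigned char*)data, MEMBLOCSIZE)` right after the allocation; a plain `memset` over the whole buffer should have a similar effect. A 64 kiB range covers 16 such pages, which matches the 16 pages mentioned in this answer.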

The process working set that Windows assigns by default is ridiculously small, so as you iterate over that memory block you will constantly incur page faults. Some are less expensive, only changing a flag in the page descriptor; others (at 64 kiB) are more expensive, pulling 16 new pages from the zero pool (or, in the worst case, if the pool is empty, zeroing pages). This may very well explain one of the "cost barriers" you see.

Another cost barrier is, as you correctly noticed, cache associativity. Different addresses at larger power-of-two offsets use the same cache entries. Given "unhealthy" offsets, one can cause the same cache lines to be evicted over and over again. This is one of the two primary reasons why alignment is good but excessive over-alignment is bad (the other being lack of data locality).
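
The associativity effect can be illustrated with a small model that maps addresses to cache sets. This is only a sketch under assumed parameters: 64-byte lines and 64 sets (roughly a 32 kiB, 8-way cache), with the set index taken directly from the address. Real caches may index differently, and `distinct_sets` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count how many distinct cache sets a strided walk lands in.
// line_bytes = 64 and num_sets = 64 are assumed values (for example a
// 32 kiB, 8-way cache); real hardware may differ. (Hypothetical helper.)
std::size_t distinct_sets(std::size_t stride_bytes, std::size_t num_accesses,
                          std::size_t line_bytes = 64, std::size_t num_sets = 64) {
    assert(line_bytes > 0 && num_sets > 0);
    std::set<std::size_t> sets;
    for (std::size_t i = 0; i < num_accesses; ++i) {
        std::size_t address = i * stride_bytes;
        sets.insert((address / line_bytes) % num_sets);  // set index of this address
    }
    return sets.size();
}
```

Under these assumptions a 64-byte stride spreads accesses over all 64 sets, a 128-byte stride over 32, and any stride that is a multiple of 4096 bytes lands in a single set, so only as many lines as the cache has ways can stay resident at once.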

The cost barrier at 32 bytes is surprising; if anything, one would expect it at 64 bytes (crossing cache lines on your test architecture). Prefetching should for the most part eliminate this kind of stall, but prefetching is usually only activated (if you do not explicitly hint it) after the second cache line miss with a given stride.
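
To see where cache-line crossings enter the picture, it helps to count how many consecutive reads fall into one line for a given stride. The sketch below assumes the 64-byte line size mentioned above and strides that divide it evenly; `reads_per_line` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Number of consecutive strided reads that fall into the same cache line.
// line_bytes = 64 is an assumption; the result is exact only for strides
// that divide the line size evenly. (Hypothetical helper.)
std::size_t reads_per_line(std::size_t stride_bytes, std::size_t line_bytes = 64) {
    assert(stride_bytes > 0);
    if (stride_bytes >= line_bytes) {
        return 1;  // every read starts a new line
    }
    return line_bytes / stride_bytes;
}
```

With a 4-byte stride 16 reads share a line, with a 32-byte stride only 2 do, and from 64 bytes on every read touches a new line. Under that assumption the share of reads that need a fresh line already grows below 64 bytes, though this alone does not settle which effect dominates in the measurements.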

This is perfectly acceptable for "real" programs, which either read just one location and then another, or iterate over bulks of data sequentially. It may, on the other hand, easily give confusing results in artificial measurements. It may also explain why you see a cost barrier at 32 kiB: if prefetching doesn't work, that is the point where you run out of L1 cache on a typical x86.
