Why does the second program perform worse, even though it should have considerably fewer cache misses?
Consider the following programs:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long u64;

int program_1(u64* a, u64* b)
{
    const u64 lim = 50l * 1000l * 1000l;
    // Reads arrays
    u64 sum = 0;
    for (u64 i = 0; i < lim * 100; ++i) {
        sum += a[i % lim];
        sum += b[i % lim];
    }
    printf("%llu\n", sum);
    return 0;
}

int program_2(u64* a, u64* b)
{
    const u64 lim = 50l * 1000l * 1000l;
    // Reads arrays
    u64 sum = 0;
    for (u64 i = 0; i < lim * 100; ++i) {
        sum += a[i % lim];
    }
    for (u64 i = 0; i < lim * 100; ++i) {
        sum += b[i % lim];
    }
    printf("%llu\n", sum);
    return 0;
}
Both programs are identical in what they compute: they fill an array with 1s, then read every element 100 times, adding each to a counter. The only difference is that the first one fuses the two loops, while the second performs two separate passes. Given that the M1 has 64 KB of L1 data cache, my understanding is that the following would happen:
sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += a[1] // CACHE MISS. Load a[0..8192] on L1.
sum += b[1] // CACHE MISS. Load b[0..8192] on L1.
sum += a[2] // CACHE MISS. Load a[0..8192] on L1.
sum += b[2] // CACHE MISS. Load b[0..8192] on L1.
(...)
sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += a[1] // CACHE HIT!
sum += a[2] // CACHE HIT!
sum += a[3] // CACHE HIT!
sum += a[4] // CACHE HIT!
...
sum += a[8192] // CACHE MISS. Load a[8192..16384] on L1.
sum += a[8193] // CACHE HIT!
sum += a[8194] // CACHE HIT!
sum += a[8195] // CACHE HIT!
sum += a[8196] // CACHE HIT!
...
...
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += b[1] // CACHE HIT!
sum += b[2] // CACHE HIT!
sum += b[3] // CACHE HIT!
sum += b[4] // CACHE HIT!
...
This would lead me to believe that the first program is slower, since every read is a cache miss, while the second one consists mostly of cache hits. The results, though, differ: running on a MacBook Pro M1, compiled with clang -O2, the first program takes 2.8 s to complete, while the second one takes about 3.8 s.
What is wrong about my mental model of how the L1 cache works?
I'd expect that:
a) While the CPU is waiting for the data for sum += a[i % lim]; to be fetched into L1, it can also request the data for sum += b[i % lim];. Essentially, Program 1 is waiting on 2 cache misses in parallel, while Program 2 waits on 1 cache miss at a time and could be up to twice as slow.
b) The loop overhead (all the work in for (u64 i = 0; i < lim * 100; ++i)) and the indexing (calculating i % lim) are duplicated in Program 2, causing it to do almost twice as much extra work (work that has nothing to do with cache misses).
c) The compiler is bad at optimising. I'm surprised the same code wasn't generated for both versions. I'm shocked that neither Clang nor GCC managed to auto-vectorize (use SIMD). A hypothetical, idealized, perfect compiler should be able to optimize both versions all the way down to write(STDOUT_FILENO, "10000000000\n", 12); return 0;.
What is wrong about my mental model of how the L1 cache works?
It looks like you thought the cache can only hold one thing at a time. For Program 1 it would be more like:
sum += a[0] // CACHE MISS
sum += b[0] // CACHE MISS
sum += a[1] // CACHE HIT (data still in cache)
sum += b[1] // CACHE HIT (data still in cache)
sum += a[2] // CACHE HIT (data still in cache)
sum += b[2] // CACHE HIT (data still in cache)
sum += a[3] // CACHE HIT (data still in cache)
sum += b[3] // CACHE HIT (data still in cache)
sum += a[4] // CACHE HIT (data still in cache)
sum += b[4] // CACHE HIT (data still in cache)
sum += a[5] // CACHE HIT (data still in cache)
sum += b[5] // CACHE HIT (data still in cache)
sum += a[6] // CACHE HIT (data still in cache)
sum += b[6] // CACHE HIT (data still in cache)
sum += a[7] // CACHE HIT (data still in cache)
sum += b[7] // CACHE HIT (data still in cache)
sum += a[8] // CACHE MISS
sum += b[8] // CACHE MISS
For Program 2 it's probably (see note) the same number of cache misses, in a different order:
sum += a[0] // CACHE MISS
sum += a[1] // CACHE HIT (data still in cache)
sum += a[2] // CACHE HIT (data still in cache)
sum += a[3] // CACHE HIT (data still in cache)
sum += a[4] // CACHE HIT (data still in cache)
sum += a[5] // CACHE HIT (data still in cache)
sum += a[6] // CACHE HIT (data still in cache)
sum += a[7] // CACHE HIT (data still in cache)
sum += a[8] // CACHE MISS
...then:
sum += b[0] // CACHE MISS
sum += b[1] // CACHE HIT (data still in cache)
sum += b[2] // CACHE HIT (data still in cache)
sum += b[3] // CACHE HIT (data still in cache)
sum += b[4] // CACHE HIT (data still in cache)
sum += b[5] // CACHE HIT (data still in cache)
sum += b[6] // CACHE HIT (data still in cache)
sum += b[7] // CACHE HIT (data still in cache)
sum += b[8] // CACHE MISS
NOTE: I assumed each array is larger than the cache. If the cache were large enough to hold one entire array but too small to hold both arrays, then Program 2 would probably be faster than Program 1. That is the only case where Program 2 would be faster.