Why does the second program perform worse, even though it should have considerably fewer cache misses?
Consider the following programs:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long u64;

int program_1(u64* a, u64* b)
{
    const u64 lim = 50l * 1000l * 1000l;
    // Reads arrays
    u64 sum = 0;
    for (u64 i = 0; i < lim * 100; ++i) {
        sum += a[i % lim];
        sum += b[i % lim];
    }
    printf("%llu\n", sum);
    return 0;
}

int program_2(u64* a, u64* b)
{
    const u64 lim = 50l * 1000l * 1000l;
    // Reads arrays
    u64 sum = 0;
    for (u64 i = 0; i < lim * 100; ++i) {
        sum += a[i % lim];
    }
    for (u64 i = 0; i < lim * 100; ++i) {
        sum += b[i % lim];
    }
    printf("%llu\n", sum);
    return 0;
}
Both programs are identical in what they compute: they fill an array with 1s, then read every element 100 times, adding each to a counter. The only difference is that the first one fuses the two loops, while the second performs two separate passes. Given that the M1 has 64 KB of L1 data cache, my understanding is that the following would happen:
sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += a[1] // CACHE MISS. Load a[0..8192] on L1.
sum += b[1] // CACHE MISS. Load b[0..8192] on L1.
sum += a[2] // CACHE MISS. Load a[0..8192] on L1.
sum += b[2] // CACHE MISS. Load b[0..8192] on L1.
(...)
sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += a[1] // CACHE HIT!
sum += a[2] // CACHE HIT!
sum += a[3] // CACHE HIT!
sum += a[4] // CACHE HIT!
...
sum += a[8192] // CACHE MISS. Load a[8192..16384] on L1.
sum += a[8193] // CACHE HIT!
sum += a[8194] // CACHE HIT!
sum += a[8195] // CACHE HIT!
sum += a[8196] // CACHE HIT!
...
...
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += b[1] // CACHE HIT!
sum += b[2] // CACHE HIT!
sum += b[3] // CACHE HIT!
sum += b[4] // CACHE HIT!
...
This would lead me to believe that the first program is slower, since every read is a cache miss, while the second one consists mostly of cache hits. The results, though, differ: running on a MacBook Pro M1, compiled with clang -O2, the first program takes 2.8 s to complete, while the second one takes about 3.8 s.
What is wrong about my mental model of how the L1 cache works?
I'd expect that:
a) While the CPU is waiting for the data for sum += a[i % lim]; to be fetched into L1, it can also request the data for sum += b[i % lim];. Essentially, Program 1 is waiting on 2 cache misses in parallel, while Program 2 waits on 1 cache miss at a time and could be up to twice as slow.
b) The loop overhead (all the work in for (u64 i = 0; i < lim * 100; ++i)) and the indexing (calculating i % lim) are duplicated in Program 2, causing it to do almost twice as much extra work (work that has nothing to do with cache misses).
c) The compiler is bad at optimising. I'm surprised the same code wasn't generated for both versions. I'm shocked that neither Clang nor GCC managed to auto-vectorize (use SIMD). A hypothetical, idealized, perfect compiler should be able to optimize both versions all the way down to write(STDOUT_FILENO, "10000000000\n", 12); return 0;.
What is wrong about my mental model of how the L1 cache works?
It looks like you thought the cache can only hold one thing at a time. For Program 1 it would be more like:
sum += a[0] // CACHE MISS
sum += b[0] // CACHE MISS
sum += a[1] // CACHE HIT (data still in cache)
sum += b[1] // CACHE HIT (data still in cache)
sum += a[2] // CACHE HIT (data still in cache)
sum += b[2] // CACHE HIT (data still in cache)
sum += a[3] // CACHE HIT (data still in cache)
sum += b[3] // CACHE HIT (data still in cache)
sum += a[4] // CACHE HIT (data still in cache)
sum += b[4] // CACHE HIT (data still in cache)
sum += a[5] // CACHE HIT (data still in cache)
sum += b[5] // CACHE HIT (data still in cache)
sum += a[6] // CACHE HIT (data still in cache)
sum += b[6] // CACHE HIT (data still in cache)
sum += a[7] // CACHE HIT (data still in cache)
sum += b[7] // CACHE HIT (data still in cache)
sum += a[8] // CACHE MISS
sum += b[8] // CACHE MISS
For Program 2 it's probably (see note) the same number of cache misses, in a different order:
sum += a[0] // CACHE MISS
sum += a[1] // CACHE HIT (data still in cache)
sum += a[2] // CACHE HIT (data still in cache)
sum += a[3] // CACHE HIT (data still in cache)
sum += a[4] // CACHE HIT (data still in cache)
sum += a[5] // CACHE HIT (data still in cache)
sum += a[6] // CACHE HIT (data still in cache)
sum += a[7] // CACHE HIT (data still in cache)
sum += a[8] // CACHE MISS
...then:
sum += b[0] // CACHE MISS
sum += b[1] // CACHE HIT (data still in cache)
sum += b[2] // CACHE HIT (data still in cache)
sum += b[3] // CACHE HIT (data still in cache)
sum += b[4] // CACHE HIT (data still in cache)
sum += b[5] // CACHE HIT (data still in cache)
sum += b[6] // CACHE HIT (data still in cache)
sum += b[7] // CACHE HIT (data still in cache)
sum += b[8] // CACHE MISS
NOTE: I assumed each array is larger than the cache. If the cache were large enough to hold one entire array but too small to hold both arrays, then Program 2 would probably be faster than Program 1. That is the only case where Program 2 would be faster.