
Why do cache misses happen more often when more data is prefetched on ARM?

I'm using OProfile to profile the following function on a Raspberry Pi 3B+. (I'm using gcc version 10.2 on the Raspberry Pi itself, not cross-compiling, with the compiler flags -O1 -mfpu=neon -mneon-for-64bits. The generated assembly code is included at the end.)

void do_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
  for (int i = 0; i < array_size; i++)
  {

    uint32_t tmp1 = b[i];
    uint32_t tmp2 = a[i];
    c[i] = tmp1 * tmp2;
  }
}
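
For reference, here is a minimal driver of the kind I would use to run the function under OProfile. The array size matches the measurements below; the buffer setup, the initialization values and the separate main() are my own assumptions, not part of the original test setup:

#include <stdint.h>
#include <stdlib.h>

/* Defined in the translation unit shown above; link the two files together. */
void do_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size);

int main(void)
{
  const size_t n = 16777216;  /* same array size as in the tables below */
  uint32_t* a = malloc(n * sizeof(uint32_t));
  uint32_t* b = malloc(n * sizeof(uint32_t));
  uint32_t* c = malloc(n * sizeof(uint32_t));
  if (!a || !b || !c)
    return 1;

  /* Touch the inputs once so the pages are actually mapped before profiling. */
  for (size_t i = 0; i < n; i++)
  {
    a[i] = (uint32_t)i;
    b[i] = (uint32_t)(3 * i + 1);
  }

  do_stuff_u32(a, b, c, n);

  free(a);
  free(b);
  free(c);
  return 0;
}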

I'm looking at two CPU events: L1D_CACHE_REFILL and PREFETCH_LINEFILL. According to the documentation, PREFETCH_LINEFILL counts the number of cache line fills caused by the prefetcher, and L1D_CACHE_REFILL counts the number of cache line refills caused by cache misses. I got the following results for the above loop:

array_size    array_size / L1D_CACHE_REFILL    array_size / PREFETCH_LINEFILL
16777216      18.24                            8.366

I would imagine the above loop is memory bound, which is somehow confirmed by the value 8.366: every loop iteration needs 3 x uint32_t, which is 12 B, so 8.366 iterations need ~100 B of data from memory. But the prefetcher can only fill 1 cache line into L1 every 8.366 iterations, and a cache line is 64 B according to the Cortex-A53 manual. So the rest of the cache accesses contribute to cache misses, which is the 18.24. If you combine these two numbers, you get ~5.7, meaning 1 cache line is filled, either by a prefetch or by a cache-miss refill, every 5.7 iterations. And 5.7 iterations need 5.7 x 3 x 4 = 68 B, more or less consistent with the cache line size.
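
Spelling that arithmetic out (this is just a sketch that reproduces the numbers from the paragraph above; the only inputs are the two measured ratios from the table):

#include <stdio.h>

int main(void)
{
  const double bytes_per_iter   = 3.0 * 4.0;  /* a[i], b[i], c[i]: 3 x uint32_t = 12 B */
  const double cache_line_bytes = 64.0;       /* L1 line size per the Cortex-A53 manual */

  const double iters_per_miss_refill = 18.24;  /* array_size / L1D_CACHE_REFILL  */
  const double iters_per_prefetch    = 8.366;  /* array_size / PREFETCH_LINEFILL */

  /* Combine the two sources of line fills into one rate. */
  const double fills_per_iter = 1.0 / iters_per_miss_refill + 1.0 / iters_per_prefetch;
  const double iters_per_fill = 1.0 / fills_per_iter;               /* ~5.7 iterations */
  const double bytes_per_fill = iters_per_fill * bytes_per_iter;    /* ~68 B per fill  */

  printf("one cache line filled every %.2f iterations\n", iters_per_fill);
  printf("data consumed per line fill: %.1f B (line size: %.0f B)\n",
         bytes_per_fill, cache_line_bytes);
  return 0;
}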

Then I added more stuff to the loop, which becomes the following:

void do_more_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
  for (int i = 0; i < array_size; i++)
  {

    uint32_t tmp1 = b[i];
    uint32_t tmp2 = a[i];
    tmp1 = tmp1 * 17;
    tmp1 = tmp1 + 59;
    tmp1 = tmp1 /2;
    tmp2 = tmp2 *27;
    tmp2 = tmp2 + 41;
    tmp2 = tmp2 /11;
    tmp2 = tmp2 + tmp2;
    c[i] = tmp1 * tmp2;
  }
}

And the profiling data for the CPU events is something I don't understand:

array_size    array_size / L1D_CACHE_REFILL    array_size / PREFETCH_LINEFILL
16777216      11.24                            7.034

Since the loop takes longer to execute, the prefetcher now needs only 7.034 loop iterations to fill 1 cache line. But what I don't understand is why cache misses also happen more frequently, as reflected by the number 11.24 compared to 18.24 before. Can someone please shed some light on how all of this fits together?


Update to include the generated assembly更新以包含生成的程序集

Loop 1:

    cbz x3, .L178          // skip the loop when array_size == 0
    lsl x6, x3, 2          // x6 = array_size * 4, the total byte count
    mov x3, 0              // x3 = byte offset into the arrays
.L180:
    ldr w4, [x1, x3]       // w4 = b[i]
    ldr w5, [x0, x3]       // w5 = a[i]
    mul w4, w4, w5         // w4 = b[i] * a[i]
    lsl w4, w4, 1          // w4 <<= 1
    str w4, [x2, x3]       // c[i] = w4
    add x3, x3, 4          // next element (4-byte stride)
    cmp x3, x6
    bne .L180              // loop until the end of the arrays
.L178:

Loop 2:

    cbz x3, .L178          // skip the loop when array_size == 0
    lsl x6, x3, 2          // x6 = array_size * 4, the total byte count
    mov x5, 0              // x5 = byte offset into the arrays
    mov w8, 27             // constant for tmp2 * 27
    mov w7, 35747          // w7 = 0xba2e8ba3, the fixed-point reciprocal
    movk    w7, 0xba2e, lsl 16  // used for the unsigned division by 11
.L180:
    ldr w3, [x1, x5]       // w3 = b[i]  (tmp1)
    ldr w4, [x0, x5]       // w4 = a[i]  (tmp2)
    add w3, w3, w3, lsl 4  // tmp1 *= 17  (tmp1 + 16*tmp1)
    add w3, w3, 59         // tmp1 += 59
    mul w4, w4, w8         // tmp2 *= 27
    add w4, w4, 41         // tmp2 += 41
    lsr w3, w3, 1          // tmp1 /= 2
    umull   x4, w4, w7     // tmp2 /= 11, computed as
    lsr x4, x4, 35         //   (tmp2 * 0xba2e8ba3) >> 35
    mul w3, w3, w4         // tmp1 * tmp2
    lsl w3, w3, 1          // * 2 (the tmp2 + tmp2 folded into the product)
    str w3, [x2, x5]       // c[i] = result
    add x5, x5, 4          // next element (4-byte stride)
    cmp x5, x6
    bne .L180              // loop until the end of the arrays
.L178:
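
A side note on the constants in Loop 2: w7 ends up holding 0xba2e8ba3 (35747 is 0x8ba3, and the movk places 0xba2e in the upper half), and the umull / lsr-by-35 pair is how the compiler implements the unsigned division by 11 from the C source, via a fixed-point reciprocal. A small verification snippet of my own (not part of the original post):

#include <stdint.h>
#include <stdio.h>

/* What the compiler emitted: a 32x32 -> 64-bit multiply followed by a shift. */
static uint32_t div11_by_reciprocal(uint32_t x)
{
  return (uint32_t)(((uint64_t)x * 0xba2e8ba3u) >> 35);
}

int main(void)
{
  /* Spot-check a few values, including the edges of the 32-bit range. */
  const uint32_t tests[] = { 0u, 1u, 10u, 11u, 12u, 59u, 123456789u, 0xfffffffeu, 0xffffffffu };
  for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
  {
    printf("%10u / 11 = %10u, reciprocal trick gives %10u\n",
           tests[i], tests[i] / 11u, div11_by_reciprocal(tests[i]));
  }
  return 0;
}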

I'll try to answer my own question based on more measurements and the discussion with @artlessnoise.

I further measured the READ_ALLOC_ENTER event for the above 2 loops and got the following data:

Loop 1

Array Size    READ_ALLOC_ENTER
16777216      12494

Loop 2

Array Size    READ_ALLOC_ENTER
16777216      1933

So apparently the small loop (the 1st) enters read-allocate mode a lot more often than the big one (the 2nd), which could be because the CPU was able to detect the consecutive write pattern more easily. In read-allocate mode, the stores go directly to L2 (if they don't hit in L1). That's why L1D_CACHE_REFILL is lower for the 1st loop: it involves L1 less. For the 2nd loop, since it has to involve L1 to update c[] more often than the 1st one, the refills due to cache misses can be higher. Moreover, in the second case, since L1 is more often occupied by cache lines for c[], the cache hit rates for a[] and b[] suffer, hence more L1D_CACHE_REFILL.
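
One way to sanity-check this explanation would be to take the store stream out of the picture entirely. The variant below is my own hypothetical follow-up experiment, not something from the original discussion: if the extra L1D_CACHE_REFILLs in the second loop really come from c[] competing for L1, then read-only versions of the two loops (same two load streams, no store stream) should produce much closer refill numbers.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical read-only variant: same a[]/b[] load streams, no c[] store stream. */
uint32_t do_stuff_u32_nostore(const uint32_t* a, const uint32_t* b, size_t array_size)
{
  uint32_t acc = 0;
  for (size_t i = 0; i < array_size; i++)
  {
    acc += a[i] * b[i];  /* accumulate instead of storing, so L1 only sees the loads */
  }
  return acc;
}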

