Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake

I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 bytes and one that's always to the same location 1, like this:

volatile uint32_t value;

void weirdo_cpp(size_t iters, uint32_t* output) {

    uint32_t x = value;
    uint32_t          *rdx = output;
    volatile uint32_t *rsi = output;
    do {
        *rdx    = x;
        *rsi = x;

        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}

In assembly this loop probably 3 looks like:

weirdo_cpp:

...

align 16
.top:
    mov    [rdx], eax  ; stride 16
    mov    [rsi], eax  ; never changes

    add    rdx, 16

    dec    rdi
    jne    .top

    ret

When the memory region accessed is in L2 I would expect this to run at less than 3 cycles per iteration. The second store just keeps hitting the same location and should add about a cycle. The first store implies bringing in a line from L2 and hence also evicting a line once every 4 iterations. I'm not sure how you evaluate the L2 cost, but even if you conservatively estimate that the L1 can only do one of the following every cycle: (a) commit a store, (b) receive a line from L2, or (c) evict a line to L2, you'd get something like 1 + 0.25 + 0.25 = 1.5 cycles for the stride-16 store stream.

Indeed, if you comment out one store you get ~1.25 cycles per iteration for the first store alone, and ~1.01 cycles per iteration for the second store alone, so 2.5 cycles per iteration seems like a conservative estimate for the combined loop.
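For concreteness, the two single-store variants referred to here are just weirdo_cpp with one of the stores removed; a sketch (the function names are mine):

// Only the 16-byte-stride store: ~1.25 cycles/iteration.
void strided_store_only(size_t iters, uint32_t* output) {
    uint32_t x = value;        // 'value' is the volatile global from above
    uint32_t *rdx = output;
    do {
        *rdx = x;
        rdx += 4;              // 16 byte stride
    } while (--iters > 0);
}

// Only the same-location store: ~1.01 cycles/iteration.
void fixed_store_only(size_t iters, uint32_t* output) {
    uint32_t x = value;
    volatile uint32_t *rsi = output;
    do {
        *rsi = x;              // never changes location
    } while (--iters > 0);
}

As with the original, these are meant to be compiled at gcc -O1 or similar so the loops keep this simple scalar form.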

The actual performance is very odd, however. Here's a typical run of the test harness:

Estimated CPU speed:  2.60 GHz
output size     :   64 KiB
output alignment:   32
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.89 cycles/iter,  1.49 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 4.73 cycles/iter,  1.81 ns/iter, cpu before: 0, cpu after: 0
 7.33 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.33 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.34 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.26 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.31 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.29 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.29 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.27 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.30 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.30 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0

Two things are weird here.

First are the bimodal timings: there is a fast mode and a slow mode. We start out in slow mode taking about 7.3 cycles per iteration, and at some point transition to about 3.9 cycles per iteration. This behavior is consistent and reproducible, and the timings are always tightly clustered around the two values. The transition shows up in both directions, from slow mode to fast mode and the other way around (and sometimes multiple transitions in one run).

The other weird thing is the really bad performance. Even in fast mode, at about 3.9 cycles the performance is much worse than the 1.0 + 1.3 = 2.3 cycle worst case you'd expect from adding together each of the single-store cases (and assuming that absolutely zero work can be overlapped when both stores are in the loop). In slow mode, performance is terrible compared to what you'd expect based on first principles: it is taking 7.3 cycles to do 2 stores, and if you put it in L2 store bandwidth terms, that's roughly 29 cycles per L2 store (since we only store one full cache line every 4 iterations).

Skylake is recorded as having a 64B/cycle throughput between L1 and L2, which is way higher than the observed throughput here (about 2 bytes/cycle in slow mode).

What explains the poor throughput and the bimodal performance, and can I avoid it?

I'm also curious if this reproduces on other architectures and even on other Skylake boxes. Feel free to include local results in the comments.

You can find the test code and harness on github. There is a Makefile for Linux or Unix-like platforms, but it should be relatively easy to build on Windows too. If you want to run the asm variant you'll need nasm or yasm for the assembly 4 - if you don't have that you can just try the C++ version.
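If you just want a quick and dirty reproduction without the full harness, a minimal driver along these lines should work (this is a sketch, not the github harness; it assumes TurboBoost is off, as in the measurements above, so that TSC ticks approximate core cycles):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>

void weirdo_cpp(size_t iters, uint32_t* output);   // the function shown above

int main() {
    const size_t buf_bytes = 64 * 1024;       // 64 KiB output buffer, fits in L2
    const size_t iters     = buf_bytes / 16;  // the loop advances 16 bytes per iteration
    const int    reps      = 10000;           // repeat enough to get a stable sample
    std::vector<uint32_t> out(buf_bytes / sizeof(uint32_t));
    for (int sample = 0; sample < 20; sample++) {
        uint64_t t0 = __rdtsc();
        for (int r = 0; r < reps; r++)
            weirdo_cpp(iters, out.data());
        uint64_t t1 = __rdtsc();
        // With TurboBoost disabled the TSC ticks at roughly the core clock,
        // so TSC ticks per iteration approximates cycles per iteration.
        std::printf("%.2f cycles/iter\n",
                    double(t1 - t0) / (double(iters) * reps));
    }
}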

Eliminated Possibilities

Here are some possibilities that I considered and largely eliminated. Many of the possibilities are eliminated by the simple fact that you can see the performance transition randomly in the middle of the benchmarking loop, when many things simply haven't changed (e.g., if it was related to the output array alignment, it couldn't change in the middle of a run since the same buffer is used the entire time). I'll refer to this as the default elimination below (even for things covered by the default elimination there is often another argument to be made).

  • Alignment factors: the output array is 16-byte aligned, and I've tried up to 2MB alignment without change. Also eliminated by the default elimination.
  • Contention with other processes on the machine: the effect is observed more or less identically on an idle machine and even on a heavily loaded one (e.g., using stress -vm 4). The benchmark itself should be completely core-local anyways since it fits in L2, and perf confirms there are very few L2 misses per iteration (about 1 miss every 300-400 iterations, probably related to the printf code).
  • TurboBoost: TurboBoost is completely disabled, confirmed by three different MHz readings.
  • Power-saving stuff: The CPU governor is intel_pstate in performance mode. No frequency variations are observed during the test (the CPU stays essentially locked at 2.59 GHz).
  • TLB effects: The effect is present even when the output buffer is located in a 2 MB huge page. In any case, the 64 4k TLB entries more than cover the 128K output buffer. perf doesn't report any particularly weird TLB behavior.
  • 4k aliasing: older, more complex versions of this benchmark did show some 4k aliasing, but this has been eliminated since there are no loads in the benchmark (it's loads that might incorrectly alias earlier stores). Also eliminated by the default elimination.
  • L2 associativity conflicts: eliminated by the default elimination and by the fact that this doesn't go away even with 2MB pages, where we can be sure the output buffer is laid out linearly in physical memory.
  • Hyperthreading effects: HT is disabled.
  • Prefetching: Only two of the prefetchers could be involved here (the "DCU", aka L1<->L2 prefetchers), since all the data lives in L1 or L2, but the performance is the same with all prefetchers enabled or all disabled (one way to toggle them is sketched just after this list).
  • Interrupts: no correlation between interrupt count and slow mode. There is a limited number of total interrupts, mostly clock ticks.
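Regarding the prefetcher point above: one common way to toggle the four hardware prefetchers on Linux is to write MSR 0x1A4 on each core (bits 0-3 disable the L2 streamer, L2 adjacent-line, DCU streamer and DCU IP prefetchers respectively). The following is just a sketch of that approach, not necessarily how these runs toggled them - wrmsr from msr-tools does the same thing. It needs root and the msr kernel module:

#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Enable or disable all four HW prefetchers on one core via MSR 0x1A4.
// A cleared bit means the corresponding prefetcher is enabled.
bool set_prefetchers(int cpu, bool enable) {
    char path[64];
    std::snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0) return false;                  // requires root and 'modprobe msr'
    uint64_t val;
    bool ok = pread(fd, &val, sizeof(val), 0x1A4) == (ssize_t)sizeof(val);
    if (ok) {
        if (enable) val &= ~UINT64_C(0xF);     // clear bits 0-3: all prefetchers on
        else        val |=  UINT64_C(0xF);     // set bits 0-3: all prefetchers off
        ok = pwrite(fd, &val, sizeof(val), 0x1A4) == (ssize_t)sizeof(val);
    }
    close(fd);
    return ok;
}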

toplev.py

I used toplev.py, which implements Intel's Top Down analysis method, and to no surprise it identifies the benchmark as store bound:

BE             Backend_Bound:                                                      82.11 % Slots      [  4.83%]
BE/Mem         Backend_Bound.Memory_Bound:                                         59.64 % Slots      [  4.83%]
BE/Core        Backend_Bound.Core_Bound:                                           22.47 % Slots      [  4.83%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound:                                 0.03 % Stalls     [  4.92%]
    This metric estimates how often the CPU was stalled without
    loads missing the L1 data cache...
    Sampling events:  mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound:                             74.91 % Stalls     [  4.96%] <==
    This metric estimates how often CPU was stalled  due to
    store memory accesses...
    Sampling events:  mem_inst_retired.all_stores:pp
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization:                         28.20 % Clocks     [  4.93%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.1_Port_Utilized:         26.28 % CoreClocks [  4.83%]
    This metric represents Core cycles fraction where the CPU
    executed total of 1 uop per cycle on all execution ports...
               MUX:                                                                 4.65 %           
    PerfMon Event Multiplexing accuracy indicator

This doesn't really shed much light: we already knew it must be the stores messing things up, but why? Intel's description of the condition doesn't say much.

Here's a reasonable summary of some of the issues involved in L1-L2 interaction.


Update Feb 2019: I can no longer reproduce the "bimodal" part of the performance: for me, on the same i7-6700HQ box, the performance is now always very slow in the cases where the slow and very-slow bimodal performance applied, i.e., with results around 16-20 cycles per line, like this:

(figure: everything is slow now)

This change seems to have been introduced in the August 2018 Skylake microcode update, revision 0xC6. The prior microcode, 0xC2, shows the original behavior described in the question.


1 This is a greatly simplified MCVE of my original loop, which was at least 3 times the size and which did lots of additional work, but exhibited exactly the same performance as this simple version, bottlenecked on the same mysterious issue.

3 In particular, it looks exactly like this if you write the assembly by hand, or if you compile it with gcc -O1 (version 5.4.1), and probably with most reasonable compilers (volatile is used to avoid sinking the mostly-dead second store outside the loop).

4 No doubt you could convert this to MASM syntax with a few minor edits since the assembly is so trivial. Pull requests accepted.

Here's what I've found so far. Unfortunately it doesn't really offer an explanation for the poor performance, and none at all for the bimodal distribution, but is more a set of rules for when you might see this performance and notes on mitigating it:

  • The store throughput into L2 appears to be at most one 64-byte cache line per three cycles 0, putting a ~21 bytes per cycle upper limit on store throughput. Said another way, a series of stores that miss in L1 and hit in L2 will take at least three cycles per cache line touched.
  • Above that baseline there is a significant penalty when stores that hit in L2 are interleaved with stores to a different cache line (regardless of whether those stores hit in L1 or L2).
  • The penalty is apparently somewhat larger for stores that are nearby (but still not in the same cache line).
  • The bimodal performance is at least superficially related to the above effect, since in the non-interleaving case it does not appear to occur, although I don't have a further explanation for it.
  • If you ensure the cache line is already in L1 before the store, by a prefetch or a dummy load, the slow performance disappears and the performance is no longer bimodal.

Details and Pictures

64-byte Stride

The original question arbitrarily used a stride of 16, but let's start with probably the simplest case: a stride of 64, i.e., one full cache line. As it turns out the various effects are visible with any stride, but 64 ensures a cache miss (satisfied from L2) on every store and so removes some variables.

Let's also remove the second store for now - so we're just testing a single 64-byte strided store over 64K of memory:

top:
mov    BYTE PTR [rdx],al
add    rdx,0x40
sub    rdi,0x1
jne    top

Running this in the same harness as above, I get about 3.05 cycles/store 2, although there is quite a bit of variance compared to what I'm used to seeing (you can even find a 3.0 in there).

So we know already that we probably aren't going to do better than this for sustained stores purely to L2 1. While Skylake apparently has a 64-byte per cycle throughput between L1 and L2, in the case of a stream of stores that bandwidth has to be shared both for evictions from L1 and for loading new lines into L1. 3 cycles seems reasonable if it takes, say, 1 cycle each to (a) evict the dirty victim line from L1 to L2, (b) update L1 with the new line from L2, and (c) commit the store into L1.

What happens when you add a second write to the same cache line (to the next byte, although it turns out not to matter) in the loop? Like this:

top:
mov    BYTE PTR [rdx],al
mov    BYTE PTR [rdx+0x1],al
add    rdx,0x40
sub    rdi,0x1
jne    top

Here's a histogram of the timing for 1000 runs of the test harness for the above loop:

  count   cycles/itr
      1   3.0
     51   3.1
      5   3.2
      5   3.3
     12   3.4
    733   3.5
    139   3.6
     22   3.7
      2   3.8
     11   4.0
     16   4.1
      1   4.3
      2   4.4

So the majority of times are clustered around 3.5 cycles. That means that this additional store only added 0.5 cycles to the timing. It could be that the store buffer is able to drain two stores to the L1 if they are in the same line, but that this only happens about half the time.

Consider that the store buffer contains a series of stores like 1, 1, 2, 2, 3, 3 where the digit indicates the cache line: half of the positions have two consecutive stores to the same cache line and half don't. As the store buffer is waiting to drain stores, and the L1 is busily evicting to and accepting lines from L2, the L1 will become available for a store at an "arbitrary" point, and if it is at the position 1, 1 maybe the stores drain in one cycle, but if it's at 1, 2 it takes two cycles.

Note there is another peak of about 6% of results around 3.1 rather than 3.5. That could be a steady state where we always get the lucky outcome. There is another peak of around 3% at ~4.0-4.1 - the "always unlucky" arrangement.

Let's test this theory by looking at various offsets between the first and second stores:

top:
mov    BYTE PTR [rdx + FIRST],al
mov    BYTE PTR [rdx + SECOND],al
add    rdx,0x40
sub    rdi,0x1
jne    top
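For readers who prefer C++, a roughly equivalent rendering of this parameterized loop might be (a sketch - the actual sweep used the asm above, with FIRST and SECOND substituted at assembly time):

#include <cstddef>

template <size_t FIRST, size_t SECOND>
void two_offset_stores(size_t iters, char* out) {
    volatile char* p = out;      // volatile so the stores aren't optimized away
    do {
        p[FIRST]  = 1;           // first store
        p[SECOND] = 1;           // second store, at a configurable offset
        p += 64;                 // advance one full cache line per iteration
    } while (--iters > 0);
}
// Note: 'out' needs max(FIRST, SECOND) bytes of slack past the last line touched.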

We try all values of FIRST and SECOND from 0 to 256 in steps of 8. The results, with varying FIRST values on the vertical axis and SECOND on the horizontal:

(figure: cycles/iter for various store offsets)

We see a specific pattern - the white values are "fast" (around the 3.0-4.1 values discussed above for the offset of 1). Yellow values are higher, up to 8 cycles, and red up to 10. The purple outliers are the highest and are usually cases where the "slow mode" described in the OP kicks in (usually clocking in at ~18.0 cycles/iter). We notice the following:

  • From the pattern of white cells, we see that we get the fast ~3.5 cycle result as long as the second store is in the same cache line or the next one relative to the first store. This is consistent with the idea above that stores to the same cache line are handled more efficiently. The reason that having the second store in the next cache line works is that the pattern ends up being the same, except for the very first access: 0, 0, 1, 1, 2, 2, ... vs 0, 1, 1, 2, 2, ... - where in the second case it is the second store that first touches each cache line. The store buffer doesn't care though. As soon as you get into different cache lines, you get a pattern like 0, 2, 1, 3, 2, ... and apparently this sucks?

  • The purple "outliers" never appear in the white areas, so they are apparently restricted to the scenario that is already slow (and here they make it about 2.5x slower still: from ~8 to 18 cycles).

We can zoom out a bit and look at even larger offsets:

(figure: offsets up to 2048)

The same basic pattern holds, although we see that the performance improves (green area) as the second store gets further away (ahead or behind) from the first one, up until it gets worse again at an offset of about ~1700 bytes. Even in the improved area we only get to 5.8 cycles/iteration at best, still much worse than the same-line performance of 3.5.

If you add any kind of load or prefetch instruction that runs ahead 3 of the stores, both the overall slow performance and the "slow mode" outliers disappear:

(figure: all good)

You can port this back to the original stride-16 problem - any type of prefetch or load in the core loop, pretty much regardless of the distance (even if it's behind, in fact), fixes the issue and you get 2.3 cycles/iteration, close to the best possible ideal of 2.0 and equal to the sum of the two stores with separate loops.
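In terms of the original C++ loop, that fix might look like the following (a sketch using GCC/Clang's __builtin_prefetch; per the above, a dummy load of the target line works just as well and the exact distance barely matters):

#include <cstddef>
#include <cstdint>

extern volatile uint32_t value;      // the same global as in the question

void weirdo_cpp_prefetched(size_t iters, uint32_t* output) {
    uint32_t x = value;
    uint32_t          *rdx = output;
    volatile uint32_t *rsi = output;
    do {
        __builtin_prefetch(rdx, 1);  // bring the target line toward L1 (1 = prefetch for write)
        *rdx = x;
        *rsi = x;

        rdx += 4;                    // 16 byte stride
    } while (--iters > 0);
}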

So the basic rule is that stores to L2 without corresponding loads are much slower than if you software prefetch them - unless the entire store stream accesses cache lines in a single sequential pattern. That's contrary to the idea that a linear pattern like this never benefits from SW prefetch.

I don't really have a fleshed-out explanation, but it could include these factors:

  • Having other stores in the store buffers may reduce the concurrency of the requests going to L2. It isn't clear exactly when stores that are going to miss in L1 allocate a store buffer, but perhaps it occurs near when the store is going to retire and there is a certain amount of "lookahead" into the store buffer to bring locations into L1, so having additional stores that aren't going to miss in L1 hurts the concurrency since the lookahead can't see as many requests that will miss.
  • Perhaps there are conflicts for L1 and L2 resources, like read and write ports and inter-cache bandwidth, that are worse with this pattern of stores. For example, when stores to different lines interleave, maybe they cannot drain as quickly from the store queue (see above, where it appears that in some scenarios more than one store may drain per cycle).

These comments by Dr. McCalpin on the Intel forums are also quite interesting.


0 Mostly only achievable with the L2 streamer disabled, since otherwise the additional contention on the L2 slows this down to about 1 line per 3.5 cycles.

1 Contrast this with loads, where I get almost exactly 1.5 cycles per load, for an implied bandwidth of ~43 bytes per cycle. This makes perfect sense: the L1<->L2 bandwidth is 64 bytes per cycle, but assuming that the L1 is either accepting a line from the L2 or servicing load requests from the core every cycle (but not both in parallel), then you have 3 cycles for two loads to different L2 lines: 2 cycles to accept the lines from L2, and 1 cycle to satisfy two load instructions.

2 With prefetching off. As it turns out, the L2 prefetcher competes for access to the L2 cache when it detects streaming access: even though it always finds the candidate lines and doesn't go to L3, this slows down the code and increases variability. The conclusions generally hold with prefetching on, but everything is just a bit slower (here's a big blob of results with prefetching on - you see about 3.3 cycles per store, but with lots of variability).

3 It doesn't even really need to be ahead - prefetching several lines behind also works: I guess the prefetch/loads just quickly run ahead of the stores, which are bottlenecked, so they get ahead anyway. In this way, the prefetching is kind of self-healing and seems to work with almost any value you put in.

Sandy Bridge has "L1 data hardware pre-fetchers". What this means is that initially, when you do your store, the CPU has to fetch data from L2 into L1; but after this has happened several times the hardware pre-fetcher notices the nice sequential pattern and starts pre-fetching data from L2 into L1 for you, so that the data is either in L1 or "half way to L1" before your code does its store.
