Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake
I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 bytes and one that's always to the same location 1, like this:
volatile uint32_t value;
void weirdo_cpp(size_t iters, uint32_t* output) {
uint32_t x = value;
uint32_t *rdx = output;
volatile uint32_t *rsi = output;
do {
*rdx = x;
*rsi = x;
rdx += 4; // 16 byte stride
} while (--iters > 0);
}
In assembly this loop probably 3 looks like:
weirdo_cpp:
...
align 16
.top:
mov [rdx], eax ; stride 16
mov [rsi], eax ; never changes
add rdx, 16
dec rdi
jne .top
ret
When the memory region accessed is in L2 I would expect this to run at less than 3 cycles per iteration. The second store just keeps hitting the same location and should add about a cycle. The first store implies bringing in a line from L2 and hence also evicting a line once every 4 iterations. I'm not sure how you evaluate the L2 cost, but even if you conservatively estimate that the L1 can only do one of the following every cycle: (a) commit a store or (b) receive a line from L2 or (c) evict a line to L2, you'd get something like 1 + 0.25 + 0.25 = 1.5 cycles for the stride-16 store stream.

Indeed, if you comment out one store you get ~1.25 cycles per iteration for the first store alone, and ~1.01 cycles per iteration for the second store alone, so 2.5 cycles per iteration seems like a conservative estimate.
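The conservative estimate above can be written down as a quick sanity check (this is just the arithmetic from the paragraph, not a simulation; the function name is mine):

```cpp
#include <cassert>

// Conservative model: the L1 can do only one of these per cycle:
// (a) commit a store, (b) accept a fill line from L2, (c) evict a line to L2.
// The stride-16 store touches a new 64-byte line once every 4 iterations.
constexpr double stride16_estimate() {
    return 1.0        // (a) one store commit per iteration
         + 1.0 / 4.0  // (b) one L2 fill per 4 iterations
         + 1.0 / 4.0; // (c) one eviction per 4 iterations
}
```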
The actual performance is very odd, however. Here's a typical run of the test harness:
Estimated CPU speed: 2.60 GHz
output size : 64 KiB
output alignment: 32
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.89 cycles/iter, 1.49 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
4.73 cycles/iter, 1.81 ns/iter, cpu before: 0, cpu after: 0
7.33 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.33 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.34 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.26 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.31 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.29 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.29 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.27 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.30 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.30 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
Two things are weird here.

First are the bimodal timings: there is a fast mode and a slow mode. We start out in slow mode taking about 7.3 cycles per iteration, and at some point transition to about 3.9 cycles per iteration. This behavior is consistent and reproducible, and the two timings are always tightly clustered around the two values. The transition shows up in both directions, from slow mode to fast mode and the other way around (and sometimes multiple transitions in one run).
The other weird thing is the really bad performance. Even in fast mode, at about 3.9 cycles the performance is much worse than the 1.0 + 1.3 = 2.3 cycles worst case you'd expect from adding together each of the cases with a single store (and assuming that absolutely zero work can be overlapped when both stores are in the loop). In slow mode, performance is terrible compared to what you'd expect based on first principles: it is taking 7.3 cycles to do 2 stores, and if you put it in L2 store bandwidth terms, that's roughly 29 cycles per L2 store (since we only store one full cache line every 4 iterations).

Skylake is recorded as having a 64B/cycle throughput between L1 and L2, which is way higher than the observed throughput here (about 2 bytes/cycle in slow mode).
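The slow-mode numbers above come from simple arithmetic, which can be spelled out (helper names are mine, just for illustration):

```cpp
#include <cassert>
#include <cmath>

// At a 16-byte stride, one full 64-byte cache line is completed every
// 4 iterations, so cycles/line = 4 * cycles/iteration, and the implied
// L1<->L2 store bandwidth is 64 bytes over that many cycles.
double cycles_per_line(double cycles_per_iter) {
    return 4.0 * cycles_per_iter;
}

double implied_bytes_per_cycle(double cycles_per_iter) {
    return 64.0 / cycles_per_line(cycles_per_iter);
}
```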
What explains the poor throughput and the bimodal performance, and can I avoid it? I'm also curious whether this reproduces on other architectures and even on other Skylake boxes. Feel free to include local results in the comments.
You can find the test code and harness on github. There is a Makefile for Linux or Unix-like platforms, but it should be relatively easy to build on Windows too. If you want to run the asm variant you'll need nasm or yasm for the assembly 4 - if you don't have that you can just try the C++ version.
Here are some possibilities that I considered and largely eliminated. Many of the possibilities are eliminated by the simple fact that you see the performance transition randomly in the middle of the benchmarking loop, when many things simply haven't changed (e.g., if it was related to the output array alignment, it couldn't change in the middle of a run since the same buffer is used the entire time). I'll refer to this as the default elimination below (even for things that are default elimination there is often another argument to be made).
- Memory pressure from other processes (e.g., running with stress -vm 4). The benchmark itself should be completely core-local anyways since it fits in L2, and perf confirms there are very few L2 misses per iteration (about 1 miss every 300-400 iterations, probably related to the printf code).
- Frequency scaling: I use intel_pstate in performance mode. No frequency variations are observed during the test (CPU stays essentially locked at 2.59 GHz).
- TLB effects: perf doesn't report any particularly weird TLB behavior.

I used toplev.py which implements Intel's Top Down analysis method, and to no surprise it identifies the benchmark as store bound:
BE Backend_Bound: 82.11 % Slots [ 4.83%]
BE/Mem Backend_Bound.Memory_Bound: 59.64 % Slots [ 4.83%]
BE/Core Backend_Bound.Core_Bound: 22.47 % Slots [ 4.83%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 0.03 % Stalls [ 4.92%]
This metric estimates how often the CPU was stalled without
loads missing the L1 data cache...
Sampling events: mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 74.91 % Stalls [ 4.96%] <==
This metric estimates how often CPU was stalled due to
store memory accesses...
Sampling events: mem_inst_retired.all_stores:pp
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.20 % Clocks [ 4.93%]
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.1_Port_Utilized: 26.28 % CoreClocks [ 4.83%]
This metric represents Core cycles fraction where the CPU
executed total of 1 uop per cycle on all execution ports...
MUX: 4.65 %
PerfMon Event Multiplexing accuracy indicator
This doesn't really shed much light: we already knew it must be the stores messing things up, but why? Intel's description of the condition doesn't say much.

Here's a reasonable summary of some of the issues involved in L1-L2 interaction.
Update Feb 2019: I can no longer reproduce the "bimodal" part of the performance: for me, on the same i7-6700HQ box, the performance is now always very slow in the same cases where the slow and very slow bimodal performance applied, i.e., with results around 16-20 cycles per line.
This change seems to have been introduced in the August 2018 Skylake microcode update, revision 0xC6. The prior microcode, 0xC2, shows the original behavior described in the question.
1 This is a greatly simplified MCVE of my original loop, which was at least 3 times the size and which did lots of additional work, but exhibited exactly the same performance as this simple version, bottlenecked on the same mysterious issue.
3 In particular, it looks exactly like this if you write the assembly by hand, or if you compile it with gcc -O1 (version 5.4.1), and probably with most reasonable compilers (volatile is used to avoid sinking the mostly-dead second store outside the loop).
4 No doubt you could convert this to MASM syntax with a few minor edits since the assembly is so trivial. Pull requests accepted.
What I've found so far. Unfortunately it doesn't really offer an explanation for the poor performance, and not at all for the bimodal distribution, but is more a set of rules for when you might see the performance, and notes on mitigating it:
The original question arbitrarily used a stride of 16, but let's start with probably the simplest case: a stride of 64, i.e., one full cache line. As it turns out the various effects are visible with any stride, but 64 ensures an L2 cache miss on every stride and so removes some variables.
Let's also remove the second store for now - so we're just testing a single 64-byte strided store over 64K of memory:
top:
mov BYTE PTR [rdx],al
add rdx,0x40
sub rdi,0x1
jne top
Running this in the same harness as above, I get about 3.05 cycles/store 2, although there is quite a bit of variance compared to what I'm used to seeing (you can even find a 3.0 in there).
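For reference, a C++ equivalent of that asm loop might look like the following sketch (the function name is mine, and a given compiler won't necessarily emit exactly the assembly above):

```cpp
#include <cstddef>
#include <cstdint>

// One byte store per 64-byte cache line, i.e. a new line is touched
// on every single iteration (unlike the stride-16 loop, which touches
// a new line only every 4 iterations).
void store_stride64(size_t iters, uint8_t* p, uint8_t v) {
    do {
        *p = v;   // mov BYTE PTR [rdx], al
        p += 64;  // add rdx, 0x40
    } while (--iters > 0);
}
```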
So we already know we probably aren't going to do better than this for sustained stores purely to L2 1. While Skylake apparently has a 64 byte throughput between L1 and L2, in the case of a stream of stores, that bandwidth has to be shared both for evictions from L1 and for loading new lines into L1. 3 cycles seems reasonable if it takes, say, 1 cycle each to (a) evict the dirty victim line from L1 to L2, (b) update L1 with the new line from L2 and (c) commit the store into L1.
What happens when you add a second write to the same cache line (to the next byte, although it turns out not to matter) in the loop? Like this:
top:
mov BYTE PTR [rdx],al
mov BYTE PTR [rdx+0x1],al
add rdx,0x40
sub rdi,0x1
jne top
Here's a histogram of the timing for 1000 runs of the test harness for the above loop:
count cycles/itr
1 3.0
51 3.1
5 3.2
5 3.3
12 3.4
733 3.5
139 3.6
22 3.7
2 3.8
11 4.0
16 4.1
1 4.3
2 4.4
So the majority of times are clustered around 3.5 cycles. That means the additional store only added 0.5 cycles to the timing. It could be something like the store buffer being able to drain two stores to the L1 if they are in the same line, but this only happens about half the time.
Consider that the store buffer contains a series of stores like 1, 1, 2, 2, 3, 3 where 1 indicates the cache line: half of the positions have two consecutive values from the same cache line and half don't. As the store buffer is waiting to drain stores, and the L1 is busily evicting to and accepting lines from L2, the L1 will come available for a store at an "arbitrary" point, and if it is at the position 1, 1 maybe the stores drain in one cycle, but if it's at 1, 2 it takes two cycles.
Note there is another peak of about 6% of results around 3.1 rather than 3.5. That could be a steady state where we always get the lucky outcome. There is another peak of around 3% at ~4.0-4.1 - the "always unlucky" arrangement.
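The lucky/unlucky reading of the histogram can be written as a toy model (my own construction, not anything the harness measures): a base of 3 cycles per line for evict/fill/commit, plus an extra cycle whenever the pair of same-line stores fails to drain together:

```cpp
#include <cassert>

// Toy model: each cache line costs a base of 3 cycles (evict + fill +
// commit, as in the single-store loop). When the drain phase is "lucky",
// the two same-line stores merge into one commit slot (no extra cost);
// when "unlucky", the second store costs one extra cycle.
double cycles_per_iter(bool lucky) {
    return 3.0 + (lucky ? 0.0 : 1.0);
}

// With an effectively random phase, the long-run average is the mean of
// the two outcomes - matching the dominant 3.5-cycle histogram peak,
// with the 3.1 and 4.0-4.1 peaks as the always-lucky/always-unlucky cases.
double expected_cycles() {
    return (cycles_per_iter(true) + cycles_per_iter(false)) / 2.0;
}
```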
Let's test this theory by looking at various offsets between the first and second stores:
top:
mov BYTE PTR [rdx + FIRST],al
mov BYTE PTR [rdx + SECOND],al
add rdx,0x40
sub rdi,0x1
jne top
We try all values of FIRST and SECOND from 0 to 256 in steps of 8. The results, with varying FIRST values on the vertical axis and SECOND on the horizontal:
We see a specific pattern - the white values are "fast" (around the 3.0-4.1 values discussed above for the offset of 1). Yellow values are higher, up to 8 cycles, and red up to 10. The purple outliers are the highest and are usually cases where the "slow mode" described in the OP kicks in (usually clocking in at 18.0 cycles/iter). We notice the following:
From the pattern of white cells, we see that we get the fast ~3.5 cycle result as long as the second store is in the same cache line or the next one relative to the first store. This is consistent with the idea above that stores to the same cache line are handled more efficiently. The reason that having the second store in the next cache line works is that the pattern ends up being the same, except for the very first access: 0, 0, 1, 1, 2, 2, ... vs 0, 1, 1, 2, 2, ... - where in the second case it is the second store that first touches each cache line. The store buffer doesn't care though. As soon as you get into different cache lines, you get a pattern like 0, 2, 1, 3, 2, ... and apparently this sucks?
The purple "outliers" never appear in the white areas, so they are apparently restricted to the scenario that is already slow (and the slowdown here makes it about 2.5x slower: from ~8 to 18 cycles).
We can zoom out a bit and look at even larger offsets:
The same basic pattern holds, although we see that the performance improves (green area) as the second store gets further away (ahead or behind) from the first one, up until it gets worse again at an offset of about ~1700 bytes. Even in the improved area we only get to at best 5.8 cycles/iteration, still much worse than the same-line performance of 3.5.
If you add any kind of load or prefetch instruction that runs ahead 3 of the stores, both the overall slow performance and the "slow mode" outliers disappear:
You can port this back to the original stride-by-16 problem - any type of prefetch or load in the core loop, pretty much insensitive to the distance (even if it's behind in fact), fixes the issue and you get 2.3 cycles/iteration, close to the best possible ideal of 2.0, and equal to the sum of the two stores with separate loops.
So the basic rule is that stores to L2 without corresponding loads are much slower than if you software-prefetch them - unless the entire store stream accesses cache lines in a single sequential pattern. That's contrary to the idea that a linear pattern like this never benefits from SW prefetch.
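Applied to the original C++ loop, the mitigation might look like the following sketch, using the GCC/Clang __builtin_prefetch builtin (the function name and the particular distance - 4 lines ahead - are my choices; per footnote 3 almost any distance works):

```cpp
#include <cstddef>
#include <cstdint>

volatile uint32_t value2;  // stand-in for the original `value` global

void weirdo_cpp_prefetched(size_t iters, uint32_t* output) {
    uint32_t x = value2;
    uint32_t* rdx = output;
    volatile uint32_t* rsi = output;
    do {
        // Software prefetch (for write) somewhere ahead of the strided
        // store stream; adding any load/prefetch like this is what
        // eliminates the slow behavior.
        __builtin_prefetch(rdx + 64, 1);  // 256 bytes = 4 lines ahead
        *rdx = x;
        *rsi = x;
        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}
```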
I don't really have a fleshed-out explanation, but it could include these factors:
These comments by Dr. McCalpin on the Intel forums are also quite interesting.
0 Mostly only achievable with the L2 streamer disabled, since otherwise the additional contention on the L2 slows this down to about 1 line per 3.5 cycles.
1 Contrast this with loads, where I get almost exactly 1.5 cycles per load, for an implied bandwidth of ~43 bytes per cycle. This makes perfect sense: the L1<->L2 bandwidth is 64 bytes, but assuming that the L1 is either accepting a line from the L2 or servicing load requests from the core every cycle (but not both in parallel), then you have 3 cycles for two loads to different L2 lines: 2 cycles to accept the lines from L2, and 1 cycle to satisfy two load instructions.
2 With prefetching off. As it turns out, the L2 prefetcher competes for access to the L2 cache when it detects streaming access: even though it always finds the candidate lines and doesn't go to L3, this slows down the code and increases variability. The conclusions generally hold with prefetching on, but everything is just a bit slower (here's a big blob of results with prefetching on - you see about 3.3 cycles per load, but with lots of variability).
3 It doesn't even really need to be ahead - prefetching several lines behind also works: I guess the prefetches/loads just quickly run ahead of the stores, which are bottlenecked, so they get ahead anyways. In this way, the prefetching is kind of self-healing and seems to work with almost any value you put in.
Sandy Bridge has "L1 data hardware pre-fetchers". What this means is that initially when you do your store the CPU has to fetch data from L2 into L1; but after this has happened several times the hardware pre-fetcher notices the nice sequential pattern and starts pre-fetching data from L2 into L1 for you, so that the data is either in L1 or "half way to L1" before your code does its store.