
Can modern x86 implementations store-forward from more than one prior store?

In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load?

For example, consider the following sequence:

mov [rdx + 0], eax
mov [rdx + 2], eax
mov ax, [rdx + 1]

The final 2-byte load takes its second byte from the immediately preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?

Note that by store-forwarding here I'm including any mechanism that can satisfy reads from stores still in the store buffer, rather than waiting for them to commit to L1, even if it is a slower path than the best-case "forwards from a single store" case.

No.

At least, not on Haswell, Broadwell or Skylake processors. On other Intel processors, the restrictions are either similar (Sandy Bridge, Ivy Bridge) or even tighter (Nehalem, Westmere, Pentium Pro/II/III/4). On AMD, similar limitations apply.

From Agner Fog's excellent optimization manuals:

Haswell/Broadwell

The microarchitecture of Intel and AMD CPUs

§ 10.12 Store forwarding stalls

The processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding works in the following cases:

  • When a write of 64 bits or less is followed by a read of the same size and the same address, regardless of alignment.
  • When a write of 128 or 256 bits is followed by a read of the same size and the same address, fully aligned.
  • When a write of 64 bits or less is followed by a read of a smaller size which is fully contained in the write address range, regardless of alignment.
  • When an aligned write of any size is followed by two reads of the two halves, or four reads of the four quarters, etc. with their natural alignment within the write address range.
  • When an aligned write of 128 bits or 256 bits is followed by a read of 64 bits or less that does not cross an 8 bytes boundary.

A delay of 2 clocks occur if the memory block crosses a 64-bytes cache line boundary. This can be avoided if all data have their natural alignment.

Store forwarding fails in the following cases:

  • When a write of any size is followed by a read of a larger size
  • When a write of any size is followed by a partially overlapping read
  • When a write of 128 bits is followed by a smaller read crossing the boundary between the two 64-bit halves
  • When a write of 256 bits is followed by a 128 bit read crossing the boundary between the two 128-bit halves
  • When a write of 256 bits is followed by a read of 64 bits or less crossing any boundary between the four 64-bit quarters

A failed store forwarding takes 10 clock cycles more than a successful store forwarding. The penalty is much higher - approximately 50 clock cycles - after a write of 128 or 256 bits which is not aligned by at least 16.

Emphasis added

Skylake

The microarchitecture of Intel and AMD CPUs

§ 11.12 Store forwarding stalls

The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, ie an address divisible by 64 bytes.

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.

Emphasis added

In General:

A common point across microarchitectures that Agner Fog's document points out is that store forwarding is more likely to happen if the write was aligned and the reads are halves or quarters of the written value.

A Test

A test with the following tight loop:

mov [rsp-16], eax
mov [rsp-12], ebx
mov ecx, [rsp-15]

Shows that the ld_blocks.store_forward PMU counter does indeed increment. This event is documented as follows:

ld_blocks.store_forward [This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when:

  • preceding store conflicts with the load (incomplete overlap)

  • store forwarding is impossible due to u-arch limitations

  • preceding lock RMW operations are not forwarded

  • store has the no-forward bit set (uncacheable/page-split/masked stores)

  • all-blocking stores are used (mostly, fences and port I/O)

This indicates that the store-forwarding does indeed fail when a read only partially overlaps the most recent earlier store (even if it is fully contained when even earlier stores are considered).

In-order Atom may be able to do this store-forwarding without stalling at all.

Agner Fog doesn't mention this case specifically for Atom, but unlike all other CPUs, it can store-forward with 1c latency from a store to a wider or differently-aligned load. The only exception Agner found was on cache-line boundaries, where Atom is horrible (16 cycle penalty for a CL-split load or store, even when store-forwarding isn't involved).


Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?

There's a terminology issue here. Many people will interpret "Can this load be store-forwarded" as asking whether it can happen with latency as low as when all the requirements for fast-path store-forwarding are met, as listed in @IWill's answer. (Where all the loaded data comes from the most recent store that overlaps any part of the load, and other relative/absolute alignment rules are met.)

I thought at first that you were missing the third possibility: slower but still (nearly?) fixed-latency forwarding without waiting for commit to L1D, e.g. with a mechanism that scrapes the whole store buffer (and maybe loads from L1D) in cases that Agner Fog and Intel's optimization manual call "store forwarding failure".

But now I see this wording was intentional, and you really do want to ask whether or not the third option exists.

You might want to edit some of this into your question. In summary, the three likely options for Intel x86 CPUs are:

  1. Intel/Agner definition of store-forwarding success, where all the data comes from only one recent store with low and (nearly) fixed latency.
  2. Extra (but limited) latency to scan the whole store buffer and assemble the correct bytes (according to program-order), and (if necessary or always?) load from L1D to provide data for any bytes that weren't recently stored.

    This is the option we aren't sure exists.

    It also has to wait for all data from store-data uops that don't have their inputs ready yet, since it has to respect program order. There may be some information published about speculative execution with unknown store-address (eg guessing that they don't overlap), but I forget.

  3. Wait for all overlapping stores to commit to L1D, then load from L1D.

    Some real x86 CPUs might fall back to this in some cases, but they might always use option 2 without introducing a StoreLoad barrier. (Remember that x86 stores have to commit in program order, and loads have to happen in program order. This would effectively drain the store buffer to this point, like mfence, although later loads to other addresses could still speculatively store-forward or just take data from L1D.)


Evidence for the middle option:

The locking scheme proposed in Can x86 reorder a narrow store with a wider load that fully contains it? would work if store-forwarding failure required a flush to L1D. Since it doesn't work on real hardware without mfence, that's strong evidence that real x86 CPUs are merging data from the store buffer with data from L1D. So option 2 exists and is used in this case.

See also Linus Torvalds' explanation that x86 really does allow this kind of reordering, in response to someone else who proposed the same locking idea as that SO question.

I haven't tested whether store-forwarding failure/stall penalties are variable, but if they're not, that strongly implies it falls back to checking the whole store buffer when the best-case forwarding doesn't work.

Hopefully someone will answer What are the costs of failed store-to-load forwarding on x86?, which asks exactly that. I will if I get around to it.

Agner Fog only ever mentions a single number for store-forwarding penalties, and doesn't say it's bigger if cache-miss stores are in flight ahead of the store that failed to forward. (That would cause a big delay, because stores have to commit to L1D in order due to x86's strongly-ordered memory model.) He also doesn't say anything about the penalty being different when data comes from one store + L1D vs. from parts of two or more stores, so I'd guess that it works in this case, too.


I suspect that "failed" store-forwarding is common enough that it's worth the transistors to handle it faster than just flushing the store queue and reloading from L1D.

For example, gcc doesn't specifically try to avoid store-forwarding stalls, and some of its idioms cause them (e.g. __m128i v = _mm_set_epi64x(a, b); in 32-bit code stores/reloads to the stack, which is already the wrong strategy on most CPUs in most cases, hence that bug report). It's not good, but the results aren't usually catastrophic, AFAIK.

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.
