
Why is a store-load barrier considered expensive?

Most CPU architectures will reorder store-load operations, but my question is why? My interpretation of a store-load barrier would look like this:

x = 50;
store_load_barrier;
y = z;

Furthermore, I don't see how this barrier would be of much use in lock-free programming in comparison to release and acquire semantics.

Short Answer: The store-load barrier prevents the processor from speculatively executing loads that come after the barrier until all previous stores have completed.

Details:

The reason a store-load barrier is expensive is that it prevents the reordering of LOAD and STORE operations across the barrier.

Suppose you had an instruction sequence like the following:

...             ;; long latency operation to compute r1
ST r1, [ADDR1]  ;; store value in r1 to memory location referenced by ADDR1
LD r3, [ADDR2]  ;; load r3 with value in memory location ADDR2
...             ;; instructions that use result in r3

When this sequence executes, the value of r1 will be the result of an operation that takes a long time to complete. The instruction ST r1, [ADDR1] will have to stall until r1 is ready. In the meantime, an out-of-order processor can speculatively execute the LD r3, [ADDR2] and other instructions if they are independent of the earlier store. They won't actually commit until the store is committed, but by doing most of the work speculatively the results can be saved in the reorder buffer, ready to commit more quickly.

This works for a single-processor system because the CPU can check whether there are dependencies between ADDR1 and ADDR2. But in a multiprocessor system, multiple CPUs can independently execute loads and stores. There might be multiple processors that are performing a ST to ADDR1 and a LD from ADDR2. If the CPUs are able to speculatively execute these instructions that don't appear to have dependencies, then different CPUs might see different results. I think the following blog post gives a good explanation of how this can happen (I don't think it's something I could summarize succinctly in this answer).
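As a rough illustration of "different CPUs might see different results", here is a minimal C++ sketch of the classic store-buffering pattern. It is an assumption on my part, not something taken from the blog post: the helper name count_both_zero_unfenced is hypothetical, and the second thread uses the two addresses in the opposite roles so that each thread loads what the other one stores. With relaxed atomics and no barrier, the C++ memory model permits both loads to return 0 in the same run; whether you actually observe that depends on the hardware, the compiler, and timing.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the store-then-load pattern on two threads,
// with no barrier, and counts the runs in which BOTH loads returned 0.
int count_both_zero_unfenced(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> addr1{0}, addr2{0};
        int r_a = -1, r_b = -1;

        std::thread a([&] {
            addr1.store(1, std::memory_order_relaxed);    // ST [ADDR1]
            r_a = addr2.load(std::memory_order_relaxed);  // LD [ADDR2], may be reordered before the store
        });
        std::thread b([&] {
            addr2.store(1, std::memory_order_relaxed);    // ST [ADDR2]
            r_b = addr1.load(std::memory_order_relaxed);  // LD [ADDR1], may be reordered before the store
        });
        a.join();
        b.join();

        // Each thread missed the other's store: allowed here, since nothing
        // orders the store before the following load.
        if (r_a == 0 && r_b == 0) ++both_zero;
    }
    return both_zero;
}
```

Because a new pair of threads is started for every iteration, the two bodies often do not overlap at all, so a count of zero on a given machine does not prove the reordering cannot happen.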

Now consider the code sequence that has a store-load barrier:

...             ;; long latency operation to compute r1
ST r1, [ADDR1]  ;; store value in r1 to memory location referenced by ADDR1
ST_LD_BARRIER   ;; store-load barrier
LD r3, [ADDR2]  ;; load r3 with value in memory location ADDR2
...             ;; instructions that use result in r3

This would prevent the LD r3, [ADDR2] instruction and the dependent instructions that follow it from being speculatively executed until the previous store instructions complete. This can reduce CPU performance because the entire CPU pipeline might have to stall while waiting for the ST instruction to complete, even though within the CPU itself there is no dependency between the LD and the ST.
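For what it's worth, the barrier sequence above can be sketched in C++ as follows. This assumes that std::atomic_thread_fence(std::memory_order_seq_cst) plays the role of ST_LD_BARRIER; the function names store_barrier_load and count_both_zero_fenced are hypothetical. When two threads each run store, fence, load on opposite addresses, the C++ memory model guarantees that at least one of the loads observes the other thread's store, so the "both loads return 0" outcome can no longer happen.

```cpp
#include <atomic>
#include <thread>

// ST [store_to]; ST_LD_BARRIER; LD [load_from], expressed with C++ atomics.
// Assumption: a seq_cst fence stands in for the store-load barrier.
int store_barrier_load(std::atomic<int>& store_to, std::atomic<int>& load_from) {
    store_to.store(1, std::memory_order_relaxed);          // ST
    std::atomic_thread_fence(std::memory_order_seq_cst);   // store-load barrier
    return load_from.load(std::memory_order_relaxed);      // LD, cannot move above the fence
}

// Hypothetical helper: counts runs in which both threads read 0.
int count_both_zero_fenced(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> addr1{0}, addr2{0};
        int r_a = -1, r_b = -1;

        std::thread a([&] { r_a = store_barrier_load(addr1, addr2); });
        std::thread b([&] { r_b = store_barrier_load(addr2, addr1); });
        a.join();
        b.join();

        // With a seq_cst fence in both threads, at least one load must see 1.
        if (r_a == 0 && r_b == 0) ++both_zero;
    }
    return both_zero;
}
```

On x86, compilers typically implement this fence with a full-barrier instruction such as MFENCE or a locked read-modify-write, which is where the cost described above comes from. This is also the guarantee that acquire and release ordering alone do not give you: neither of them orders an earlier store before a later load to a different address.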

There are some things that can be done to limit how long the CPU has to stall. But the bottom line is that the store-load barrier creates additional dependencies between loads and stores, and this limits the amount of speculative execution that the CPU can perform.
