
Overhead of a Memory Barrier / Fence

I'm currently writing C++ code and use a lot of memory barriers / fences in my code. I know that a memory barrier tells the compiler and the hardware not to reorder reads/writes around it. But I don't know how complex this operation is for the processor at runtime.

My question is: what is the runtime overhead of such a barrier? I didn't find any useful answer with Google... Is the overhead negligible, or does heavy use of memory barriers lead to serious performance problems?

Best regards.

Compared to arithmetic and "normal" instructions I understand these to be very costly, but I don't have numbers to back up that statement. I like jalf's answer describing the effects of the instructions, and would like to add a bit.

There are in general a few different types of barriers, so understanding the differences can be helpful. A barrier like the one jalf mentioned is required, for example, in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64). All reads and writes must be complete, and only instructions later in the pipeline that have no memory access and no dependencies on in-progress memory operations can be executed.
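
A minimal sketch of that release half in C++11 atomics (not from the original answer; lock_word, protected_data and unlock are hypothetical names), which is roughly what the lwsync / st4.rel sequences above map to:

#include <atomic>

std::atomic<int> lock_word{1};   // 1 = held, 0 = free (hypothetical layout)
int protected_data = 0;

void unlock()
{
    protected_data = 123;  // write made inside the critical section
    // Release store: all earlier reads and writes must be visible before
    // the lock word is cleared (lwsync + store on ppc, st4.rel on ia64).
    lock_word.store(0, std::memory_order_release);
}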

Another type of barrier is the sort you'd use in a mutex implementation when acquiring a lock (for example, isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:

if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
{
   foo( pSharedMem->somethingElse ) ;
}

Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flag bit is complete.
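
A roughly equivalent version of that example in C++11 atomics (a sketch; Shared, ready and consumer are hypothetical names, and foo stands in for whatever consumes the value):

#include <atomic>

void foo(int);   // whatever consumes somethingElse

struct Shared
{
    std::atomic<bool> ready{false};   // the flag bit from the example above
    int somethingElse = 0;
};

void consumer(Shared* pSharedMem)
{
    // Acquire load: if the flag is observed as set, the read of somethingElse
    // below cannot be satisfied from a value fetched before the flag check.
    if (pSharedMem->ready.load(std::memory_order_acquire))
    {
        foo(pSharedMem->somethingElse);
    }
}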

There is a third type of barrier, generally less used, which is required to enforce store-load ordering. Examples of instructions that enforce such an ordering are sync on ppc (heavyweight sync), mf on ia64, and membar #StoreLoad on sparc (required even for TSO).

Using ia64-like pseudocode to illustrate, suppose one had

st4.rel
ld4.acq

Without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores, if non-dependent?) could sneak in and complete earlier, since nothing prevents that otherwise.
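
The same pattern in C++11 (a sketch; x, y and thread1 are hypothetical names): without the full fence, the load of y could be performed before the store to x becomes visible, which is exactly the StoreLoad reordering that mf / sync / membar #StoreLoad exist to prevent.

#include <atomic>

std::atomic<int> x{0};
std::atomic<int> y{0};

int thread1()
{
    x.store(1, std::memory_order_release);                // like st4.rel
    // Full barrier: forbids StoreLoad reordering between the store above
    // and the load below (mf on ia64, sync on ppc, membar #StoreLoad on sparc).
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_acquire);             // like ld4.acq
}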

Because mutex implementations very likely use only acquire and release barriers in their implementations, I'd expect that an observable effect of this is that memory accesses following a lock release may actually sometimes occur while "still in the critical section".
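
A sketch of that effect with a hypothetical spinlock built only from acquire/release operations (spin_lock_word, unrelated, etc. are made-up names):

#include <atomic>

std::atomic<int> spin_lock_word{0};   // 0 = free, 1 = held
int shared_data = 0;
int unrelated = 0;

void update_then_peek()
{
    while (spin_lock_word.exchange(1, std::memory_order_acquire)) { }  // lock
    shared_data = 42;                                     // inside the critical section
    spin_lock_word.store(0, std::memory_order_release);   // unlock
    // This load sits after the unlock in program order, but a release store
    // only orders the accesses *before* it; the load may be performed before
    // the store is visible, i.e. effectively "still in the critical section".
    int x = unrelated;
    (void)x;
}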

Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).

Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense that work would have to be done without the barrier as well; it could just be hidden by executing some other instructions simultaneously, so you didn't feel the cost so much.

It also limits the CPU's (and compiler's) ability to reschedule instructions, so there may be an indirect cost as well, in that nearby instructions can't be interleaved, which might otherwise yield a more efficient execution schedule.
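
A rough way to get numbers for your own machine is to time a loop with and without a full fence. A minimal, unscientific sketch (run_ns, sink, etc. are made-up names, and the result depends heavily on the CPU and on how much memory traffic is in flight):

#include <atomic>
#include <chrono>
#include <cstdio>

std::atomic<int> sink{0};

template <bool WithFence>
long long run_ns(int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        sink.store(i, std::memory_order_relaxed);
        if (WithFence)
            std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier each iteration
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

int main()
{
    const int n = 10000000;
    std::printf("no fence:   %lld ns\n", run_ns<false>(n));
    std::printf("with fence: %lld ns\n", run_ns<true>(n));
    return 0;
}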
