
GCC memory barrier __sync_synchronize vs asm volatile("": : :"memory")

asm volatile("": : :"memory") is often used as a memory barrier (e.g. as seen in the Linux kernel barrier macro).

This sounds similar to what the GCC builtin __sync_synchronize does.

Are these two similar?

If not, what are the differences, and when would one be used over the other?

There's a significant difference - the first option (inline asm) actually does nothing at runtime; no instruction is executed and the CPU doesn't know about it. It only serves at compile time, to tell the compiler not to move loads or stores past this point (in either direction) as part of its optimizations. It's called a SW barrier.
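
A minimal sketch of how such a compiler-only barrier is typically defined and used (the macro below mirrors the form of the Linux kernel's barrier(); the surrounding publish() code is only illustrative):

#define barrier() asm volatile("" : : : "memory")   /* no instruction emitted */

int payload;
int ready;

void publish(int value)
{
    payload = value;
    barrier();      /* the *compiler* may not move the store to 'payload'
                       past this point or cache memory values in registers */
    ready = 1;      /* the CPU, however, is still free to reorder at runtime */
}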

The second barrier (the builtin sync) would simply translate into a HW barrier, probably a fence (mfence/sfence) operation if you're on x86, or its equivalent in other architectures. The CPU may also do various optimizations at runtime, the most important being out-of-order execution - this instruction tells it to make sure that loads or stores can't pass this point and must be observed on the correct side of the sync point.
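
A sketch of the same pattern with the hardware barrier instead; built with GCC for x86, __sync_synchronize() typically shows up as an mfence instruction (the exact code generation depends on the target and compiler version), and it also acts as a compiler barrier:

int payload;
int ready;

void publish_hw(int value)
{
    payload = value;
    __sync_synchronize();   /* full HW barrier: neither GCC nor the CPU may
                               let the two stores pass each other           */
    ready = 1;
}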

Here's another good explanation:

Types of Memory Barriers

As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.

In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.

There is also a class of memory barrier that is specific to multi-processor environments. The names of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers; on uni-processor systems, they are software memory barriers.

The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier is an implied software barrier.
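
In Linux kernel terms, the categories described above map roughly onto the following macros (a summary only; the actual definitions are architecture-dependent):

/* barrier()              - software (compiler-only) barrier, full
 * mb(), rmb(), wmb()     - hardware full / read / write barriers
 * smp_mb(), smp_rmb(),
 * smp_wmb()              - hardware barriers on SMP builds, reduced to
 *                          compiler barriers on uni-processor builds   */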

An example of when a SW barrier is useful: consider the following code -

for (i = 0; i < N; ++i) {
    a[i]++;
}

This simple loop, compiled with optimizations, would most likely be unrolled and vectorized. Here's the assembly code gcc 4.8.0 -O3 generated; note the packed (vector) operations:

400420:       66 0f 6f 00             movdqa (%rax),%xmm0
400424:       48 83 c0 10             add    $0x10,%rax
400428:       66 0f fe c1             paddd  %xmm1,%xmm0
40042c:       66 0f 7f 40 f0          movdqa %xmm0,0xfffffffffffffff0(%rax)
400431:       48 39 d0                cmp    %rdx,%rax
400434:       75 ea                   jne    400420 <main+0x30>

However, when adding your inline assembly on each iteration, gcc is not permitted to change the order of the operations past the barrier, so it can't group them, and the assembly becomes the scalar version of the loop:

400418:       83 00 01                addl   $0x1,(%rax)
40041b:       48 83 c0 04             add    $0x4,%rax
40041f:       48 39 d0                cmp    %rdx,%rax
400422:       75 f4                   jne    400418 <main+0x28>
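
The modified loop that produces this scalar code would look roughly like this (the placement of the barrier is the only change to the original loop):

for (i = 0; i < N; ++i) {
    a[i]++;
    asm volatile("" : : : "memory");  /* compiler barrier on every iteration:
                                         gcc may not merge or reorder the
                                         memory accesses across it          */
}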

However, when the CPU performs this code, it's permitted to reorder the operations "under the hood", as long as it does not break the memory ordering model. This means the operations can be performed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.
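
If a hardware fence is what you actually need, the builtin can be dropped into the same spot; on x86 this would typically add an mfence per iteration on top of the compiler-barrier effect, at a real runtime cost (a sketch, not generated output):

for (i = 0; i < N; ++i) {
    a[i]++;
    __sync_synchronize();   /* full HW barrier: also constrains the CPU's
                               out-of-order execution                     */
}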

A comment on the usefulness of SW-only barriers:

On some micro-controllers and other embedded platforms, you may have multitasking but no cache system or cache latency, and hence no HW barrier instructions. So you need to do things like SW spin-locks. The SW barrier prevents compiler optimizations (read/write combining and reordering) in these algorithms.
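
A sketch of the kind of busy-wait synchronization where the SW barrier matters on such a platform: a flag published from an interrupt handler and polled from the main loop (the names adc_isr, data_ready and sample are purely illustrative):

#define sw_barrier() asm volatile("" : : : "memory")

static volatile int data_ready;
static int sample;

void adc_isr(void)           /* hypothetical interrupt handler */
{
    sample = 123;            /* produce the data ...                        */
    sw_barrier();            /* ... and only then publish the flag; without
                                the barrier gcc could sink the store        */
    data_ready = 1;
}

int wait_for_sample(void)
{
    while (!data_ready)      /* 'volatile' forces a fresh read each time    */
        ;
    sw_barrier();            /* don't hoist the read of 'sample' above the
                                flag check                                  */
    return sample;
}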
