In C++, can we guarantee a happens-before between two threads by volatile + memory fence (sfence + lfence)?
Briefly speaking, can the data stored in src be correctly copied to dst in the following code?
volatile bool flag = false;

// In thread A.
memcpy(mid, src, size);  // copy src into the intermediate buffer
__asm__ __volatile__("sfence" ::: "memory");
flag = true;

// In thread B.
while (flag == false);
__asm__ __volatile__("lfence" ::: "memory");
memcpy(dst, mid, size);  // copy the intermediate buffer into dst
Don't use this code in practice; use std::atomic<bool> with memory_order_release and memory_order_acquire to get the same asm code-gen (but without the unnecessary lfence and sfence).
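As a minimal sketch of that recommendation (not the asker's actual program): the function name copy_between_threads and the use of std::vector buffers are assumptions made for illustration. Thread A publishes mid with a release store, and thread B reads it after an acquire load sees true.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper: thread A copies src into mid, thread B copies mid into dst.
std::vector<unsigned char> copy_between_threads(const std::vector<unsigned char>& src) {
    const std::size_t size = src.size();
    std::vector<unsigned char> mid(size), dst(size);
    std::atomic<bool> flag{false};

    std::thread a([&] {
        if (size != 0) std::memcpy(mid.data(), src.data(), size);
        // Release store: the writes to mid above cannot be reordered after it.
        flag.store(true, std::memory_order_release);
    });
    std::thread b([&] {
        // Acquire load: the reads of mid below cannot be reordered before it.
        while (!flag.load(std::memory_order_acquire)) { }
        if (size != 0) std::memcpy(dst.data(), mid.data(), size);
    });
    a.join();
    b.join();
    assert(dst.size() == src.size());
    return dst;
}
```

On x86 the store and load here should compile to plain mov instructions, so nothing is lost compared to the volatile version.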
But yes, this looks safe, for compilers that define the behaviour of volatile such that data-race UB on the volatile bool flag isn't a problem. This is the case for compilers like GCC that can compile the Linux kernel (which rolls its own atomics using volatile like you're doing).
ISO C++ doesn't strictly require this; for example, a hypothetical implementation might exist on a machine without coherent shared memory, so atomic stores would require explicit flushing. But in practice there aren't any such implementations. (There are some embedded systems where volatile stores use different or extra instructions to make MMIO work, though.)
A barrier before a store makes it a release store, and a barrier after a load makes it an acquire load. See https://preshing.com/20120913/acquire-and-release-semantics/ . Happens-before can be established with just a release store seen by an acquire load.
The x86 asm memory model already forbids all reordering except StoreLoad, so only compile-time reordering needs to be blocked. This will compile to asm that's the same as what you'd get from using std::atomic<bool> with mo_release and mo_acquire, except for those inefficient LFENCE and SFENCE instructions.
"C++ How is release-and-acquire achieved on x86 only using MOV?" explains why the x86 asm memory model is at least as strong as acq_rel.
The sfence and lfence instructions inside the asm statements are totally irrelevant; only the asm("" ::: "memory") compiler barrier part is needed. See https://preshing.com/20120625/memory-ordering-at-compile-time/ . Compile-time reordering only has to respect the C++ memory model, but whatever the compiler picks is then nailed down by the x86 memory model. (Program-order + store buffer with store forwarding = slightly stronger than acq_rel.)
(A GNU C asm statement with no output operands is implicitly volatile, so I'm omitting the explicit volatile.)
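For illustration only, here is a sketch of the roll-your-own version with the fence instructions removed, leaving just the compiler barrier. It assumes GCC or Clang (GNU C inline asm) targeting x86, and it still has a data race on flag as far as ISO C++ is concerned; the macro name COMPILER_BARRIER and the function name copy_with_compiler_barriers are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// GNU C only: an empty asm statement with a "memory" clobber. It emits no
// instructions; it only stops the compiler from moving memory accesses across it.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

// Hypothetical helper mirroring the question's code, minus sfence and lfence.
// Relies on x86 hardware ordering and on the compiler's treatment of volatile.
std::vector<unsigned char> copy_with_compiler_barriers(const std::vector<unsigned char>& src) {
    const std::size_t size = src.size();
    std::vector<unsigned char> mid(size), dst(size);
    volatile bool flag = false;

    std::thread a([&] {
        if (size != 0) std::memcpy(mid.data(), src.data(), size);
        COMPILER_BARRIER();  // keep the copy before the flag store at compile time
        flag = true;
    });
    std::thread b([&] {
        while (flag == false) { }
        COMPILER_BARRIER();  // keep the copy after the flag load at compile time
        if (size != 0) std::memcpy(dst.data(), mid.data(), size);
    });
    a.join();
    b.join();
    assert(dst.size() == src.size());
    return dst;
}
```

This is only meant to show which part of the original asm statements was doing the work; std::atomic remains the portable choice.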
(Unless you're trying to synchronize NT stores? If so you only need sfence, not lfence.) "Does the Intel Memory Model make SFENCE and LFENCE redundant?" Yes. A memset (or memcpy) that internally uses NT stores will use sfence itself, to make itself compatible with the standard C++ atomics / ordering -> asm mapping used on x86. If you use a different mapping (like freely using NT stores without sfence), you could in theory break mutex critical sections unless you roll your own mutexes, too. (In practice most mutex implementations use a locked instruction in take and release, which is a full barrier.)
An empty asm statement with a memory clobber is sort of a roll-your-own equivalent to atomic_thread_fence(std::memory_order_acq_rel) because of x86's memory model. atomic_thread_fence(acq_rel) will compile to zero asm instructions, just blocking compile-time reordering.
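The portable counterpart of that empty asm statement can be sketched with a relaxed flag plus explicit fences. This is an illustrative example under the assumption that the buffers are std::vector objects; the function name copy_with_thread_fences is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper: same producer/consumer copy, ordered by fences
// instead of by the memory order of the flag accesses themselves.
std::vector<unsigned char> copy_with_thread_fences(const std::vector<unsigned char>& src) {
    const std::size_t size = src.size();
    std::vector<unsigned char> mid(size), dst(size);
    std::atomic<bool> flag{false};

    std::thread a([&] {
        if (size != 0) std::memcpy(mid.data(), src.data(), size);
        // Fence before the store: acts as a release fence for the relaxed store.
        std::atomic_thread_fence(std::memory_order_acq_rel);
        flag.store(true, std::memory_order_relaxed);
    });
    std::thread b([&] {
        while (!flag.load(std::memory_order_relaxed)) { }
        // Fence after the load: acts as an acquire fence for the relaxed load.
        std::atomic_thread_fence(std::memory_order_acq_rel);
        if (size != 0) std::memcpy(dst.data(), mid.data(), size);
    });
    a.join();
    b.join();
    assert(dst.size() == src.size());
    return dst;
}
```

On x86 these fences are expected to produce no instructions; on weakly ordered ISAs the compiler would emit whatever barrier the target needs.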
Only a seq_cst thread fence needs to emit any asm instructions, to flush the store buffer and wait for that to happen before any later loads, aka a full barrier (like mfence or a locked instruction like lock add qword ptr [rsp], 0).
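The classic case where a full barrier matters is the store-buffering litmus test: each thread stores to one variable, then loads the other. The sketch below (the function name count_both_zero is made up for this example) puts a seq_cst fence between the store and the load; with those fences in place, the C++ memory model does not allow both loads to return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering litmus test `iterations` times and returns how
// many times both threads read 0. With the seq_cst fences this should be 0.
int count_both_zero(int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            // Full barrier: the store must be globally visible before the load.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Replacing the fences with acq_rel fences (or removing them) would permit the both-zero outcome, since StoreLoad reordering is the one kind x86 hardware allows.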
Rolling your own atomics with volatile and inline asm

Yes, you can, and I hope you were just asking to understand how things work.
You ended up making something much less efficient than it needed to be because you used lfence (an out-of-order execution barrier that's essentially useless for memory ordering) instead of just a compiler barrier. And an unnecessary sfence.
See "When should I use _mm_sfence _mm_lfence and _mm_mfence" for basically the same problem but using intrinsics instead of inline asm. Generally you only want _mm_sfence() after NT-store intrinsics, and you should leave mfence up to the compiler with std::atomic.
"When to use volatile with multi threading?" Normally never; use std::atomic with mo_relaxed instead of volatile.
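As a small illustration of relaxed atomics standing in for volatile (the function name count_with_relaxed_atomic is hypothetical): several threads bump a shared counter with memory_order_relaxed. Each increment is an atomic read-modify-write, which a volatile long would not guarantee, yet no ordering with other memory operations is requested.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each add 1 to a shared counter
// `increments_per_thread` times, using relaxed atomic increments.
long count_with_relaxed_atomic(int threads, int increments_per_thread) {
    assert(threads >= 0 && increments_per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Atomic increment with no ordering constraints on other data.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load(std::memory_order_relaxed);
}
```

For example, count_with_relaxed_atomic(4, 100000) returns 400000, because no increment can be lost.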
If you're asking about the C++ memory model, then the answer is no: your code is not thread-safe, for multiple reasons:

1. Accesses to flag are not required to be atomic, even if it is marked as volatile. This means that thread B may observe a value that was not stored in flag, including a trap representation (i.e. a representation that corresponds to no valid value of bool). Using such a value may produce undefined behavior; for example, the observed value of flag may not be equal to either true or false.
2. Accessing a volatile variable does not constitute a "happens-before" relation. In other words, it is not a compiler or hardware fence, which allows the surrounding code to be reordered around the volatile reads or writes by either the compiler or the CPU. Your attempt to introduce a fence with the asm block is not portable.

Practically speaking, your code may produce a sequence of x86 instructions that will behave as you would expect. This would be a pure coincidence given that:

1. sizeof(bool) == 1 on x86 on pretty much every OS, and storing and loading bytes is atomic on x86. Note that there are platforms where sizeof(bool) > 1 and thus accessing it may not be atomic.
2. The __volatile__ qualifier prevents the compiler from reordering the asm block (and the fence it implements) with the surrounding code. This will make the asm blocks with hardware fences effective with that compiler and compatible ones.

But I'll repeat, if the code works, then only by coincidence. It doesn't have to, even on x86, as the compiler is free to optimize this code as it wants to, since, as far as it is concerned, no thread concurrency is involved here.

You may rely on guarantees provided by the specific compiler, such as non-standard semantics of volatile, intrinsics and asm blocks, but at that point your program is not portable C/C++ and is written for the specific compiler, possibly with a specific set of command line switches.