
In C++, can we guarantee a happens-before between two threads by volatile + memory fence (sfence + lfence)?

Briefly speaking, can the data stored in src be correctly copied to dst in the following code?

volatile bool flag = false;

// In thread A.
memcpy(mid, src, size);
__asm__ __volatile__("sfence" ::: "memory");
flag = true;

// In thread B.
while (flag == false);
__asm__ __volatile__("lfence" ::: "memory");
memcpy(dst, mid, size);

https://gcc.gnu.org/wiki/DontUseInlineAsm

Don't use this code in practice; use std::atomic<bool> with memory_order_release and acquire to get the same asm code-gen (but without the unnecessary lfence and sfence).
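
For reference, a minimal sketch of that std::atomic<bool> version (buffer names and sizes here are made up for illustration; the question's actual declarations of src, mid, dst and size aren't shown):

#include <atomic>
#include <cstring>
#include <thread>

static char src[64] = "payload", mid[64], dst[64];
static std::atomic<bool> flag{false};

int main() {
    std::thread a([] {
        std::memcpy(mid, src, sizeof src);                 // plain stores
        flag.store(true, std::memory_order_release);       // release store publishes them
    });
    std::thread b([] {
        while (!flag.load(std::memory_order_acquire)) {}   // acquire load pairs with the release store
        std::memcpy(dst, mid, sizeof mid);                 // guaranteed to see thread A's copy
    });
    a.join();
    b.join();
}

On x86 this compiles to the same plain mov stores and loads as the volatile version, just without the fence instructions.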


But yes, this looks safe, for compilers that define the behaviour of volatile such that data-race UB on the volatile bool flag isn't a problem. This is the case for compilers like GCC that can compile the Linux kernel (which rolls its own atomics using volatile like you're doing).

ISO C++ doesn't strictly require this; for example, a hypothetical implementation might exist on a machine without coherent shared memory, so atomic stores would require explicit flushing. But in practice there aren't any such implementations. (There are some embedded systems where volatile stores use different or extra instructions to make MMIO work, though.)


A barrier before a store makes it a release store, and a barrier after a load makes it an acquire load: https://preshing.com/20120913/acquire-and-release-semantics/. Happens-before can be established with just a release store seen by an acquire load.

The x86 asm memory model already forbids all reordering except StoreLoad, so only compile-time reordering needs to be blocked. This will compile to asm that's the same as what you'd get from using std::atomic<bool> with mo_release and mo_acquire, except for those inefficient LFENCE and SFENCE instructions.

How is release-and-acquire achieved on x86 only using MOV? explains why the x86 asm memory model is at least as strong as acq_rel.


The sfence and lfence instructions inside the asm statements are totally irrelevant; only the asm("" ::: "memory") compiler-barrier part is needed: https://preshing.com/20120625/memory-ordering-at-compile-time/. Compile-time reordering only has to respect the C++ memory model, but whatever the compiler picks is then nailed down by the x86 memory model. (Program order + a store buffer with store forwarding = slightly stronger than acq_rel.)

(A GNU C asm statement with no output operands is implicitly volatile, so I'm omitting the explicit volatile.)

(Unless you're trying to synchronize NT stores? If so you only need sfence, not lfence.) Does the Intel Memory Model make SFENCE and LFENCE redundant? Yes. A memcpy that internally uses NT stores will use sfence itself, to make itself compatible with the standard C++ atomics / ordering -> asm mapping used on x86. If you use a different mapping (like freely using NT stores without sfence), you could in theory break mutex critical sections unless you roll your own mutexes, too. (In practice most mutex implementations use a locked instruction in take and release, which is a full barrier.)

An empty asm statement with a memory clobber is sort of a roll-your-own equivalent of atomic_thread_fence(std::memory_order_acq_rel), because of x86's memory model. atomic_thread_fence(acq_rel) will compile to zero asm instructions, just blocking compile-time reordering.

Only a seq_cst thread fence needs to emit any asm instructions to flush the store buffer and wait for that to happen before any later loads, aka a full barrier (like mfence or a locked instruction such as lock add qword ptr [rsp], 0).
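
As a sketch of the fence-based formulation of the same idea (hypothetical names; relaxed atomic accesses plus standalone fences). On x86 the release/acquire fences cost zero instructions, while a seq_cst fence would emit mfence or a locked RMW:

#include <atomic>

int payload;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                         // plain store
    std::atomic_thread_fence(std::memory_order_release);  // zero x86 instructions; blocks compile-time reordering
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) {}
    std::atomic_thread_fence(std::memory_order_acquire);  // zero x86 instructions
    int x = payload;                                      // guaranteed to read 42
    (void)x;
}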


Don't roll your own atomics using volatile and inline asm

Yes, you can, and I hope you were just asking to understand how things work.

You ended up making something much less efficient than it needed to be because you used lfence (an out-of-order execution barrier that's essentially useless for memory ordering) instead of just a compiler barrier. And an unnecessary sfence.

See When should I use _mm_sfence, _mm_lfence and _mm_mfence for basically the same problem but using intrinsics instead of inline asm. Generally you only want _mm_sfence() after NT-store intrinsics, and you should leave mfence up to the compiler with std::atomic.
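
For example, the NT-store case might look like this sketch (hypothetical buffer and flag names; non-temporal stores are exempt from the normal x86 store ordering, so sfence is needed before publishing):

#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_set1_epi8, _mm_sfence
#include <atomic>

alignas(16) char buf[64];
std::atomic<bool> ready{false};

void nt_publish() {
    __m128i v = _mm_set1_epi8('x');
    for (int i = 0; i < 64; i += 16)
        _mm_stream_si128(reinterpret_cast<__m128i*>(buf + i), v);  // non-temporal stores
    _mm_sfence();                                   // order the NT stores before the flag store
    ready.store(true, std::memory_order_release);
}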

When to use volatile with multi threading? - normally never; use std::atomic with mo_relaxed instead of volatile.
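
For instance, a keep-running flag that doesn't publish any other data needs atomicity but no ordering, so relaxed is enough (a sketch with made-up names):

#include <atomic>

std::atomic<bool> keep_running{true};   // instead of 'volatile bool'

void worker() {
    while (keep_running.load(std::memory_order_relaxed)) {
        // ... work that doesn't depend on this flag for ordering other data ...
    }
}

void request_stop() { keep_running.store(false, std::memory_order_relaxed); }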

If you're asking about the C++ memory model, then the answer is no, your code is not thread-safe, for multiple reasons:

  1. In the C++ memory model, concurrent access to an object from multiple threads, where at least one access is a modification, constitutes a data race, which is UB. The only exception to this rule is thread synchronization primitives, like atomics, mutexes, condition variables, etc. Volatile variables are not an exception. (A concrete consequence is sketched after this list.)
  2. Accesses to the variable flag are not required to be atomic, even if it is marked as volatile. This means that thread B may observe a value that was not stored in flag, including a trap representation (i.e. a representation that corresponds to no valid value of bool). Using such a value may produce undefined behavior; for example, the observed value of flag may compare equal to neither true nor false.
  3. Writing or reading a volatile variable does not constitute a "happens-before" relation. In other words, it is not a compiler or hardware fence, which allows the surrounding code to be reordered around the volatile reads or writes by either the compiler or the CPU. Your attempt to introduce a fence with the asm block is not portable.
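
As a sketch of the consequence mentioned in point 1: with a flag that is neither volatile nor atomic, the compiler sees no concurrency and may legally hoist the load out of the loop:

bool plain_flag = false;    // neither volatile nor atomic

void wait_for_flag() {
    while (plain_flag == false) {}   // data race -> UB
}
// A conforming compiler may compile this as if it were written:
//     if (plain_flag == false) for (;;) {}
// so the loop never observes another thread's store.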

Practically speaking, your code may produce a sequence of x86 instructions that will behave as you would expect. This would be a pure coincidence given that:

  1. sizeof(bool) == 1 on x86 on pretty much every OS, and storing and loading bytes is atomic on x86. Note that there are platforms where sizeof(bool) > 1 and thus accessing it may not be atomic.
  2. On x86, regular stores and loads are ordered. In other words, a later store cannot be reordered before an earlier one by the CPU; same with loads. Many other CPU architectures are not so strict.
  3. Most compilers will avoid reordering code around volatile operations. Some compilers, like MSVC, even consider volatile operations to be compiler fences. That is not the case with gcc, though. Luckily, the __volatile__ qualifier prevents the compiler from reordering the asm block (and the fence it implements) with the surrounding code. This makes the asm blocks with hardware fences effective with that compiler and compatible ones.

But I'll repeat: if the code works, then only by coincidence. It doesn't have to, even on x86, as the compiler is free to optimize this code as it wants to, since, as far as it is concerned, no thread concurrency is involved here. You may rely on guarantees provided by a specific compiler, such as non-standard semantics of volatile, intrinsics and asm blocks, but at that point your program is not portable C/C++ and is written for that specific compiler, possibly with a specific set of command-line switches.
