
Why does a std::atomic store with sequential consistency use XCHG?

Why is std::atomic's store:

std::atomic<int> my_atomic;
my_atomic.store(1, std::memory_order_seq_cst);

doing an xchg when a store with sequential consistency is requested?


Shouldn't, technically, a normal store with a read/write memory barrier be enough? Equivalent to:

_ReadWriteBarrier(); // Or `asm volatile("" ::: "memory");` for gcc/clang
my_atomic.store(1, std::memory_order_acquire);

I'm explicitly talking about x86 & x86_64. Where a store has an implicit acquire fence.

mov-store + mfence and xchg are both valid ways to implement a sequential-consistency store on x86. The implicit lock prefix on an xchg with memory makes it a full memory barrier, like all atomic RMW operations on x86.

(x86's memory-ordering rules essentially make that full-barrier effect the only option for any atomic RMW: it's both a load and a store at the same time, stuck together in the global order. Atomicity requires that the load and store aren't separated by just queuing the store into the store buffer so it has to be drained, and load-load ordering of the load side requires that it not reorder.)

Plain mov is not sufficient; it only has release semantics, not sequential-release. (Unlike AArch64's stlr instruction, which does do a sequential-release store that can't reorder with later ldar sequential-acquire loads. This choice is obviously motivated by C++11 having seq_cst as the default memory ordering. But AArch64's normal store is much weaker; relaxed, not release.)

See Jeff Preshing's article on acquire / release semantics, and note that regular release stores (like mov or any non-locked x86 memory-destination instruction other than xchg) allow reordering with later operations, including acquire loads (like mov or any x86 memory-source operand). For example, if the release-store is releasing a lock, it's OK for later stuff to appear to happen inside the critical section.


There are performance differences between mfence and xchg on different CPUs, and maybe in the hot vs. cold cache and contended vs. uncontended cases. And/or for throughput of many operations back-to-back in the same thread vs. for one on its own, and for allowing surrounding code to overlap execution with the atomic operation.

See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of mfence vs. lock addl $0, -8(%rsp) vs. (%rsp) as a full barrier (when you don't already have a store to do).

On Intel Skylake hardware, mfence blocks out-of-order execution of independent ALU instructions, but xchg doesn't. (See my test asm + results in the bottom of this SO answer.) Intel's manuals don't require it to be that strong; only lfence is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.

I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079, MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, kind of a blunt-instrument approach that significantly increases the impact on surrounding code.

I've only tested the single-threaded case where the cache line is hot in L1d cache. (Not when it's cold in memory, or when it's in Modified state on another core.) xchg has to load the previous value, creating a "false" dependency on the old value that was in memory. But mfence forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's mfence forces everything to wait, not just loads.

AMD's optimization manual recommends xchg for atomic seq-cst stores. I thought Intel recommended mov + mfence, which older gcc uses, but Intel's compiler also uses xchg here.

When I tested, I got better throughput on Skylake for xchg than for mov + mfence in a single-threaded loop storing repeatedly to the same location. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on locked operations.

See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst my_atomic = 4; gcc uses mov + mfence when SSE2 is available. (Use -m32 -mno-sse2 to get gcc to use xchg too.) The other 3 compilers all prefer xchg with default tuning, or for znver1 (Ryzen) or skylake.

The Linux kernel uses xchg for __smp_store_mb().

Update: recent GCC (like GCC10) changed to using xchg for seq-cst stores like other compilers do, even when SSE2 for mfence is available.


Another interesting question is how to compile atomic_thread_fence(mo_seq_cst);. The obvious option is mfence, but lock or dword [rsp], 0 is another valid option (and used by gcc -m32 when MFENCE isn't available). The bottom of the stack is usually already hot in cache, in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good so delaying ret's ability to read it is not much of a problem.) So lock or dword [rsp-4], 0 could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. This was before it was known that it might be better than mfence even when mfence was available.)

All compilers currently use mfence for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.

But multiple sources recommend using lock add to the stack as a barrier instead of mfence, so the Linux kernel recently switched to using it for the smp_mb() implementation on x86, even when SSE2 is available.

See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about movntdqa loads from WC memory passing earlier locked instructions. (Opposite of Skylake, where it was mfence rather than locked instructions that were a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses mfence for its mb() for drivers, in case anything ever uses NT loads to copy back from video RAM or something, but can't let the reads happen until after an earlier store is visible.)

  • In Linux 4.14, smp_mb() uses mb(). That uses mfence if available, otherwise lock addl $0, 0(%esp).

    __smp_store_mb (store + memory barrier) uses xchg (and that doesn't change in later kernels).

  • In Linux 4.15, smp_mb() uses lock; addl $0,-4(%esp) (or %rsp), instead of using mb(). (The kernel doesn't use a red-zone even in 64-bit, so the -4 may help avoid extra latency for local vars.)

    mb() is used by drivers to order access to MMIO regions, but smp_mb() turns into a no-op when compiled for a uniprocessor system. Changing mb() is riskier because it's harder to test (affects drivers), and CPUs have errata related to lock vs. mfence. But anyway, mb() uses mfence if available, else lock addl $0, -4(%esp). The only change is the -4.

  • In Linux 4.16, no change except removing the #if defined(CONFIG_X86_PPRO_FENCE) which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.


x86 & x86_64. Where a store has an implicit acquire fence

You mean release, I hope. my_atomic.store(1, std::memory_order_acquire); won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.

Or asm volatile("" ::: "memory");

No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)

Anyway, another way to express what you want here is:无论如何,另一种表达你想要的方式是:

my_atomic.store(1, std::memory_order_release);        // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence

Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). A release-acquire fence would do the trick, though, keeping later loads from happening early and not itself being able to reorder with the release store.

Related: Jeff Preshing's article on fences being different from release operations.

But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker order + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all stores have a single total order which all cores agree on. See also Globally Invisible load instructions: loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)
