
Memory ordering restrictions on x86 architecture

Original post: 2012-05-10 15:54:41 6 5 c++/ multithreading/ architecture/ c++11/ memory-model

In his great book 'C++ Concurrency in Action', Anthony Williams writes the following (page 309):

For example, on x86 and x86-64 architectures, atomic load operations are always the same, whether tagged memory_order_relaxed or memory_order_seq_cst (see section 5.3.3). This means that code written using relaxed memory ordering may work on systems with an x86 architecture, where it would fail on a system with a finer-grained set of memory-ordering instructions such as SPARC.

Do I get this right that on x86 architecture all atomic load operations are memory_order_seq_cst? In addition, the cppreference page for std::memory_order mentions that on x86 release-acquire ordering is automatic.

If this restriction is valid, do the orderings still apply to compiler optimizations?

5 answers

Yes, ordering still applies to compiler optimizations.

Also, it is not entirely accurate that on x86 "atomic load operations are always the same".

On x86, all loads done with mov have acquire semantics and all stores done with mov have release semantics. So acq_rel, acq and relaxed loads are simple movs, and similarly for acq_rel, rel and relaxed stores (acq stores and rel loads are always equal to relaxed).
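As a minimal sketch of the acquire/release pairing described above, written in portable C++ rather than x86 assembly: the producer publishes a plain variable through a release store, and the consumer picks it up after an acquire load. The names message_passing_once, payload and ready are made up for this illustration. On x86 a compiler would typically emit plain mov instructions for both atomic accesses, but that is something to check in the generated assembly, not a guarantee of this code.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs one producer/consumer round and returns the
// value the consumer observed in `payload`.
int message_passing_once() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });

    std::thread consumer([&] {
        // Spin until the flag is visible; the acquire load synchronizes
        // with the release store, so the write to payload is visible too.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

With memory_order_relaxed on both accesses the same code would contain a data race on payload under the C++ rules, even though it would very likely still appear to work on x86 hardware.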

This, however, is not necessarily true for seq_cst: the architecture does not guarantee seq_cst semantics for mov. In fact, the x86 instruction set does not have any specific instruction for sequentially consistent loads and stores. Only atomic read-modify-write operations on x86 have seq_cst semantics. Hence, you could get seq_cst semantics for loads by doing a fetch_and_add operation (lock xadd instruction) with an argument of 0, and seq_cst semantics for stores by doing a seq_cst exchange operation (xchg instruction) and discarding the previous value.
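The two read-modify-write tricks mentioned above can be written with std::atomic roughly as follows. This is only a sketch: load_via_rmw and store_via_rmw are hypothetical helper names, and whether a given compiler really emits lock xadd and xchg for them is an assumption to confirm by inspecting its output.

```cpp
#include <atomic>

// Hypothetical helper: a "load" implemented as a read-modify-write that
// adds 0, so the value is left unchanged and the old value is returned.
int load_via_rmw(std::atomic<int>& a) {
    return a.fetch_add(0, std::memory_order_seq_cst);
}

// Hypothetical helper: a "store" implemented as an exchange whose
// previous value is deliberately discarded.
void store_via_rmw(std::atomic<int>& a, int value) {
    (void)a.exchange(value, std::memory_order_seq_cst);
}
```

Note that load_via_rmw takes a non-const reference: a read-modify-write needs write access to the location even when it does not change the value, which is one practical reason to prefer a plain load where it is sufficient.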

But you do not need to do both! As long as all seq_cst stores are done with xchg, seq_cst loads can be implemented simply with a mov. Dually, if all loads were done with lock xadd, seq_cst stores could be implemented simply with a mov.

xchg and lock xadd are much slower than mov. Because a program (usually) has more loads than stores, it is convenient to do seq_cst stores with xchg so that the (more frequent) seq_cst loads can simply use a mov. This implementation detail is codified in the x86 Application Binary Interface (ABI). On x86, a compliant compiler must compile seq_cst stores to xchg so that seq_cst loads (which may appear in another translation unit, compiled with a different compiler) can be done with the faster mov instruction.

Thus it is not true in general that seq_cst and acquire loads are done with the same instruction on x86. It is only true because the ABI specifies that seq_cst stores be compiled to an xchg.

The compiler must of course follow the rules of the language, whatever hardware it runs on.

What he says is that on an x86 you don't have relaxed ordering, so you get a stricter ordering even if you don't ask for it. That also means that such code tested on an x86 might not work properly on a system that does have relaxed ordering.

It is worth keeping in mind that although a relaxed load and a seq_cst load may map to the same instruction on x86, they are not the same. A relaxed load can be freely reordered by the compiler across memory operations to different memory locations, while a seq_cst load cannot be reordered across other memory operations.
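To make the distinction concrete, here is a small sketch with two functions that differ only in the requested ordering. The globals g_flag and g_data and both function names are invented for this example. The comments describe what the C++ memory model permits the compiler to do; they are not a claim about what any particular compiler actually does.

```cpp
#include <atomic>

// Hypothetical shared variables used only for this illustration.
std::atomic<int> g_flag{0};
std::atomic<int> g_data{0};

// Relaxed: the two loads touch different locations, so the compiler is
// allowed to perform them in either order.
int sum_relaxed() {
    int f = g_flag.load(std::memory_order_relaxed);
    int d = g_data.load(std::memory_order_relaxed);
    return f + d;
}

// seq_cst: the two loads must take effect in program order within the
// single total order of seq_cst operations, so the compiler has to keep
// them in this order even if each one is still a plain mov on x86.
int sum_seq_cst() {
    int f = g_flag.load(std::memory_order_seq_cst);
    int d = g_data.load(std::memory_order_seq_cst);
    return f + d;
}
```

Run from a single thread, the two functions return the same result; the difference only becomes observable when other threads modify g_flag and g_data concurrently.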

The sentence from the book is written in a somewhat misleading way. The ordering obtained on an architecture depends not just on how you translate atomic loads, but also on how you translate atomic stores.

The usual way to implement seq_cst on x86 is to flush the store buffer at some point between any seq_cst store and a subsequent seq_cst load from the same thread. The usual way for the compiler to guarantee this is to flush after stores, since there are fewer stores than loads. In this translation, seq_cst loads don't need to flush.
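The "flush between a store and a later load" idea can be illustrated at the source level with the classic store-buffering litmus test. In the sketch below, each thread stores to its own variable, issues a std::atomic_thread_fence(std::memory_order_seq_cst), and then loads the other thread's variable. The function name count_both_zero is hypothetical. On x86 such a fence is typically compiled to mfence or a locked instruction, but that mapping is an assumption about the compiler, not something this code relies on.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the store-buffering litmus test `rounds`
// times and returns how often both threads read 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            // Full fence between the store and the following load.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });

        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;  // forbidden when both fences are present
        }
    }
    return both_zero;
}
```

With both fences in place the C++ memory model rules out the outcome r1 == 0 && r2 == 0. Without them, that outcome is allowed, and it is the one reordering that x86 store buffers can actually produce with plain mov stores and loads.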

If you program x86 with just plain loads and stores, loads are guaranteed to provide acquire semantics, not seq_cst.

As for compiler optimization, in C11/C++11 the compiler performs optimizations that move code based on the semantics of the particular atomics, before considering the underlying hardware. (The hardware might provide stronger ordering, but there's no reason for the compiler to restrict its optimizations because of this.)

Do I get this right that on x86 architecture all atomic load operations are memory_order_seq_cst?

Only executions (of a program, or of some inter-thread visible operations in a program) can be sequential. A single operation is not in itself sequential.

Asking whether the implementation of a single isolated operation is sequential is a meaningless question.

The translation of all memory operations that need some guarantee must be done following a strategy that enables that guarantee. There can be different strategies that have different compiler complexity costs and runtime costs.

[Just as there are different strategies to implement virtual functions: the only one that is OK (that fits all our expectations of speed, predictability and simplicity) is the use of vtables, so all compilers use vtables, but a virtual function is not defined as going through the vtable.]

In practice, there are not widely different strategies used to implement memory_order_seq_cst operations on a given CPU (that I know of). The differences between compilers are small and do not impede binary compatibility. But there are potential differences, and advanced global optimization of multi-threaded programs might open new opportunities for more efficient code generation for atomic operations.

Depending on your compiler, a program that contains only relaxed loads and memory_order_seq_cst modifications of std::atomic<> objects may or may not exhibit only sequential behaviors, even on a strongly ordered CPU.