
Memory ordering restrictions on x86 architecture

Original 2012-05-10 15:54:41 8 5 c++/ multithreading/ architecture/ c++11/ memory-model

In his great book 'C++ Concurrency in Action' Anthony Williams writes the following (page 309):

For example, on x86 and x86-64 architectures, atomic load operations are always the same, whether tagged memory_order_relaxed or memory_order_seq_cst (see section 5.3.3). This means that code written using relaxed memory ordering may work on systems with an x86 architecture, where it would fail on a system with a finer-grained set of memory-ordering instructions such as SPARC.

Do I get this right that on x86 architecture all atomic load operations are memory_order_seq_cst? In addition, the cppreference std::memory_order page mentions that on x86 release-acquire ordering is automatic.

If this restriction is valid, do the orderings still apply to compiler optimizations?

5 answers

Yes, ordering still applies to compiler optimizations.

Also, it is not entirely exact that on x86 "atomic load operations are always the same".

On x86, all loads done with mov have acquire semantics and all stores done with mov have release semantics. So acq_rel, acq and relaxed loads are simple movs, and similarly acq_rel, rel and relaxed stores (acq stores and rel loads are always equal to relaxed).

This, however, is not necessarily true for seq_cst: the architecture does not guarantee seq_cst semantics for mov. In fact, the x86 instruction set does not have any specific instruction for sequentially consistent loads and stores. Only atomic read-modify-write operations on x86 will have seq_cst semantics. Hence, you could get seq_cst semantics for loads by doing a fetch_and_add operation (lock xadd instruction) with an argument of 0, and seq_cst semantics for stores by doing a seq_cst exchange operation (xchg instruction) and discarding the previous value.

But you do not need to do both! As long as all seq_cst stores are done with xchg, seq_cst loads can be implemented simply with a mov. Dually, if all loads were done with lock xadd, seq_cst stores could be implemented simply with a mov.

xchg and lock xadd are much slower than mov. Because a program (usually) has more loads than stores, it is convenient to do seq_cst stores with xchg so that the (more frequent) seq_cst loads can simply use a mov. This implementation detail is codified in the x86 Application Binary Interface (ABI). On x86, a compliant compiler must compile seq_cst stores to xchg (or to an equivalent full-barrier sequence such as mov followed by mfence) so that seq_cst loads (which may appear in another translation unit, compiled with a different compiler) can be done with the faster mov instruction.

Thus it is not true in general that seq_cst and acquire loads are done with the same instruction on x86. It is only true because the ABI specifies that seq_cst stores be compiled to an xchg.
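The mapping described in this answer can be written down as a small C++ sketch. The function names below are hypothetical and chosen only for illustration, and the instruction names in the comments describe what compilers commonly emit on x86-64 rather than anything the C++ standard requires; the actual instruction selection is up to the compiler.

```cpp
#include <atomic>

// seq_cst store: commonly emitted as xchg (a full barrier) on x86-64.
void store_seq_cst(std::atomic<int>& a, int v) {
    a.store(v, std::memory_order_seq_cst);
}

// seq_cst load: commonly a plain mov, because the cost is paid on the store side.
int load_seq_cst(const std::atomic<int>& a) {
    return a.load(std::memory_order_seq_cst);
}

// Relaxed load: also a plain mov on x86, but the compiler has more freedom
// to reorder it relative to accesses to other locations.
int load_relaxed(const std::atomic<int>& a) {
    return a.load(std::memory_order_relaxed);
}

// The "dual" strategy from the answer: read through a read-modify-write
// that adds 0 (the lock xadd idea). It returns the current value unchanged.
int load_via_rmw(std::atomic<int>& a) {
    return a.fetch_add(0, std::memory_order_seq_cst);
}

// Store through an exchange, discarding the previous value (the xchg idea).
void store_via_exchange(std::atomic<int>& a, int v) {
    (void)a.exchange(v, std::memory_order_seq_cst);
}
```

All five functions behave identically in single-threaded code; they differ only in the ordering guarantees they give to other threads and in what the compiler is allowed to emit. Comparing the generated assembly for your own compiler is the reliable way to see which strategy it uses.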

The compiler must of course follow the rules of the language, whatever hardware it runs on.

What he says is that on an x86 you don't have relaxed ordering, so you get a stricter ordering even if you don't ask for it. That also means that such code tested on an x86 might not work properly on a system that does have relaxed ordering.

It is worth keeping in mind that although a relaxed load and a seq_cst load may map to the same instruction on x86, they are not the same. A relaxed load can be freely reordered by the compiler across memory operations to different memory locations, while a seq_cst load cannot be reordered across other memory operations.
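A minimal sketch of why the distinction matters even on x86, assuming a simple flag-and-payload handoff (the `Channel` struct and `run_message_passing` function are hypothetical names, not from the question). With release/acquire the compiler may not move the payload accesses across the flag operations. If both flag operations were relaxed, the compiler would be free to do so and the program would contain a data race, even though the hardware instructions would likely be the same plain mov.

```cpp
#include <atomic>
#include <thread>

// Hypothetical handoff structure: a plain payload guarded by an atomic flag.
struct Channel {
    int payload = 0;
    std::atomic<bool> ready{false};
};

// Runs one producer and one consumer and returns what the consumer read.
int run_message_passing() {
    Channel ch;
    int received = 0;

    std::thread consumer([&] {
        // Acquire load: later reads (the payload) cannot be moved before it.
        while (!ch.ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        received = ch.payload;
    });

    std::thread producer([&] {
        ch.payload = 42;
        // Release store: earlier writes (the payload) cannot be moved after it.
        ch.ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return received;  // safe to read: join() synchronizes with the threads
}
```

Calling `run_message_passing()` yields 42 on any conforming implementation, because the guarantee comes from the language-level ordering and not from the strength of the x86 hardware.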

The sentence from the book is written in a somewhat misleading way. The ordering obtained on an architecture depends not just on how you translate atomic loads, but also on how you translate atomic stores.

The usual way to implement seq_cst on x86 is to flush the store buffer at some point between any seq_cst store and a subsequent seq_cst load from the same thread. The usual way for the compiler to guarantee this is to flush after stores, since there are fewer stores than loads. In this translation, seq_cst loads don't need to flush.

If you program x86 with just plain loads and stores, loads are guaranteed to provide acquire semantics, not seq_cst.
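The difference between acquire and seq_cst shows up in the classic store-buffer litmus test, sketched below under the assumption of two threads that each store to one variable and then load the other. The helper names are hypothetical. With seq_cst on every operation, the outcome where both threads read 0 is forbidden by the C++ memory model; with only release stores and acquire loads (plain mov on x86), that outcome is allowed, because each store may still sit in its core's store buffer when the other load executes.

```cpp
#include <atomic>
#include <thread>

// One round of the store-buffer test using seq_cst for all operations.
// Returns true if both threads observed 0, which seq_cst forbids.
bool both_read_zero_seq_cst() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });

    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the test and counts forbidden outcomes; the count should stay 0.
int count_forbidden_outcomes(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_read_zero_seq_cst()) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

Replacing the orderings with memory_order_release and memory_order_acquire makes the both-zero outcome legal, although observing it in practice depends on timing and on how closely the two threads overlap.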

As for compiler optimization, in C11/C++11 the compiler performs optimizations that move code according to the semantics of the particular atomics, before considering the underlying hardware. (The hardware might provide stronger ordering, but there's no reason for the compiler to restrict its optimizations because of this.)

Do I get this right that on x86 architecture all atomic load operations are memory_order_seq_cst?

Only executions (of a program, of some inter thread visible operations in a program) can be sequential. A single operation is not in itself sequential.

Asking whether the implementation of a single isolated operation is sequential is a meaningless question.

The translation of all memory operations that need some guarantee must be done following a strategy that enables that guarantee. There can be different strategies that have different compiler complexity costs and runtime costs.

[Just as there are different strategies to implement virtual functions: the only one that is OK (that fits all our expectations of speed, predictability and simplicity) is the use of vtables, so all compilers use vtables, but a virtual function is not defined as going through the vtable.]

In practice, there are not widely different strategies used to implement memory_order_seq_cst operations on a given CPU (that I know of). The differences between compilers are small and do not impede binary compatibility. But there are potential differences, and advanced global optimization of multi-threaded programs might open new opportunities for more efficient code generation for atomic operations.

Depending on your compiler, a program that contains only relaxed loads and memory_order_seq_cst modifications of std::atomic<> objects may or may not exhibit only sequential behaviors, even on a strongly ordered CPU.

solution1
6 2013-08-29 13:21:57

solution2
2 2012-05-10 16:18:45

solution3
0 2013-09-03 05:38:13

solution4
0 2013-11-07 13:58:19

solution5
0 2019-12-12 03:44:50