Acquire/Release versus Sequentially Consistent memory order
For any std::atomic<T> where T is a primitive type:

If I use std::memory_order_acq_rel for fetch_xxx operations, std::memory_order_acquire for the load operation, and std::memory_order_release for the store operation blindly (I mean just resetting the default memory ordering of those functions):

Will the results be the same as using std::memory_order_seq_cst (which is the default) for all of the declared operations? If the results are the same, does this usage differ from using std::memory_order_seq_cst in terms of efficiency?
The C++11 memory ordering parameters for atomic operations specify constraints on the ordering. If you do a store with std::memory_order_release, and a load from another thread reads the value with std::memory_order_acquire, then subsequent read operations from the second thread will see any values stored to any memory location by the first thread that were prior to the store-release, or a later store to any of those memory locations.

If both the store and subsequent load are std::memory_order_seq_cst then the relationship between these two threads is the same. You need more threads to see the difference.
e.g. std::atomic<int> variables x and y, both initially 0.

Thread 1:

x.store(1,std::memory_order_release);

Thread 2:

y.store(1,std::memory_order_release);

Thread 3:

int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire);

Thread 4:

int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);
As written, there is no relationship between the stores to x and y, so it is quite possible to see a==1, b==0 in thread 3, and c==1 and d==0 in thread 4.
If all the memory orderings are changed to std::memory_order_seq_cst then this enforces an ordering between the stores to x and y. Consequently, if thread 3 sees a==1 and b==0 then that means the store to x must be before the store to y, so if thread 4 sees c==1, meaning the store to y has completed, then the store to x must also have completed, so we must have d==1.
In practice, using std::memory_order_seq_cst everywhere will add additional overhead to either loads or stores or both, depending on your compiler and processor architecture. e.g. a common technique for x86 processors is to use XCHG instructions rather than MOV instructions for std::memory_order_seq_cst stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release a plain MOV will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
Memory ordering is hard. I devoted almost an entire chapter to it in my book.

Memory ordering can be quite tricky, and the effects of getting it wrong are often very subtle.
The key point with all memory ordering is that it guarantees what "HAS HAPPENED", not what is going to happen. For example, if you store something to a couple of variables (e.g. x = 7; y = 11;), then another processor may be able to see y as 11 before it sees the value 7 in x. By using a memory ordering operation between setting x and setting y, the processor that you are using will guarantee that x = 7; has been written to memory before it continues to store something in y.
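A minimal sketch of that guarantee, using a release store on y paired with an acquire load (the function names and the -1 "not ready" sentinel are illustrative assumptions):

```cpp
#include <atomic>
#include <cassert>

int x_plain = 0;        // ordinary (non-atomic) data
std::atomic<int> y{0};  // publication flag

void writer() {
    x_plain = 7;                             // (1) write the data
    y.store(11, std::memory_order_release);  // (2) release: (1) is ordered before this
}

int reader() {
    // Acquire pairs with the writer's release: if we see y == 11,
    // the write x_plain = 7 is guaranteed to be visible too.
    if (y.load(std::memory_order_acquire) == 11)
        return x_plain;
    return -1;  // y not published yet (assumed sentinel)
}
```

If the reader ever observes y == 11, the release/acquire pairing guarantees it also observes x_plain == 7; without it, a weakly ordered machine could show 11 and a stale 0.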
Most of the time, it's not REALLY important which order your writes happen in, as long as the value is updated eventually. But if we, say, have a circular buffer of integers, and we do something like:

buffer[index] = 32;
index = (index + 1) % buffersize;

and some other thread is using index to determine that the new value has been written, then we NEED to have 32 written FIRST, and index updated AFTER. Otherwise, the other thread may get old data.
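That ordering requirement maps directly onto a release store of the index. Below is a single-producer/single-consumer sketch (buffersize, the function names, and the -1 "nothing new" sentinel are assumptions for illustration):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t buffersize = 16;
int buffer[buffersize];
std::atomic<std::size_t> index{0};

void producer_write(int value) {
    std::size_t i = index.load(std::memory_order_relaxed);  // only producer writes index
    buffer[i] = value;                    // write the data FIRST
    index.store((i + 1) % buffersize,     // publish the index AFTER, with release
                std::memory_order_release);
}

int consumer_read(std::size_t last_seen) {
    // Acquire pairs with the producer's release store: once the consumer
    // observes the advanced index, the buffer write is visible as well.
    std::size_t i = index.load(std::memory_order_acquire);
    if (i != last_seen)
        return buffer[last_seen];         // safe: data write happened-before
    return -1;                            // nothing new yet (assumed sentinel)
}
```

With plain (non-atomic, unordered) writes, the compiler or CPU could make the index update visible before the data write, which is exactly the stale-read problem described above.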
The same applies to making semaphores, mutexes and such things work - this is why the terms release and acquire are used for the memory barrier types.
Now, seq_cst is the most strict ordering rule - it enforces that both reads and writes of the data you've written go out to memory before the processor can continue to do more operations. This will be slower than using the specific acquire or release barriers. It forces the processor to make sure stores AND loads have been completed, as opposed to just stores or just loads.
How much difference does that make? It is highly dependent on the system architecture. On some systems, the cache needs to be flushed [partially] and interrupts sent from one core to another to say "Please do this cache-flushing work before you continue" - this can take several hundred cycles. On other processors, it's only some small percentage slower than doing a regular memory write. x86 is pretty good at doing this fast. Some types of embedded processors, ARM for example (some models - not sure?), require a bit more work in the processor to ensure everything works.