Optimizations around atomic loads and stores in C++
I have read about std::memory_order in C++ and partially understood it, but I still have some doubts.
1. The description of std::memory_order_acquire says that no reads or writes in the current thread can be reordered before this load. Does that mean the compiler and CPU are not allowed to move any instruction that appears below the acquire statement to above it?
auto y = x.load(std::memory_order_acquire);
z = a; // is it legal to execute the load of shared `a` above the acquire? (I feel no)
b = 2; // is it legal to execute the store to shared `b` above the acquire? (I feel yes)
I can see why it would be illegal to execute later loads before the acquire. But why is it illegal for stores?
2. Is it illegal to skip a useless load or store on atomic objects? They are not volatile, and as far as I know only volatile has this requirement.
auto y = x.load(std::memory_order_acquire); // `y` is never used
return;
This optimization does not happen even with relaxed memory order.
3. Is it legal to move instructions that appear above the acquire statement to below it?
z = a; // is it legal to execute the load of shared `a` below the acquire? (I feel yes)
b = 2; // is it legal to execute the store to shared `b` below the acquire? (I feel yes)
auto y = x.load(std::memory_order_acquire);
4. Can two loads or stores be reordered with each other without crossing the acquire boundary?
auto y = x.load(std::memory_order_acquire);
a = p; // can this move below the next line?
b = q; // `a` and `b` are shared
I have the same four doubts about release semantics.
Related to the 2nd and 3rd questions: why does no compiler optimize f() as aggressively as g() in the code below?
#include <atomic>
int a, b;
void dummy(int*);
void f(std::atomic<int> &x) {
int z;
z = a; // loading shared `a` before acquire
b = 2; // storing shared `b` before acquire
auto y = x.load(std::memory_order_acquire);
z = a; // loading shared `a` after acquire
b = 2; // storing shared `b` after acquire
dummy(&z);
}
void g(int &x) {
int z;
z = a;
b = 2;
auto y = x;
z = a;
b = 2;
dummy(&z);
}
f(std::atomic<int>&):
sub rsp, 24
mov eax, DWORD PTR a[rip]
mov DWORD PTR b[rip], 2
mov DWORD PTR [rsp+12], eax
mov eax, DWORD PTR [rdi]
lea rdi, [rsp+12]
mov DWORD PTR b[rip], 2
mov eax, DWORD PTR a[rip]
mov DWORD PTR [rsp+12], eax
call dummy(int*)
add rsp, 24
ret
g(int&):
sub rsp, 24
mov eax, DWORD PTR a[rip]
mov DWORD PTR b[rip], 2
lea rdi, [rsp+12]
mov DWORD PTR [rsp+12], eax
call dummy(int*)
add rsp, 24
ret
b:
.zero 4
a:
.zero 4
1. Generally, yes. Any load or store that follows an acquire load in program order must not become visible before it.
Here is an example where it matters:
#include <atomic>
#include <thread>
#include <iostream>
std::atomic<int> x{0};
std::atomic<bool> finished{false};
int xval;
bool good;
void reader() {
xval = x.load(std::memory_order_relaxed);
finished.store(true, std::memory_order_release);
}
void writer() {
good = finished.load(std::memory_order_acquire);
x.store(42, std::memory_order_relaxed);
}
int main() {
std::thread t1(reader);
std::thread t2(writer);
t1.join();
t2.join();
if (good) {
std::cout << xval << std::endl;
} else {
std::cout << "too soon" << std::endl;
}
return 0;
}
This program is free of UB and must print either 0 or too soon. If the writer's store of 42 to x could be reordered before its load of finished, then it would be possible for the reader's load of x to return 42 and the writer's load of finished to return true, in which case the program would improperly print 42.
2. Yes, it would be okay for a compiler to delete an atomic load whose value is never used, since there is no way for a conforming program to detect whether the load was done. However, current compilers generally don't perform such optimizations. This is partly out of an abundance of caution, because optimizations on atomics are hard to get right and the bugs can be very subtle. It may also be partly to support programmers writing implementation-dependent code that can detect, via non-standard features, whether the load was done.
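As a rough illustration of the point above (the function names are made up for this sketch, not taken from the question), compare a discarded atomic load with a discarded volatile read. The standard treats volatile accesses as observable behavior, so the second read has to stay; the first could be dropped under the as-if rule, even though compilers currently tend to keep it.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: the loaded value is thrown away. A conforming program
// cannot tell whether this load happened, so a compiler may remove it.
void discard_atomic(const std::atomic<int>& x) {
    (void)x.load(std::memory_order_relaxed);
}

// Hypothetical helper: a volatile read is observable behavior, so the
// compiler must perform it even though the value is unused.
void discard_volatile(const volatile int& v) {
    int tmp = v;
    (void)tmp;
}

// Single-threaded check: neither helper changes the values it reads.
void check_discards() {
    std::atomic<int> x{7};
    volatile int v = 3;
    discard_atomic(x);
    discard_volatile(v);
    assert(x.load() == 7);
    assert(v == 3);
}
```

Compiling both helpers with optimizations on and comparing the generated assembly should show whether a given compiler keeps the atomic load.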
3. Yes, this reordering is perfectly legal, and real-world architectures will do it. An acquire barrier is one-way only: it keeps later accesses from moving above it, but it does not keep earlier accesses from moving below it.
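A sketch of what "one-way" permits, reusing the question's a, b and x (the function names are assumptions for illustration). The second function is one order in which the compiler or CPU is allowed to carry out the first; in a program without data races, no thread can tell the two apart.

```cpp
#include <atomic>
#include <cassert>

int a = 0, b = 0;          // plain shared variables, as in the question
std::atomic<int> x{0};

// As written: the plain accesses come before the acquire load.
int as_written() {
    int z = a;
    b = 2;
    int y = x.load(std::memory_order_acquire);
    return z + y;
}

// One permitted execution order: the earlier accesses have sunk below the
// acquire load. The reverse (hoisting later accesses above it) is not allowed.
int as_may_execute() {
    int y = x.load(std::memory_order_acquire);
    int z = a;
    b = 2;
    return z + y;
}

// Single-threaded sanity check that both orders compute the same result.
void check_same_result() {
    assert(as_written() == as_may_execute());
    assert(b == 2);
}
```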
4. Yes, this is also legal. If a and b are not atomic and some other thread is reading them concurrently, then the code has a data race and is UB, so it is okay if the other thread observes the writes happening in the wrong order (or summons nasal demons). (If they are atomic and you are doing relaxed stores, then you can't get nasal demons, but you can still observe the stores out of order; there is no happens-before relationship mandating the contrary.)
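If the order of the two stores does matter to another thread, one common way to get it is to make the second store a release and have the observer read it with an acquire load. The sketch below assumes a and b are made atomic, and its function names are hypothetical. The release/acquire pair creates the happens-before relationship that the relaxed version lacks.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a{0}, b{0};   // assumed atomic versions of the question's a and b

// Relaxed stores: another thread may observe b == 2 while still seeing a == 0.
void store_relaxed() {
    a.store(1, std::memory_order_relaxed);
    b.store(2, std::memory_order_relaxed);
}

// Release on the second store: the store to a cannot become visible after it
// to a thread that acquires b.
void store_ordered() {
    a.store(1, std::memory_order_relaxed);
    b.store(2, std::memory_order_release);
}

// Observer, meant to be paired with store_ordered() only. Returns whether the
// release store to b was seen; if it was, a == 1 is guaranteed.
bool observe() {
    if (b.load(std::memory_order_acquire) != 2) return false;
    assert(a.load(std::memory_order_relaxed) == 1);
    return true;
}
```

In a two-thread experiment, store_ordered() would run on one thread and observe() on another; pairing observe() with store_relaxed() instead gives no such guarantee.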
Your f versus g comparison is not really fair: in g, the load of the non-atomic variable x has no side effects and its value is not used, so the compiler omits it altogether. As mentioned above, the compiler doesn't omit the unnecessary atomic load of x in f.
As to why compilers don't sink the first accesses to a and b past the acquire load: I believe it's simply a missed optimization. Again, most compilers deliberately don't attempt every legal optimization on atomics. However, note that on ARM64, for instance, the load of x in f compiles to ldar, which the CPU can definitely reorder with earlier plain loads and stores.