
Optimizations around atomic load stores in C++

I have read about std::memory_order in C++ and understand it partially, but I still have some doubts about it.

  1. The explanation of std::memory_order_acquire says that no reads or writes in the current thread can be reordered before this load. Does that mean the compiler and the CPU are not allowed to move any instruction that appears below the acquire statement to above it?
auto y = x.load(std::memory_order_acquire);
z = a;  // is it legal to execute the load of shared `a` above the acquire? (I feel no)
b = 2;  // is it legal to execute the store to shared `b` above the acquire? (I feel yes)

I can reason out why it is illegal to execute loads before the acquire. But why is it illegal for stores?

  2. Is it illegal to skip a useless load or store on atomic objects? They are not volatile, and as far as I know only volatile has this requirement.
auto y = x.load(std::memory_order_acquire);  // `y` is never used
return;

This optimization does not happen even with relaxed memory order.

  3. Is the compiler allowed to move instructions that appear above the acquire statement to below it?
z = a;  // is it legal to execute the load of shared `a` below the acquire? (I feel yes)
b = 2;  // is it legal to execute the store to shared `b` below the acquire? (I feel yes)
auto y = x.load(std::memory_order_acquire);

  4. Can two loads or stores be reordered with each other without crossing the acquire boundary?
auto y = x.load(std::memory_order_acquire);
a = p;  // can this move below the next line?
b = q;  // `a` and `b` are shared

I have the same four doubts, correspondingly, about release semantics.

Related to the 2nd and 3rd questions: why does no compiler optimize f() as aggressively as g() in the code below?

#include <atomic>

int a, b;

void dummy(int*);

void f(std::atomic<int> &x) {
    int z;
    z = a;  // loading shared `a` before acquire
    b = 2;  // storing shared `b` before acquire
    auto y = x.load(std::memory_order_acquire);
    z = a;  // loading shared `a` after acquire
    b = 2;  // storing shared `b` after acquire
    dummy(&z);
}

void g(int &x) {
    int z;
    z = a;
    b = 2;
    auto y = x;
    z = a;
    b = 2;
    dummy(&z);
}

f(std::atomic<int>&):
        sub     rsp, 24
        mov     eax, DWORD PTR a[rip]
        mov     DWORD PTR b[rip], 2
        mov     DWORD PTR [rsp+12], eax
        mov     eax, DWORD PTR [rdi]
        lea     rdi, [rsp+12]
        mov     DWORD PTR b[rip], 2
        mov     eax, DWORD PTR a[rip]
        mov     DWORD PTR [rsp+12], eax
        call    dummy(int*)
        add     rsp, 24
        ret
g(int&):
        sub     rsp, 24
        mov     eax, DWORD PTR a[rip]
        mov     DWORD PTR b[rip], 2
        lea     rdi, [rsp+12]
        mov     DWORD PTR [rsp+12], eax
        call    dummy(int*)
        add     rsp, 24
        ret
b:
        .zero   4
a:
        .zero   4

Q1

Generally, yes. Any load or store that follows an acquire load in program order must not become visible before it.

Here is an example where it matters:

#include <atomic>
#include <thread>
#include <iostream>

std::atomic<int> x{0};
std::atomic<bool> finished{false};
int xval;
bool good;

void reader() {
    xval = x.load(std::memory_order_relaxed);
    finished.store(true, std::memory_order_release);
}

void writer() {
    good = finished.load(std::memory_order_acquire);
    x.store(42, std::memory_order_relaxed);
}

int main() {
    std::thread t1(reader);
    std::thread t2(writer);
    t1.join();
    t2.join();
    if (good) {
        std::cout << xval << std::endl;
    } else {
        std::cout << "too soon" << std::endl;
    }
    return 0;
}

Try on godbolt

This program is free of UB and must print either 0 or too soon. If the writer's store of 42 to x could be reordered before its load of finished, then the reader's load of x could return 42 while the writer's load of finished returns true, in which case the program would improperly print 42.

Q2

Yes, a compiler may delete an atomic load whose value is never used, since a conforming program has no way to detect whether the load was done. However, current compilers generally don't perform such optimizations. This is partly out of an abundance of caution, because optimizations on atomics are hard to get right and the bugs can be very subtle. It may also be partly to support programmers writing implementation-dependent code that can detect, via non-standard features, whether the load was done.

Q3

Yes, this reordering is perfectly legal, and real-world architectures will do it. An acquire barrier works in one direction only: it keeps later accesses from moving above it, but it does not keep earlier accesses from moving below it.
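
To illustrate the one-way behavior together with its release counterpart, here is a minimal message-passing sketch. The names payload, ready, producer, consumer and run_message_passing are hypothetical and do not come from the question. The release store keeps the earlier write to payload from sinking below it, and the acquire load keeps the later read of payload from hoisting above it; neither one constrains movement in the other direction.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration only; none of them appear in the question.
int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes `payload`

void producer() {
    payload = 42;                                  // earlier access: may not sink below the release store
    ready.store(true, std::memory_order_release);  // release: one-way barrier for what precedes it
}

int consumer() {
    // acquire: one-way barrier for what follows it
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // Later access: may not hoist above the acquire load. Because the acquire
    // load read the value written by the release store, this read sees 42.
    return payload;
}

int run_message_passing() {
    // Reset state before the threads start; thread creation orders these writes.
    payload = 0;
    ready.store(false, std::memory_order_relaxed);

    int seen = 0;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;  // safe to read after join()
}
```

Calling run_message_passing() from main should return 42. Accesses placed before the acquire load in consumer, or after the release store in producer, would get no such ordering guarantee.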

Q4

Yes, this is also legal. If a and b are not atomic and some other thread is reading them concurrently, then the code has a data race and is UB, so it is okay if the other thread observes the writes happening in the wrong order (or summons nasal demons). (If they are atomic and you are doing relaxed stores, then you can't get nasal demons, but you can still observe the stores out of order; there is no happens-before relationship mandating the contrary.)
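
The atomic case can be sketched as follows. This assumes a and b are made std::atomic<int>, which is not how the question declares them, and run_once is a hypothetical helper. Every access is relaxed, so there is no data race, but no ordering is imposed between the two stores or between the two loads.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Assumption: `a` and `b` are atomic here, unlike the plain ints in the question.
std::atomic<int> a{0};
std::atomic<int> b{0};

// Hypothetical helper: runs one writer and one observer, returns {seen_a, seen_b}.
std::pair<int, int> run_once() {
    a.store(0, std::memory_order_relaxed);
    b.store(0, std::memory_order_relaxed);

    int seen_a = -1;
    int seen_b = -1;

    std::thread writer([] {
        a.store(1, std::memory_order_relaxed);
        b.store(1, std::memory_order_relaxed);  // may become visible before the store to `a`
    });
    std::thread observer([&] {
        seen_b = b.load(std::memory_order_relaxed);
        seen_a = a.load(std::memory_order_relaxed);  // may be satisfied before the load of `b`
    });

    writer.join();
    observer.join();

    // All four combinations of 0 and 1 are permitted, including seen_b == 1
    // with seen_a == 0. Which ones actually occur depends on the compiler
    // and the hardware.
    return {seen_a, seen_b};
}
```

The only thing a caller can rely on is that each returned value is 0 or 1. Changing the store to b to release and the load of b to acquire would rule out the combination seen_b == 1 with seen_a == 0.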

Optimization comparison

Your f versus g example is not really a fair comparison: in g, the load of the non-atomic variable x has no side effects and its value is not used, so the compiler omits it altogether. As mentioned above, the compiler doesn't omit the unnecessary atomic load of x in f.

As to why compilers don't sink the first accesses to a and b past the acquire load: I believe it's simply a missed optimization. Again, most compilers deliberately don't try to do all possible legal optimizations with atomics. However, note that on ARM64, for instance, the load of x in f compiles to ldar, which the CPU can definitely reorder with earlier plain loads and stores.
