
Any operation/fence available weaker than release but still offering synchronize-with semantics?

std::memory_order_release and std::memory_order_acquire operations provide the synchronize-with semantic.

In addition to that, std::memory_order_release guarantees that all loads and stores can't be reordered past the release operation.

Questions:

  1. Is there anything in C++20/23 that provides the same synchronize-with semantic but isn't as strong as std::memory_order_release, such that loads can be reordered past the release operation? The hope is that the out-of-order code is better optimized (by the compiler or by the CPU).
  2. Assuming there is no such thing in C++20/23, is there any non-standard way to do so (e.g. some inline asm) for x86 on Linux?

ISO C++ only has three orderings that apply to stores: relaxed, release and seq_cst. Relaxed is clearly too weak, and seq_cst is strictly stronger than release. So, no.

The property that neither loads nor stores may be reordered past a release store is necessary to provide the synchronize-with semantics that you want, and can't be weakened in any way I can think of without breaking them. The point of synchronize-with is that a release store can be used as the end of a critical section. Operations within that critical section, both loads and stores, have to stay there.

Consider the following code:

#include <atomic>
#include <iostream>

std::atomic<bool> go{false};
int crit = 17;

void thr1() {
    int tmp = crit;    // load of crit, before the release store
    go.store(true, std::memory_order_release);
    std::cout << tmp << std::endl;
}

void thr2() {
    while (!go.load(std::memory_order_acquire)) {
        // delay
    }
    crit = 42;         // happens-after thr1's load of crit, so no race
}

This program is free of data races and must output 17. This is because the release store in thr1 synchronizes with the final acquire load in thr2, the one that returns true (thus taking its value from the store). This implies that the load of crit in thr1 happens-before the store in thr2, so they don't race, and the load does not observe the store.

If we replaced the release store in thr1 with your hypothetical half-release store, such that the load of crit could be reordered after go.store(true, half_release), then that load might take place any amount of time later. In particular, it could happen concurrently with, or even after, the store of crit in thr2. So it could read 42, or garbage, or anything else could happen. This should not be possible if go.store(true, half_release) really did synchronize with go.load(acquire).

ISO C++

In ISO C++, no: release is the minimum for the writer side of doing some (possibly non-atomic) stores and then storing a data_ready flag. Or for locking / mutual exclusion, to keep loads before a release store and stores after an acquire load (no LoadStore reordering). Or anything else happens-before gives you. (C++'s model works in terms of guarantees on what a load can or must see, not in terms of local reordering of loads and stores from a coherent cache. I'm talking about how they're mapped into asm for normal ISAs.) acq_rel RMWs, or seq_cst stores or RMWs, also work, but are stronger than release.
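The writer-side pattern described above can be sketched like this (the names payload and data_ready are illustrative, not from the question):

```cpp
#include <atomic>
#include <thread>

int payload;                        // plain, non-atomic data
std::atomic<bool> data_ready{false};

void producer() {
    payload = 42;                   // non-atomic store
    // release keeps the payload store before the flag store
    data_ready.store(true, std::memory_order_release);
}

int consumer() {
    // spin until published; acquire pairs with the release store
    while (!data_ready.load(std::memory_order_acquire)) {
    }
    return payload;                 // guaranteed to see 42, no data race
}

int run_once() {
    std::thread t(producer);
    int v = consumer();
    t.join();
    return v;
}
```

Any weakening that lets the payload store sink past the flag store breaks this pattern, which is exactly the argument made above.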


Asm with weaker guarantees that might be sufficient for some cases

In asm for some platform, perhaps there might be something weaker you could do, but it wouldn't be fully happens-before. I don't think there are any requirements on release which are superfluous to happens-before and normal acq/rel synchronization (https://preshing.com/20120913/acquire-and-release-semantics/).

Some common use cases for acq/rel sync only need StoreStore ordering on the writer side, LoadLoad on the reader side (e.g. producer / consumer with one-way communication: non-atomic stores and a data_ready flag). Without the LoadStore ordering requirement, I could imagine either the writer or reader being cheaper on some platforms.

Perhaps PowerPC or RISC-V? I checked what compilers do on Godbolt for a.load(acquire) and a.store(1, release).

# clang(trunk) for RISC-V -O3
load(std::atomic<int>&):     # acquire
        lw      a0, 0(a0)    # apparently RISC-V just has barriers, not acquire *operations*
        fence   r, rw        # but the barriers do let you block only what is necessary
        ret
store(std::atomic<int>&):    # release
        fence   rw, w
        li      a1, 1
        sw      a1, 0(a0)
        ret

If fence r and/or fence w exist and are ever cheaper than fence r, rw or fence rw, w, then yes, RISC-V can do something slightly cheaper than acq/rel. Unless I'm missing something, that would still be strong enough if you just want loads after an acquire load to see stores from before a release store, but don't care about LoadStore: other loads staying before a release store, and other stores staying after an acquire load.

CPUs naturally want to load early and store late to hide latencies, so it's usually not much of a burden to actually block LoadStore reordering on top of blocking LoadLoad or StoreStore. At least that's true for an ISA as long as it's possible to get the ordering you need without having to use a much stronger barrier. (The exception is when the only option that meets the minimum requirement goes far beyond it, like 32-bit ARMv7 where you'd need a dmb ish full barrier that also blocks StoreLoad.)
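If a StoreStore-only fence is ever cheaper, a non-portable sketch of the writer side might look like this. fence w, w is a valid RISC-V encoding; the #if guard and the stronger fallback for other targets are my assumptions, added only so the sketch compiles everywhere:

```cpp
#include <atomic>

// Non-portable sketch: order earlier stores before later stores (StoreStore),
// without the LoadStore guarantee a release store gives.
inline void storestore_fence() {
#if defined(__riscv)
    asm volatile("fence w, w" ::: "memory");  // StoreStore only, no LoadStore
#else
    // Fallback for non-RISC-V targets: a standard (stronger) release fence.
    std::atomic_thread_fence(std::memory_order_release);
#endif
}

int payload;
std::atomic<bool> flag{false};

void publish() {
    payload = 42;                                 // plain data store
    storestore_fence();                           // keep it before the flag store
    flag.store(true, std::memory_order_relaxed);  // flag store, no extra ordering
}
```

Whether this is ever actually cheaper than fence rw, w on real hardware is exactly the open question above.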


release is free on x86; other ISAs are more interesting

memory_order_release is basically free on x86, only needing to block compile-time reordering. (See "C++ - How is release-and-acquire achieved on x86 only using MOV?" - the x86 memory model is program order plus a store buffer with store-forwarding.)
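As a sketch of what "only blocking compile-time reordering" means: on x86 a release store compiles to the same plain mov as a relaxed store. The hand-rolled version below (GNU-style inline asm assumed) is shown only for comparison; portable code should just use release:

```cpp
#include <atomic>

std::atomic<int> ready{0};
int data;

// On x86 this compiles to two plain mov stores: no fence instruction needed,
// the memory order only restricts the compiler.
void store_release() {
    data = 1;
    ready.store(1, std::memory_order_release);
}

// Rough hand-rolled near-equivalent on x86: a compiler-only barrier plus a
// relaxed store. Not a substitute for release in portable code.
void store_relaxed_barrier() {
    data = 1;
    asm volatile("" ::: "memory");  // blocks compile-time reordering only
    ready.store(1, std::memory_order_relaxed);
}
```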

x86 is a silly choice to ask about; something like PowerPC, where there are multiple different choices of light-weight barrier, would be more interesting. It turns out PowerPC only needs one barrier each for acquire and release, but seq_cst needs multiple different barriers before and after.

PowerPC asm looks like this for load(acquire) and store(1, release):

load(std::atomic<int>&):
        lwz %r3,0(%r3)
        cmpw %cr0,%r3,%r3     #; I think for a data dependency on the load
        bne- %cr0,$+4         #; never-taken, if I'm reading this right?
        isync                 #; instruction sync, blocking the front-end until older instructions retire?
        blr
store(std::atomic<int>&):
        li %r9,1
        lwsync               # light-weight sync = LoadLoad + StoreStore + LoadStore.  (But not blocking StoreLoad)
        stw %r9,0(%r3)
        blr

I don't know if isync is always cheaper than lwsync, which I'd think would also work there; I'd have thought stalling the front-end might be worse than imposing some ordering on loads and stores.

I suspect the reason for the compare-and-branch instead of just isync (documentation) is that a load can retire from the back-end ("complete") once it's known to be non-faulting, before the data actually arrives.

(x86 doesn't do this, but weakly-ordered ISAs do; it's how you get LoadStore reordering on CPUs like ARM, with in-order or out-of-order exec. Retirement goes in program order, but stores can't commit to L1d cache until after they retire. x86 requiring loads to produce a value before they can retire is one way to guarantee LoadStore ordering. See "How is load->store reordering possible with in-order commit?")

So on PowerPC, the compare into condition-register 0 (%cr0) has a data dependency on the load, so it can't execute until the data arrives, and thus can't complete. I don't know why there's also an always-false branch on it. I think the $+4 branch destination is the isync instruction, in case that matters. I wonder if the branch could be omitted if you only need LoadLoad, not LoadStore? Unlikely.


IDK if ARMv7 can maybe block just LoadLoad or StoreStore. If so, that would be a big win over dmb ish, which compilers use because they also need to block LoadStore.


Loads cheaper than acquire: memory_order_consume

This is the useful hardware feature that ISO C++ doesn't currently expose (because std::memory_order_consume is defined in a way that's too hard for compilers to implement correctly in every corner case without introducing more barriers; thus it's deprecated, and compilers handle it the same as acquire).

Dependency ordering (on all CPUs except DEC Alpha) makes it safe to load a pointer and deref it without any barriers or special load instructions, and still see the pointed-to data if the writer used a release store.
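The pattern consume was meant for is pointer publication. Since compilers currently promote consume to acquire, this sketch (Node and the 42 payload are illustrative) is correct but no cheaper than an acquire load today:

```cpp
#include <atomic>

struct Node { int value; };

std::atomic<Node*> head{nullptr};

// Writer: fill in the node first, then publish the pointer with a release store.
void publish(Node* n) {
    n->value = 42;
    head.store(n, std::memory_order_release);
}

// Reader: the dereference carries a data dependency on the loaded pointer,
// which is what dependency ordering exploits for free on most CPUs.
// Compilers currently treat consume as acquire, so this is safe but not cheaper.
int read_value() {
    Node* n = head.load(std::memory_order_consume);
    return n ? n->value : -1;
}
```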

If you want to do something cheaper than ISO C++ acq/rel, the load side is where the savings are on ISAs like POWER and ARMv7. (Not x86; full acquire is free.) To a much lesser extent on ARMv8, I think, as ldapr should be cheapish.

See "C++11: the difference between memory_order_relaxed and memory_order_consume" for more, including a talk from Paul McKenney about how Linux uses plain loads (effectively relaxed) to make the read side of RCU very cheap, with no barriers, as long as they're careful not to write code where the compiler can optimize away the data dependency into just a control dependency or nothing.

