

Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`?

Herb Sutter, in his "atomic<> weapons" talk, shows several example uses of atomics, and one of them boils down to the following: (video link, timestamped)

  • A main thread launches several worker threads.

  • Workers check the stop flag:

     while (!stop.load(std::memory_order_relaxed)) { /* Do stuff. */ }
  • The main thread eventually does stop = true; (note: using order = seq_cst), then joins the workers.
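
Putting the bullets together, a minimal self-contained version of the example (names and scaffolding filled in for illustration; the slides only show fragments) could look like:

#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> stop{false};

void worker() {
    while (!stop.load(std::memory_order_relaxed)) {
        // Do stuff.
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)      // thread count chosen arbitrarily
        workers.emplace_back(worker);

    stop = true;                     // plain assignment: a seq_cst store by default
    for (auto& t : workers)
        t.join();
}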

Sutter explains that checking the flag with order = relaxed is OK, because who cares if the thread stops with a slightly bigger delay.

But why does stop = true; in the main thread use seq_cst? The slide says that it's purposefully not relaxed, but doesn't explain why.

It looks like it would work, possibly with a larger stopping delay.

Is it a compromise between performance and how fast other threads see the flag? I.e., since the main thread only sets the flag once, we might as well use the strongest ordering, to get the message across as quickly as possible?

mo_relaxed is fine for both load and store of a stop flag

There's also no meaningful latency benefit to stronger memory orders, even if latency of seeing a change to a keep_running or exit_now flag was important.

IDK why Herb thinks stop.store shouldn't be relaxed; in his talk, his slides have a comment that says // not relaxed on the assignment, but he doesn't say anything about the store side before moving on to "is it worth it".

Of course, the load runs inside the worker loop, but the store runs only once, and Herb really likes to recommend sticking with SC unless you have a performance reason that truly justifies using something else. I hope that wasn't his only reason; I find that unhelpful when trying to understand what memory order would actually be necessary and why. But anyway, I think it's either that or a mistake on his part.


The ISO C++ standard doesn't say anything about how soon stores become visible or what might influence that, just two "should" recommendations: section 6.9.2.3 Forward progress

18. An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

And 33.5.4 Order and consistency [atomics.order], covering only atomics, not mutexes etc.:

11. Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

Another thread can loop arbitrarily many times before its load actually sees this store value, even if they're both seq_cst, assuming there's no other synchronization of any kind between them. Low inter-thread latency is a performance issue, not correctness / formal guarantee.

And non-infinite inter-thread latency is apparently only a "should" QOI (quality of implementation) issue. :P Nothing in the standard suggests that seq_cst would help on an implementation where store visibility could be delayed indefinitely, although one might guess that could be the case, e.g. on a hypothetical implementation with explicit cache flushes instead of cache coherency. (Although such an implementation is probably not practically usable in terms of performance with CPUs anything like what we have now; every release and/or acquire operation would have to flush the whole cache.)

On real hardware (which uses some form of MESI cache coherency), different memory orders for store or load don't make stores visible sooner in real time; they just control whether later operations can become globally visible while still waiting for the store to commit from the store buffer to L1d cache. (After invalidating any other copies of the line.)

Stronger orders, and barriers, don't make things happen sooner in an absolute sense; they just delay other things until they're allowed to happen relative to the store or load. (This is the case on all real-world CPUs AFAIK; they always try to make stores visible to other cores ASAP anyway, so the store buffer doesn't fill up, and ...)

See also (my similar answers on):

The second Q&A is about x86, where commit from the store buffer to L1d cache is in program order. That limits how far past a cache-miss store execution can get, and also any possible benefit of putting a release or seq_cst fence after the store to prevent later stores (and loads) from maybe competing for resources. (x86 microarchitectures will do RFO (read for ownership) before stores reach the head of the store buffer, and plain loads normally compete for resources to track RFOs we're waiting for a response to.) But these effects are extremely minor in terms of something like exiting another thread; only very small-scale reordering.


because who cares if the thread stops with a slightly bigger delay.

More like, who cares if the thread gets more work done, by not making loads/stores after the load wait for the check to complete. (Of course, this work will get discarded if it's in the shadow of a mis-speculated branch on the load result when we eventually load true.) The cost of rolling back to a consistent state after a branch mispredict is more or less independent of how much already-executed work had happened beyond the mispredicted branch. And it's a stop flag, so the total amount of wasted work costing cache/memory bandwidth for other CPUs is pretty minimal.

That phrasing makes it sound like an acquire load or release store would actually get the store seen sooner in absolute real time, rather than just relative to other code in this thread. (Which is not the case.)

The benefit is more instruction-level and memory-level parallelism across loop iterations when the load produces a false. And simply avoiding running extra instructions on ISAs where an acquire or especially an SC load needs extra instructions, especially expensive 2-way barrier instructions (like PowerPC isync/sync, or especially the ARMv7 dmb ish full barrier even for acquire), unlike ARMv8 ldapr or x86 mov acquire-load instructions. (Godbolt)
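
As a concrete illustration of that cost (a sketch to compare on a compiler explorer, not the snippet from the linked Godbolt):

#include <atomic>

std::atomic<bool> stop{false};

// Relaxed: a plain load on x86, ARM, and PowerPC; nothing extra per iteration.
bool poll_relaxed() {
    return stop.load(std::memory_order_relaxed);
}

// Acquire / seq_cst: same value, but on weakly ordered ISAs (ARMv7, PowerPC) the
// compiler adds barrier instructions that a polling loop pays for on every
// iteration; on x86 even a seq_cst load is still just a mov.
bool poll_seq_cst() {
    return stop.load(std::memory_order_seq_cst);
}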


BTW, Herb is right that the dirty flag can also be relaxed, only because of the thread.join sync between the reader and any possible writer. Otherwise yeah, release / acquire.

But in this case, dirty only needs to be atomic<> at all because of possible simultaneous writers all storing the same value, which ISO C++ still deems data-race UB, e.g. because of the theoretical possibility of hardware race-detection that traps on conflicting non-atomic accesses. (Or a software implementation like clang -fsanitize=thread.)
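
A sketch of that reasoning (assumed structure, not the exact code from the talk): the only thing that guarantees main sees the workers' dirty stores is the happens-before edge created by thread::join, so relaxed is enough on both sides:

#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> stop{false};
std::atomic<bool> dirty{false};  // atomic only because several workers may store it concurrently

void worker() {
    while (!stop.load(std::memory_order_relaxed)) {
        // ... do some work; pretend it modified shared state:
        dirty.store(true, std::memory_order_relaxed);  // every writer stores the same value
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(worker);

    stop.store(true, std::memory_order_relaxed);
    for (auto& t : workers)
        t.join();                    // join() synchronizes-with each worker finishing

    if (dirty.load(std::memory_order_relaxed)) {
        // Guaranteed to see the workers' stores: everything they did
        // happens-before the joins, so no acquire/release is needed here.
    }
}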


Fun fact: C++20 introduced std::stop_token for use as a stop or keep_running flag.
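
A sketch of what that can look like with std::jthread, which wires a stop source up to the callable automatically:

#include <stop_token>
#include <thread>

int main() {
    std::jthread worker([](std::stop_token st) {
        while (!st.stop_requested()) {
            // Do stuff.
        }
    });

    worker.request_stop();  // or just let jthread's destructor do it
}                           // ~jthread(): request_stop(), then join()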

First of all, stop.store(true, mo_relaxed) would be enough in this context.

launch_workers();
stop = true;  // not relaxed
join_workers();

why does stop = true; in the main thread use seq_cst?

Herb does not mention the reason why he uses mo_seq_cst, but let's look at a few possibilities.

  • Based on the "not relaxed" comment, he is worried that stop.store(true, mo_relaxed) can be re-ordered with launch_workers() or join_workers().
    Since launch_workers() is a release operation and join_workers() is an acquire operation, the ordering constraints for both will not prevent the store from moving in either direction.
    However, it is important to notice that for this scenario, it does not really matter whether the store to stop uses mo_relaxed or mo_seq_cst. Even with the strongest ordering, mo_seq_cst (which, in the absence of other SC operations, is no stronger than mo_release), the ordering rules still allow the re-ordering with join_workers().
    Of course this reordering isn't going to happen, but my point is that stronger ordering constraints on the store aren't going to make a difference.

  • He could make the argument that a sequentially consistent (SC) store is an advantage since the thread performing the relaxed load will pick up on the new value sooner (an SC store flushes the store buffer).
    But this seems hardly relevant because the store is in between creating and joining threads, which is not in a tight loop, or as Herb puts it: "..is it in a performance-critical region of code where this overhead matters?.."
    He also says about the load: "..you don't care when it arrives.." (A snippet comparing the two stores follows this list.)
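
For what it's worth, the store-side difference being weighed here can also be seen on a compiler explorer (a sketch; on x86, compilers typically emit a plain mov for the relaxed store and an xchg, or mov plus mfence, for the seq_cst one):

#include <atomic>

std::atomic<bool> stop{false};

// Plain store; on x86 a simple mov.
void set_relaxed() {
    stop.store(true, std::memory_order_relaxed);
}

// Seq_cst store; on x86 typically xchg (or mov + mfence), which waits for the
// store buffer to drain before later operations -- it does not make the store
// itself visible to other cores any sooner.
void set_seq_cst() {
    stop.store(true, std::memory_order_seq_cst);
}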

We don't know the real reason, but it is possibly based on the programming convention that you don't use explicit ordering parameters (i.e. you take the default, mo_seq_cst) unless it makes a difference, and in this case, as Herb explains, only the relaxed load makes a difference.

For example, on the weakly ordered PowerPC platform, a load(mo_seq_cst) uses both the (expensive) sync and (less expensive) isync instructions, a load(mo_acquire) still uses isync, and a load(mo_relaxed) uses none of them. In a tight loop, that is a good optimization.
Also worth mentioning is that on the mainstream X86 platform, there is no real difference in performance between load(mo_seq_cst) and load(mo_relaxed).

Personally, I favor this programming style where ordering parameters are omitted when they don't matter and used when they make a difference.

stop.store(true); // ordering irrelevant, but uses SC
stop.store(true, memory_order_seq_cst); // store requires SC ordering (which is rare)

It's only a matter of style; for both stores, the compiler will generate the same assembly.
