
x86 mfence and C++ memory barrier

I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The code below is what I'm testing, using GCC 8.3 for x86-64.

std::atomic<bool> flag {false};
int any_value {0};

void set()
{
  any_value = 10;
  flag.store(true, std::memory_order_release);
}

void get()
{
  while (!flag.load(std::memory_order_acquire));
  assert(any_value == 10);
}

int main()
{
  std::thread a {set};
  get();
  a.join();
}

When I use std::memory_order_seq_cst, I can see the MFENCE instruction is emitted at every optimization level (-O1, -O2, -O3). This instruction makes sure the store buffer is flushed, so the data is committed to L1d cache (with the MESI protocol making sure other cores can see the effect).
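For reference, this is a minimal sketch of the seq_cst variant being described (the `_sc`-suffixed names are ours, not from the question; the asm comment reflects what GCC 8.3 typically emits for x86-64 at -O1 and above):

```cpp
#include <atomic>

std::atomic<bool> flag_sc{false};
int any_value_sc{0};

// seq_cst variant of set(): same stores, but the release store
// is replaced by a sequentially-consistent one.
void set_sc()
{
    any_value_sc = 10;
    // GCC compiles this to roughly:
    //   mov DWORD PTR any_value_sc[rip], 10
    //   mov BYTE  PTR flag_sc[rip], 1
    //   mfence                       ; full barrier after the store
    flag_sc.store(true, std::memory_order_seq_cst);
}
```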

However, when I use std::memory_order_release/acquire with no optimizations, the MFENCE instruction is also emitted, but it is omitted at -O1, -O2, -O3, and I don't see any other instruction that flushes the buffers.

In the case where MFENCE is not used, what makes sure the store-buffer data is committed to cache memory to ensure the memory-ordering semantics?

Below is the assembly code for the get/set functions with -O3, as produced on the Godbolt compiler explorer:

set():
        mov     DWORD PTR any_value[rip], 10
        mov     BYTE PTR flag[rip], 1
        ret


.LC0:
        .string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
        .string "any_value == 10"

get():
.L8:
        movzx   eax, BYTE PTR flag[rip]
        test    al, al
        je      .L8
        cmp     DWORD PTR any_value[rip], 10
        jne     .L15
        ret
.L15:
        push    rax
        mov     ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
        mov     edx, 17
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        call    __assert_fail

The x86 memory-ordering model provides #StoreStore and #LoadStore barriers for all store instructions (1), which is all that release semantics require. Also, the processor will commit a store instruction as soon as possible: when the store instruction retires, the store becomes the oldest in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation (2). So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible, and when it does, any_value is guaranteed to be 10.

On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both barriers (3), and so it is used at all optimization levels.
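The #StoreLoad requirement can be seen with the classic Dekker-style litmus test (our own example, not from the answer): each thread stores to one variable and then loads the other. With seq_cst, neither load may be reordered before the store in the same thread, so at most one thread can read 0; with release/acquire or relaxed, x86's store buffer would permit both threads to read 0.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1 = -1, r2 = -1;

void run_litmus()
{
    std::thread t1([] {
        X.store(1, std::memory_order_seq_cst);
        r1 = Y.load(std::memory_order_seq_cst);  // cannot move before the store
    });
    std::thread t2([] {
        Y.store(1, std::memory_order_seq_cst);
        r2 = X.load(std::memory_order_seq_cst);  // cannot move before the store
    });
    t1.join();
    t2.join();
    // seq_cst forbids the outcome r1 == 0 && r2 == 0.
    // With memory_order_release/acquire, StoreLoad reordering
    // (the store buffer) could let both loads see 0.
}
```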

Related: Size of store buffers on Intel hardware? What exactly is a store buffer?


Footnotes:

(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory type provide only the #LoadStore barrier. In any case, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.

(2) This is in contrast to write-combining stores, which are made globally visible under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.

(3) See the discussion under Peter's answer.

x86's TSO memory model is sequential consistency + a store buffer, so only seq-cst stores need any special fencing. (Stalling after a store until the store buffer drains, before any later loads, is all we need to recover sequential consistency.) The weaker acq/rel model is compatible with the StoreLoad reordering caused by a store buffer.

(See discussion in comments re: whether "allowing StoreLoad reordering" is an accurate and sufficient description of what x86 allows. A core always sees its own stores in program order because loads snoop the store buffer, so you could say that store-forwarding also reorders loads of recently-stored data. Except you can't always: Globally Invisible load instructions.)

(And BTW, compilers other than GCC use xchg to do a seq-cst store. This is actually more efficient on current CPUs. GCC's mov + mfence might have been cheaper in the past, but is currently usually worse even if you don't care about the old value. See Why does a std::atomic store with sequential consistency use XCHG? for a comparison between GCC's mov+mfence vs. xchg. Also my answer on Which is a better write barrier on x86: lock+addl or xchgl?)

Fun fact: you can achieve sequential consistency by fencing seq-cst loads instead of stores. But cheap loads are much more valuable than cheap stores for most use-cases, so everyone uses ABIs where the full barriers go on the stores.
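A rough sketch of that alternative mapping, in terms of C++ fences (an illustration under our own naming, not a drop-in replacement: every thread accessing the object would have to use the same convention, since it's effectively a different ABI):

```cpp
#include <atomic>

std::atomic<int> data_sc{0};

// Cheap store: a plain release store, which on x86 is a plain mov
// with no fence at all.
void seq_store_cheap(int v)
{
    data_sc.store(v, std::memory_order_release);
}

// Expensive load: the full barrier goes before the load instead of
// after the store, taking the place of the mfence (or xchg) that the
// usual mapping puts on every seq_cst store.
int seq_load_expensive()
{
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return data_sc.load(std::memory_order_acquire);
}
```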

See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for details of how C++11 atomic ops map to asm instruction sequences for x86, PowerPC, ARMv7, ARMv8, and Itanium. Also: When are x86 LFENCE, SFENCE and MFENCE instructions required?


when I use std::memory_order_release/acquire with no optimizations MFENCE instruction is also used

That's because flag.store(true, std::memory_order_release); doesn't inline, because you disabled optimization. That includes inlining of very simple member functions like atomic<T>::store(T, std::memory_order = std::memory_order_seq_cst).

When the ordering parameter to the __atomic_store_n() GCC builtin is a runtime variable (in the atomic<T>::store() header implementation), GCC plays it conservative and promotes it to seq_cst.
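You can reproduce the situation with a wrapper whose ordering argument is only known at run time (a hypothetical helper of ours, mirroring the un-inlined atomic<T>::store() case):

```cpp
#include <atomic>

// 'mo' is a runtime value here, just like the std::memory_order parameter
// of atomic<T>::store() before inlining. GCC can't prove at compile time
// that mo is weaker than seq_cst, so it conservatively emits the full
// seq_cst sequence (including mfence) for the store.
void store_with(std::atomic<bool>& f, bool v, std::memory_order mo)
{
    f.store(v, mo);
}
```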

It might actually be worth it for GCC to branch over mfence because it's so expensive, but that's not what we get. (But that would make larger code-size for functions with runtime-variable order params, and the code path might not be hot. So branching is probably only a good idea in the libatomic implementation, or with profile-guided optimization for rare cases where a function is large enough to not inline but takes a variable order.)
