
Does the memory fence involve the kernel?

After asking this question, I've understood that atomic instructions, such as test-and-set, do not involve the kernel. The kernel only has to get involved, to perform the scheduling operations, when a process needs to be put to sleep (to wait to acquire the lock) or woken (because it couldn't acquire the lock before but now can).

If so, does it mean that a memory fence, such as std::atomic_thread_fence in C++11, also doesn't involve the kernel?

std::atomic doesn't involve the kernel (see footnote 1)

On almost all normal CPUs (the kind we program for in real life), memory barrier instructions are unprivileged and get used directly by the compiler, the same way compilers know how to emit instructions like x86 lock add [rdi], eax for fetch_add (or lock xadd if you use the return value). On other ISAs, fences are literally the same barrier instructions the compiler uses before/after loads, stores, and RMWs to give the required ordering. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
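As a small illustration of the fetch_add point, here is a sketch you could paste into a compiler explorer. The names counter, bump and bump_and_get_old are made up for this example, and the instructions named in the comments are what GCC and clang typically emit for x86-64, not something the standard guarantees.

```cpp
#include <atomic>

std::atomic<int> counter{0};  // hypothetical shared counter

// Result unused: on x86-64 the compiler can use a plain `lock add`.
void bump(int n) {
    counter.fetch_add(n, std::memory_order_seq_cst);
}

// Result used: the old value has to come back in a register,
// so compilers typically pick `lock xadd` instead.
int bump_and_get_old(int n) {
    return counter.fetch_add(n, std::memory_order_seq_cst);
}
```

Either way it is a single unprivileged instruction in user space; there is no system call in the generated code.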

On some arbitrary hypothetical hardware and/or compiler, anything is of course possible, even if it would be catastrophically bad for performance.

In asm, a barrier just makes this core wait until some previous (program-order) operations are visible to other cores. It's a purely local operation. (At least, this is how real-world CPUs are designed, so that sequential consistency is recoverable with only local barriers to control local ordering of load and/or store operations. All cores share a coherent view of cache, maintained via a protocol like MESI. Non-coherent shared-memory systems exist, but implementations don't run C++ std::thread across them, and they typically don't run a single-system-image kernel.)
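To make "local barriers to control local ordering" concrete, here is a minimal message-passing sketch using standalone fences. The names payload, ready, producer, consumer and run_message_passing are invented for this example. Each thread only orders its own operations; cache coherence does the rest.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag the consumer spins on

void producer() {
    payload = 42;
    // Keeps the payload store ordered before the flag store below.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        // spin in user space; no sleeping, so no kernel involvement
    }
    // Keeps the payload load ordered after the flag load above.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Hypothetical driver: runs one producer and one consumer thread.
int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

The only places the kernel shows up here are thread creation and join, not the fences themselves.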

Footnote 1: Even non-lock-free atomics usually use light-weight locking.

Also, ARM before ARMv7 apparently didn't have proper memory barrier instructions. On ARMv6, GCC uses mcr p15, 0, r0, c7, c10, 5 as a barrier. Before that (g++ -march=armv5 and earlier), GCC doesn't know what to do and calls __sync_synchronize (a GCC runtime library helper function), which hopefully is implemented somehow for whatever machine the code is actually running on. This may involve a system call on a hypothetical ARMv5 multi-core system, but more likely the binary will be running on an ARMv7 or v8 system where the library function can run a dmb ish. Or if it's a single-core system then it could be a no-op, I think. (C++ memory ordering cares about other C++ threads, not about memory order as seen by possible hardware devices / DMA. Normally implementations assume a multi-core system, but this library function might be a case where a single-core-only implementation could be used.)


On x86 for example, std::atomic_thread_fence(std::memory_order_seq_cst) compiles to mfence. Weaker barriers like std::atomic_thread_fence(std::memory_order_release) only have to block compile-time reordering; x86's runtime hardware memory model is already acq/rel (seq-cst + a store buffer), so there aren't any asm instructions corresponding to the barrier. (One possible implementation for a C++ library would be GNU C asm("" ::: "memory");, but GCC/clang do have barrier builtins.)
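A rough sketch of that last parenthetical, assuming a GNU-compatible compiler: the hypothetical helper release_fence_sketch uses the empty asm statement on x86 and falls back to the real fence everywhere else. This is only an illustration of why no instruction is needed on x86, not how any actual standard library is written; in real code just call std::atomic_thread_fence.

```cpp
#include <atomic>

int data_slot = 0;         // plain data being published
std::atomic<int> flag{0};  // set after the data is written

// Hypothetical helper for illustration only.
inline void release_fence_sketch() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // GNU C extension: emits zero instructions, but the "memory" clobber
    // stops the compiler from moving memory accesses across this point.
    asm volatile("" ::: "memory");
#else
    // Weakly-ordered ISAs need a real barrier instruction here.
    std::atomic_thread_fence(std::memory_order_release);
#endif
}

void publish(int v) {
    data_slot = v;
    release_fence_sketch();
    flag.store(1, std::memory_order_relaxed);
}
```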

std::atomic_signal_fence only ever has to block compile-time reordering, even on weakly-ordered ISAs, because all real-world ISAs guarantee that execution within a single thread sees its own operations as happening in program order. (Hardware implements this by having loads snoop the store buffer of the current core.) VLIW and IA-64 EPIC, or other explicit-parallelism ISA mechanisms (like Mill with its delayed-visibility loads), still make it possible for the compiler to generate code that respects any C++ ordering guarantees involving the barrier if an async signal (or an interrupt, for kernel code) arrives after any instruction.
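Here is a minimal sketch of the use case std::atomic_signal_fence is meant for: a thread publishing data to a signal handler that runs on that same thread. The names message, message_ready, on_signal and send_to_own_handler are made up for this example, and SIGINT is picked only because it is one of the signals ISO C++ defines.

```cpp
#include <atomic>
#include <csignal>

int message = 0;                    // plain data written before the signal
std::atomic<int> message_ready{0};  // lock-free flag, OK to touch in a handler
volatile std::sig_atomic_t seen_in_handler = -1;

extern "C" void on_signal(int) {
    if (message_ready.load(std::memory_order_relaxed)) {
        // Compile-time ordering only: no barrier instruction is emitted.
        std::atomic_signal_fence(std::memory_order_acquire);
        seen_in_handler = message;
    }
}

// Hypothetical driver: publishes a value, then signals the current thread.
int send_to_own_handler(int value) {
    std::signal(SIGINT, on_signal);
    message = value;
    std::atomic_signal_fence(std::memory_order_release);
    message_ready.store(1, std::memory_order_relaxed);
    std::raise(SIGINT);  // handler runs on this thread before raise returns
    return seen_in_handler;
}
```

Delivering the signal goes through the kernel, of course, but the fence itself compiles to nothing.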


You can look at code-gen yourself on the Godbolt compiler explorer:

#include <atomic>
void barrier_sc(void) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

x86: mfence
POWER: sync
AArch64: dmb ish (full barrier on the "inner shareable" coherence domain)
ARM with gcc -mcpu=cortex-a15 (or -march=armv7): dmb ish
RISC-V: fence iorw,iorw

void barrier_acq_rel(void) {
    std::atomic_thread_fence(std::memory_order_acq_rel);
}

x86: nothing
POWER: lwsync (light-weight sync)
AArch64: still dmb ish
ARM: still dmb ish
RISC-V: still fence iorw,iorw

void barrier_acq(void) {
    std::atomic_thread_fence(std::memory_order_acquire);
}

x86: nothing
POWER: lwsync (light-weight sync)
AArch64: dmb ishld (load barrier, doesn't have to drain the store buffer)
ARM: still dmb ish, even with -mcpu=cortex-a53 (an ARMv8) :/
RISC-V: still fence iorw,iorw

In both this question and the referenced one you are mixing up two things:

  • synchronization primitives at the assembler level, like cmpxchg and fences
  • process/thread synchronization, like futexes

What does "it involves the kernel" mean? I guess you mean "(p)thread synchronization": the thread is put to sleep and will be woken as soon as the given condition is met by another process/thread.

However, test-and-set primitives like cmpxchg and memory fences are functionality provided by the microprocessor's instruction set. The kernel synchronization primitives are ultimately built on top of them to provide system and process synchronization, using shared state in kernel space hidden behind kernel calls.

You can look at the futex source to get evidence of it.

But no, memory fences don't involve the kernel: they are translated into simple assembler operations. The same goes for cmpxchg.
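To illustrate the distinction, here is a minimal spinlock sketch built only on compare-exchange (which typically becomes lock cmpxchg on x86). SpinLock and count_with_spinlock are names invented for this example. A waiting thread just keeps retrying in user space; a real mutex would instead fall back to something like a futex call to sleep when contended, and that is the part that involves the kernel.

```cpp
#include <atomic>
#include <thread>

// Hypothetical user-space-only lock: waiters spin, they never sleep.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Retry until we are the one who flips false -> true.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = false;  // a failed CAS overwrote it with the current value
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};

// Two threads increment a plain counter under the lock.
long count_with_spinlock(int per_thread) {
    SpinLock lock;
    long total = 0;
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i) {
            lock.lock();
            ++total;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return total;
}
```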
