
Does the memory fence involve the kernel

After asking this question, I've understood that atomic instructions, such as test-and-set, don't involve the kernel. The kernel only gets involved if a process needs to be put to sleep (to wait to acquire the lock) or woken up (because the lock it was waiting for is now available), since those require scheduling operations.

If so, does that mean that memory fences, such as std::atomic_thread_fence in C++11, also don't involve the kernel?

std::atomic doesn't involve the kernel¹

On almost all normal CPUs (the kind we program for in real life), memory barrier instructions are unprivileged and are emitted directly by the compiler, the same way compilers know how to emit instructions like x86 lock add [rdi], eax for fetch_add (or lock xadd if you use the return value). On other ISAs, they're literally the same barrier instructions compilers use before/after loads, stores, and RMWs to give the required ordering. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
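To make that concrete, here is a minimal sketch: a plain fetch_add on a std::atomic<int> is pure user-space code. On x86 it compiles to a single unprivileged lock xadd (or lock add if the return value is unused), with no system call anywhere in the path (the function name bump is just for illustration):

```cpp
#include <atomic>
#include <cassert>

// fetch_add on a lock-free std::atomic<int>: one unprivileged RMW
// instruction (x86: lock xadd), no kernel involvement at all.
int bump(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_seq_cst);
}
```

You can paste this into Godbolt and confirm there is no call instruction in the output, just the atomic RMW itself.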

On some arbitrary hypothetical hardware and/or compiler, anything is of course possible, even if it would be catastrophically bad for performance.

In asm, a barrier just makes this core wait until some previous (program-order) operations are visible to other cores. It's a purely local operation. (At least, this is how real-world CPUs are designed, so that sequential consistency is recoverable with only local barriers to control local ordering of load and/or store operations. All cores share a coherent view of cache, maintained via a protocol like MESI. Non-coherent shared-memory systems exist, but implementations don't run C++ std::thread across them, and they typically don't run a single-system-image kernel.)

Footnote 1: (Even non-lock-free atomics usually use light-weight locking).

Also, ARM before ARMv7 apparently didn't have proper memory barrier instructions. On ARMv6, GCC uses mcr p15, 0, r0, c7, c10, 5 as a barrier.
Before that (g++ -march=armv5 and earlier), GCC doesn't know what to do and calls __sync_synchronize (a libatomic GCC helper function), which hopefully is implemented somehow for whatever machine the code is actually running on. This may involve a system call on a hypothetical ARMv5 multi-core system, but more likely the binary will be running on an ARMv7 or v8 system where the library function can run a dmb ish. Or if it's a single-core system, it could be a no-op, I think. (C++ memory ordering cares about other C++ threads, not about memory order as seen by possible hardware devices / DMA. Implementations normally assume a multi-core system, but this library function might be a case where a single-core-only implementation could be used.)


On x86, for example, std::atomic_thread_fence(std::memory_order_seq_cst) compiles to mfence. Weaker barriers like std::atomic_thread_fence(std::memory_order_release) only have to block compile-time reordering: x86's runtime hardware memory model is already acq/rel (seq-cst plus a store buffer), so no asm instructions correspond to the barrier. (One possible implementation for a C++ library would be GNU C asm("" ::: "memory");, but GCC/clang do have barrier builtins.)
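The classic use of those standalone fences is message passing with relaxed atomics; the release fence orders the data store before the flag store, and the acquire fence orders the flag load before the data load. On x86 both fences cost zero instructions (they only stop the compiler from reordering). A minimal sketch (the producer/consumer names are just for illustration):

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                            // plain store
    std::atomic_thread_fence(std::memory_order_release);  // no asm on x86
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) { }    // spin until flag set
    std::atomic_thread_fence(std::memory_order_acquire);  // no asm on x86
    return data;                                          // guaranteed to see 42
}
```

The fence/fence pairing synchronizes through the relaxed store and load of ready, so the read of data is not a data race; on a weakly-ordered ISA the same source would emit real barrier instructions.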

std::atomic_signal_fence only ever has to block compile-time reordering, even on weakly-ordered ISAs, because all real-world ISAs guarantee that execution within a single thread sees its own operations as happening in program order. (Hardware implements this by having loads snoop the store buffer of the current core.) Even VLIW and IA-64 EPIC, or other explicit-parallelism ISA mechanisms (like Mill with its delayed-visibility loads), make it possible for the compiler to generate code that respects any C++ ordering guarantees involving the barrier, if an async signal (or an interrupt, for kernel code) arrives after any instruction.
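A sketch of the single-thread case, assuming a POSIX-style environment where std::raise delivers the signal synchronously to the calling thread (the handler/publish names are illustrative). The fence emits no machine instructions; it only stops the compiler from sinking the payload store below the flag store:

```cpp
#include <atomic>
#include <csignal>

volatile std::sig_atomic_t payload = 0;
volatile std::sig_atomic_t flag = 0;

// Signal handler running in the same thread that called publish():
// program order within one thread is enough, no hardware barrier needed.
extern "C" void handler(int) {
    if (flag) {
        payload = payload + 1;  // safe: the fence ordered the writes
    }
}

void publish() {
    payload = 41;
    std::atomic_signal_fence(std::memory_order_release);  // compile-time only
    flag = 1;
}
```

If the two "threads" were actually different cores, you'd need std::atomic_thread_fence (and real atomics) instead.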


You can look at code-gen yourself on the Godbolt compiler explorer:

#include <atomic>
void barrier_sc(void) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

x86: mfence.
POWER: sync.
AArch64: dmb ish (full barrier on the "inner shareable" coherence domain).
ARM with gcc -mcpu=cortex-a15 (or -march=armv7): dmb ish
RISC-V: fence iorw,iorw

void barrier_acq_rel(void) {
    std::atomic_thread_fence(std::memory_order_acq_rel);
}

x86: nothing
POWER: lwsync (light-weight sync).
AArch64: still dmb ish
ARM: still dmb ish
RISC-V: still fence iorw,iorw

void barrier_acq(void) {
    std::atomic_thread_fence(std::memory_order_acquire);
}

x86: nothing
POWER: lwsync (light-weight sync).
AArch64: dmb ishld (load barrier, doesn't have to drain the store buffer)
ARM: still dmb ish, even with -mcpu=cortex-a53 (an ARMv8) :/
RISC-V: still fence iorw,iorw

In both this question and the referenced one, you are mixing:

  • synchronization primitives at the assembly level, like cmpxchg and fences
  • process/thread synchronization, like futexes

What does "it involves the kernel" mean? I guess you mean "(p)thread synchronization": the thread is put to sleep and will be woken as soon as the given condition is met by another process/thread.

However, test-and-set primitives like cmpxchg and memory fences are features provided by the processor's instruction set. The kernel's synchronization primitives are ultimately built on top of them to provide system- and process-level synchronization, using shared state in kernel space hidden behind system calls.

You can look at the futex source code to see this.

But no, memory fences don't involve the kernel: they are translated into simple assembly instructions, just like cmpxchg.
