
Clang doesn't inline std::atomic::load for loading 64-bit structs

Consider the following code, which uses a std::atomic to atomically load a 64-bit object.

#include <atomic>

struct A {
    int32_t x, y;
};

A f(std::atomic<A>& a) {
    return a.load(std::memory_order_relaxed);
}

With GCC, good things happen, and the following code is generated (https://godbolt.org/z/zS53ZF).

f(std::atomic<A>&):
        mov     rax, QWORD PTR [rdi]
        ret

This is exactly what I'd expect, since I see no reason why a 64-bit struct shouldn't be able to be treated like any other 64-bit word in this situation.

With Clang, however, the story is different. Clang generates the following (https://godbolt.org/z/d6uqrP).

f(std::atomic<A>&):                     # @f(std::atomic<A>&)
        push    rax
        mov     rsi, rdi
        mov     rdx, rsp
        mov     edi, 8
        xor     ecx, ecx
        call    __atomic_load
        mov     rax, qword ptr [rsp]
        pop     rcx
        ret
        mov     rdi, rax
        call    __clang_call_terminate
__clang_call_terminate:                 # @__clang_call_terminate
        push    rax
        call    __cxa_begin_catch
        call    std::terminate()

This is problematic for me for several reasons:

  1. Most obviously, there are far more instructions, so I'd expect the code to be less efficient.
  2. Less obviously, the generated code also calls the library function __atomic_load, which means that my binary needs to be linked with libatomic. I would therefore need different lists of libraries to link depending on whether users of my code use GCC or Clang.
  3. The library function might use a lock, which would hurt performance.

The important question on my mind right now is whether there is a way to get Clang to also convert the load into a single instruction. We are using this as part of a library that we plan to distribute to others, so we cannot rely on a particular compiler being used. The solution suggested to me so far is to use type punning and store the struct inside a union alongside a 64-bit int, since Clang does correctly load 64-bit ints atomically in one instruction. I am skeptical of this solution, however, since although it appears to work on all major compilers, I have read that it is in fact undefined behaviour. Such code is also not particularly friendly for others to read and understand if they are not familiar with the trick.

To summarize, is there a way to atomically load a 64-bit struct that:

  1. Works in both Clang and GCC, and preferably most other popular compilers,
  2. Generates a single instruction when compiled,
  3. Is not undefined behaviour,
  4. Is reader friendly?

This clang missed optimization only happens with libstdc++; clang on Godbolt inlines as we expect for -stdlib=libc++: https://godbolt.org/z/Tt8XTX

It seems that giving the struct 64-bit alignment is sufficient to hand-hold clang.

libstdc++'s std::atomic template does that for types that are small enough to be atomic when naturally aligned, but perhaps clang++ is only seeing the alignment of the underlying type, not the class member of atomic<T>, in the libstdc++ implementation. I haven't investigated; someone should report this to the clang / LLVM bugzilla.

#include <atomic>
#include <stdint.h>  // you forgot this header.

struct A {
    alignas(2 * sizeof(int32_t)) int32_t x;
    int32_t y;  // this one must be separate, otherwise y would also be aligned -> 16-byte object
};

A f(std::atomic<A>& a) {
    return a.load(std::memory_order_relaxed);
}

Aligning by the struct size makes it agnostic of alignof(int64_t), which on a 32-bit ABI might only be 4. (And I didn't use alignas(8), to avoid over-alignment on systems where char is 32-bit and sizeof(int64_t) = 2.) This may be needlessly complicated, and alignas(int64_t) is easier to read, even though it's not always the same thing as giving this struct natural alignment.
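As a rough sketch of how the alignment requirement could be checked at compile time (not something the original code did): the static_asserts below confirm that the struct has no padding and is naturally aligned. The helper names a_is_always_lock_free and load_relaxed are made up for illustration, and std::atomic<T>::is_always_lock_free assumes C++17 or later.

```cpp
#include <atomic>
#include <cstdint>

struct A {
    // Over-align the first member so the whole struct gets natural (size) alignment.
    alignas(2 * sizeof(std::int32_t)) std::int32_t x;
    std::int32_t y;
};

// If either of these fails, the alignment trick did not give the expected layout.
static_assert(sizeof(A) == 2 * sizeof(std::int32_t), "A should have no padding");
static_assert(alignof(A) == sizeof(A), "A should be naturally aligned");

// Hypothetical helper: reports whether the implementation promises a
// lock-free std::atomic<A> on this target (C++17 and later).
constexpr bool a_is_always_lock_free() {
    return std::atomic<A>::is_always_lock_free;
}

// Hypothetical helper: same load as f() in the example above.
A load_relaxed(std::atomic<A>& a) {
    return a.load(std::memory_order_relaxed);
}
```

Whether the load actually compiles to a single mov still depends on the compiler and standard library, so it is worth checking the generated assembly on the targets you care about.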

Godbolt

# clang++ 9.0  -std=gnu++17 -O3;  g++ is the same
f(std::atomic<A>&):
        mov     rax, qword ptr [rdi]
        ret

BTW, no, the libatomic library function won't use a lock; it does know that 8-byte aligned loads are naturally atomic and that other threads will be using plain loads/stores, not locks.

Older clang at least uses call __atomic_load_8 instead of the generic variable-sized one, but that's still a big missed optimization.

Fun fact: clang -m32 will use lock cmpxchg8b to implement an 8-byte atomic load, instead of using SSE or fild like GCC does. :/
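If changing the struct's alignment isn't an option, the type-punning idea from the question can be written without a union. The following is only a sketch under assumptions: the wrapper name AtomicA and its helpers to_bits / from_bits are invented for illustration, and it assumes A is trivially copyable and exactly 8 bytes. It keeps a std::atomic<std::uint64_t> and converts with std::memcpy, which is well-defined for trivially copyable types and which compilers normally optimize away.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct A {
    std::int32_t x, y;
};

// Hypothetical wrapper: stores A's bytes inside an atomic 64-bit integer.
class AtomicA {
public:
    explicit AtomicA(A v) noexcept : bits_(to_bits(v)) {}

    A load(std::memory_order order = std::memory_order_seq_cst) const noexcept {
        return from_bits(bits_.load(order));
    }

    void store(A v, std::memory_order order = std::memory_order_seq_cst) noexcept {
        bits_.store(to_bits(v), order);
    }

private:
    static_assert(sizeof(A) == sizeof(std::uint64_t), "A must be exactly 64 bits");
    static_assert(std::is_trivially_copyable<A>::value, "memcpy requires a trivially copyable type");

    // Copy the object representation of A into an integer (no union punning).
    static std::uint64_t to_bits(A v) noexcept {
        std::uint64_t b;
        std::memcpy(&b, &v, sizeof b);
        return b;
    }

    // Copy the integer's bytes back into an A.
    static A from_bits(std::uint64_t b) noexcept {
        A v;
        std::memcpy(&v, &b, sizeof v);
        return v;
    }

    std::atomic<std::uint64_t> bits_;
};
```

This trades some readability for not depending on how a particular standard library aligns std::atomic<A>; the alignas approach above is simpler when you control the struct definition.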
