How is atomic_flag implemented?

Question

How is atomic_flag implemented? It feels to me that on x86-64 it is equivalent to atomic_bool anyway, but that is just a guess. Might the x86-64 implementation differ from the ARM or 32-bit x86 one?

Answer 1

Yeah, on normal CPUs where atomic<bool> and atomic<int> are also lock-free, it's pretty much like atomic<bool>, using the same instructions. (x86 and x86-64 have the same set of atomic operations available.)
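
As a rough way to check the "also lock-free" part on your own target, here is a minimal sketch assuming a C++17 compiler. The names bool_is_lock_free, int_is_lock_free and first_then_second are made up for illustration and are not from the code in this answer.

```cpp
#include <atomic>

// atomic_flag is the only type the standard guarantees to be lock-free.
// For atomic<bool> and atomic<int> it depends on the target, which C++17
// lets us query at compile time.
constexpr bool bool_is_lock_free = std::atomic<bool>::is_always_lock_free;
constexpr bool int_is_lock_free  = std::atomic<int>::is_always_lock_free;

// test_and_set() returns the previous value and leaves the flag set.
bool first_then_second() {
    std::atomic_flag f = ATOMIC_FLAG_INIT;   // starts clear
    bool first  = f.test_and_set();          // false: the flag was clear
    bool second = f.test_and_set();          // true: the flag was already set
    return !first && second;
}
```

On mainstream x86-64 and ARM toolchains both constants are expected to be true, which is the case where atomic_flag has no reason to differ from atomic<bool>.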

You might think that it would always use x86 lock bts or lock btr to set / reset (clear) a single bit, but it can be more efficient to do other things (especially for a function that returns a bool instead of branching on it). The object is a whole byte, so you can just store or exchange the whole byte. (And if the ABI guarantees that the value is always 0 or 1, you don't have to booleanize it before returning the result as a bool.)

GCC and clang compile test_and_set to a byte exchange, and clear to a byte store of 0. We get (nearly) identical asm for atomic_flag test_and_set as for f.exchange(true) on an atomic<bool>:

#include <atomic>

bool TAS(std::atomic_flag &f) {
    return f.test_and_set();
}

bool TAS_bool(std::atomic<bool> &f) {
    return f.exchange(true);
}


void clear(std::atomic_flag &f) {
    //f = 0; // deleted
    f.clear();
}

void clear_relaxed(std::atomic_flag &f) {
    f.clear(std::memory_order_relaxed);
}

void bool_clear(std::atomic<bool> &f) {
    f = false; // allowed for atomic<bool>: a seq_cst store, same as f.store(false)
}

This is on Godbolt for x86-64 with GCC and clang, and for ARMv7 and AArch64.

## GCC9.2 -O3 for x86-64
TAS(std::atomic_flag&):
        mov     eax, 1
        xchg    al, BYTE PTR [rdi]
        ret
TAS_bool(std::atomic<bool>&):
        mov     eax, 1
        xchg    al, BYTE PTR [rdi]
        test    al, al
        setne   al                      # missed optimization, doesn't need to booleanize to 0/1
        ret
clear(std::atomic_flag&):
        mov     BYTE PTR [rdi], 0
        mfence                          # memory fence to drain store buffer before future loads
        ret
clear_relaxed(std::atomic_flag&):
        mov     BYTE PTR [rdi], 0      # x86 stores are already mo_release, no barrier
        ret
bool_clear(std::atomic<bool>&):
        mov     BYTE PTR [rdi], 0
        mfence
        ret

Note that xchg is also an efficient way to do a seq_cst store on x86-64, usually more efficient than the mov + mfence that gcc uses. Clang uses xchg for all of these (except the relaxed store).
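
If you want to compare the two code-gen strategies yourself, one option is to write the seq_cst store as an exchange and discard the result. This is only a sketch: clear_via_exchange and clear_via_store are hypothetical helpers, not part of the code above, and the exact asm depends on compiler version and options.

```cpp
#include <atomic>

// Hypothetical helper: a seq_cst "store" written as an exchange whose
// result is thrown away. On x86-64 this is expected to compile to a single
// xchg, which is already a full barrier, so no mfence is needed.
void clear_via_exchange(std::atomic<bool> &f) {
    (void)f.exchange(false);
}

// The plain seq_cst store, for comparison. With the GCC version shown
// above this was mov + mfence.
void clear_via_store(std::atomic<bool> &f) {
    f.store(false);
}
```

Both functions leave the flag false and have the same ordering guarantees; the only difference is which instructions the compiler picks.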

Amusingly, clang re-booleanizes to 0/1 after the xchg in atomic_flag.test_and_set(), while GCC instead does it after the atomic<bool> exchange. clang does a weird and al,1 in TAS_bool, which would treat values like 2 as false. It seems totally pointless; the ABI guarantees that a bool in memory is always stored as a 0 or 1 byte.

For ARM, we have ldrexb / strexb exchange retry loops, or just strb + dmb ish for the pure store. AArch64 can instead use stlrb wzr, [x0] for clear or assign-false, doing a sequentially consistent release store (of the zero register) without needing a separate barrier.

Answer 2

On most/sane architectures an interrupt can happen before or after a hardware instruction is executed, not "in between" its execution. So either the instruction "happens" (i.e. with its side effects) or it does not happen.

For example, a 16-bit architecture most probably has hardware instructions that operate on 16-bit variables in a single instruction. So incrementing a 16-bit variable will be a single instruction, storing a value in a 16-bit variable will be a single instruction, etc. Locking is not needed for 16-bit variables, as the increment either happens or does not happen, atomically. It's impossible on this architecture to observe the "mid-execution" state of an increment of a 16-bit variable. It is a single instruction, so it can't be interrupted "in between" by any signal or interrupt.

A 16-bit architecture may lack an instruction to increment a 64-bit variable in one step. It may need many, many instructions to do operations on 64-bit variables. So operations on std::atomic<uint64_t> need additional synchronization instructions inserted by the compiler to implement its functionality, to implement synchronization with other std::atomic variables, etc.
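
One way to see where a given target draws that line is to ask the implementation whether a type is always lock-free. The sketch below assumes C++17; Wide is a made-up 64-byte struct standing in for "wider than anything the hardware can handle in one atomic instruction", and the constant names are also illustrative.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 64-byte payload, far wider than any single atomic
// instruction on mainstream hardware.
struct Wide {
    std::uint64_t words[8];
};

// C++17: true only if every object of the type is lock-free on this target.
constexpr bool word_sized_lock_free = std::atomic<int>::is_always_lock_free;
constexpr bool wide_lock_free       = std::atomic<Wide>::is_always_lock_free;
```

On common desktop and server targets word_sized_lock_free is expected to be true and wide_lock_free false. A type that is not lock-free still works as std::atomic, but the implementation has to add its own synchronization (typically a hidden lock) around each operation.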

But operations on 16-bit variables on this architecture are single instructions, so the compiler doesn't need to do anything with them; the side effects will always be visible everywhere after the instruction executes.

So atomic_flag is most probably just a variable that has the size of the word on a particular processor, so that the processor can operate on it with single instructions. In practice that is an int, but int is not guaranteed to correspond to the word size of the processor, and accesses to an int are not guaranteed to be atomic. I believe atomic_flag is typically the same as sig_atomic_t from POSIX (POSIX docs). Additionally, atomic_flag constrains its operations to bool-like ones only: clear, set and notify.

solution1
7 ACCEPTED 2020-01-05 16:08:56

solution2
0 2020-01-05 16:10:48
