
Can std::atomic be used sometimes instead of std::mutex in C++?

I suppose that std::atomic can sometimes replace usages of std::mutex. But is it always safe to use atomic instead of mutex? Example code:

std::atomic_flag f, ready; // shared

// ..... Thread 1 (and others) ....
while (true) {
    // ... Do some stuff in the beginning ...
    while (f.test_and_set()); // spin, acquire system lock
    if (ready.test()) {
        UseSystem(); // .... use our system for 50-200 nanoseconds ....
    }
    f.clear(); // release lock
    // ... Do some stuff at the end ...
}

// ...... Thread 2 .....
while (true) {
    // ... Do some stuff in the beginning ...
    InitSystem();
    ready.test_and_set(); // signify system ready
    // .... sleep for 10-30 milli-seconds ....
    while (f.test_and_set()); // acquire system lock
    ready.clear(); // signify system shutdown
    f.clear(); // release lock
    DeInitSystem(); // finalize/destroy system
    // ... Do some stuff at the end ...
}

Here I use std::atomic_flag to protect use of my system (some complex library). But is this code safe? I assume that if ready is false then the system is not available and I can't use it, and if it is true then it is available and I can use it. For simplicity, suppose the code above doesn't throw exceptions.

Of course I can use std::mutex to protect reads/modifications of my system. But right now I need very high-performance code in Thread-1, which should use atomics instead of mutexes (Thread-2 can be slow and use mutexes if needed).

In Thread-1 the system-usage code (inside the while loop) runs very often, each iteration taking around 50-200 nanoseconds, so using extra mutexes would be too heavy. But Thread-2 iterations are quite long: as you can see, in each iteration of its while loop, when the system is ready it sleeps for 10-30 milliseconds, so using mutexes only in Thread-2 is quite alright.

Thread-1 is an example of one thread; in my real project there are several threads running the same (or very similar) code as Thread-1.

I'm concerned about memory-operation ordering: it could sometimes happen that the system is not yet in a fully consistent state (not yet fully initialized) when ready becomes true as observed by Thread-1. It may also happen that ready becomes false in Thread-1 too late, when the system has already performed some destruction (deinit) operations. Also, as you can see, the system can be initialized/destroyed many times in the loop of Thread-2 and used many times in Thread-1 whenever it is ready.

Can my task be solved somehow without std::mutex and other heavy machinery in Thread-1, using only std::atomic (or std::atomic_flag)? Thread-2 can use heavy synchronization if needed, mutexes, etc.

Basically, Thread-2 should somehow propagate the whole initialized state of the system to all cores and other threads before ready becomes true, and it should also propagate ready equal to false before any single operation of system destruction (deinit) is done. By propagating state I mean that all of the system's initialized data should be written consistently to global memory and to the caches of other cores, so that other threads see a fully consistent system whenever ready is true.

It is even allowed to make a small (milliseconds) pause after system init and before ready is set to true if it improves the situation and guarantees, and likewise a pause after ready is set to false and before starting system destruction (deinit). Doing some expensive CPU operations in Thread-2 is also alright, if there exists some operation like "propagate all Thread-2 writes to global memory and the caches of all other CPU cores and threads".
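If I understand correctly, what I'm describing sounds like release/acquire semantics on ready. For illustration, here is a minimal sketch of the publish side only (using std::atomic&lt;bool&gt; instead of std::atomic_flag; InitSystem()/UseSystem() are the placeholders from my code above, and the lock f would still be needed so that use doesn't race with deinit):

#include <atomic>

void InitSystem(); // placeholders from the code above (declarations only, for the sketch)
void UseSystem();

std::atomic<bool> ready{false};

// Thread 2: publish the fully initialized system.
void PublishReady() {
    InitSystem();                                 // all init writes happen before...
    ready.store(true, std::memory_order_release); // ...this release store, which "propagates" them
}

// Thread 1: use the system only if the release store has been observed.
void TryUse() {
    if (ready.load(std::memory_order_acquire)) {  // acquire pairs with the release store,
        UseSystem();                              // so all init writes are visible here
    }
}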

Update: As a solution to my question above, in my project I decided for now to use the following code with std::atomic_flag to replace std::mutex:

std::atomic_flag f = ATOMIC_FLAG_INIT; // shared
// .... Later in all threads ....
while (f.test_and_set(std::memory_order_acquire)) // try acquiring
    std::this_thread::yield();
shared_value += 5; // Any code, it is lock-protected.
f.clear(std::memory_order_release); // release

The solution above takes 9 nanoseconds per iteration on average (measured over 2^25 operations) in a single thread (release build) on my Windows 10 64-bit 2 GHz 2-core laptop, while using std::unique_lock<std::mutex> lock(mux); for the same protection takes 100-120 nanoseconds on the same Windows PC. If threads need to spin instead of sleeping while waiting, then instead of std::this_thread::yield(); in the code above I just use an empty statement ;. Full online example of usage and time measurements.
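For reference, a minimal sketch of how such a single-thread timing loop might look (an illustration, not my exact benchmark; absolute numbers will of course differ per machine):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

int main() {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    std::int64_t shared_value = 0;
    const std::int64_t iters = std::int64_t(1) << 25; // 2^25 lock/unlock iterations

    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iters; ++i) {
        while (f.test_and_set(std::memory_order_acquire)) // acquire
            std::this_thread::yield();
        shared_value += 5; // the lock-protected work
        f.clear(std::memory_order_release); // release
    }
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - start).count();

    std::cout << "avg ns per iteration: " << double(ns) / double(iters)
              << " (shared_value = " << shared_value << ")\n";
}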

I'll ignore your code for the sake of the answer; the answer, generally, is yes.

A lock does the following things:

  1. allows only one thread to acquire it at any given time
  2. when the lock is acquired, a read barrier is placed
  3. right before the lock is released, a write barrier is placed

The combination of the 3 points above makes the critical section thread-safe: only one thread can touch the shared memory, all prior changes are observed by the locking thread because of the read barrier, and all the changes it makes become visible to other locking threads because of the write barrier.
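To illustrate how those three points map onto atomics, here is a minimal toy sketch of a spinlock (an illustration only, not how your standard library implements its locks):

#include <atomic>

struct TinyLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock() {
        // (1) only one thread gets past this loop at any given time;
        // (2) memory_order_acquire acts as the "read barrier" taken on acquisition
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }

    void unlock() {
        // (3) memory_order_release acts as the "write barrier" placed on release
        flag.clear(std::memory_order_release);
    }
};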

Can you use atomics to achieve this? Yes. And real-life locks (provided, for example, by Win32/POSIX) ARE implemented either by using atomics and lock-free programming directly, or by building on locks that themselves use atomics and lock-free programming.

Now, realistically speaking, should you use a self-written lock instead of the standard locks? Absolutely not.

Many concurrency tutorials perpetuate the notion that spin-locks are "more efficient" than regular locks. I can't stress enough how foolish that is. A user-mode spinlock IS NEVER more efficient than a lock that the OS provides. The reason is simple: OS locks are wired into the OS scheduler. So if a thread tries to lock a lock and fails, the OS knows to freeze this thread and not reschedule it to run until the lock has been released.

With user-mode spinlocks, this doesn't happen. The OS can't know that the relevant thread is trying to acquire the lock in a tight loop. Yielding is just a patch and not a solution: we want to spin for a short time, then go to sleep until the lock is released. With user-mode spinlocks, we might waste the entire thread quantum trying to lock the spinlock and yielding.

I will say, for the sake of honesty, that recent C++ standards do give us the ability to sleep on an atomic waiting for it to change its value. So we can, in a very lame way, implement our own "real" locks that try to spin for a while and then sleep until the lock is released. However, implementing a correct and efficient lock when you're not a concurrency expert is pretty much impossible.
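For illustration only, here is a rough sketch of such a spin-then-sleep lock using the C++20 std::atomic_flag wait/notify_one API (a toy, not something I recommend shipping):

#include <atomic>

class SpinThenWaitLock {
    std::atomic_flag flag_; // value-initialized to clear since C++20
public:
    void lock() {
        // Spin briefly in user mode first.
        for (int i = 0; i < 64; ++i)
            if (!flag_.test_and_set(std::memory_order_acquire))
                return;
        // Then sleep on the atomic until it is cleared and we manage to set it.
        while (flag_.test_and_set(std::memory_order_acquire))
            flag_.wait(true, std::memory_order_relaxed); // blocks while the flag is still set
    }

    void unlock() {
        flag_.clear(std::memory_order_release);
        flag_.notify_one(); // wake one sleeping waiter, if any
    }
};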

My own philosophical opinion is that in 2021, developers should rarely deal with these very low-level concurrency topics. Leave those things to the kernel guys. Use some high-level concurrency library and focus on the product you want to develop rather than micro-optimizing your code. This is concurrency, where correctness >>> efficiency.

A related rant by Linus Torvalds
