Why is std::mutex faster than std::atomic?

Question

I want to put objects in std::vector in multi-threaded mode. So I decided to compare two approaches: one uses std::atomic and the other std::mutex . I see that the second approach is faster than the first one. Why?

I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.

#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            while(sync.exchange(true)) {
                std::this_thread::yield();
            };
            first_result[counter] = counter;
            sync.store(false) ;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex; 
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            std::unique_lock<std::mutex> lock(mutex);       
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    return 0;
}

Answer 1

I don't think your question can be answered referring only to the standard- mutexes are as platform-dependent as they can be. However, there is one thing, that should be mentioned.

Mutexes are not slow. You may have seen some articles, that compare their performance against custom spin-locks and other "lightweight" stuff, but that's not the right approach - these are not interchangeable.

Spin locks are considerably fast, when they are locked (acquired) for a relatively short amount of time - acquiring them is very cheap, but other threads, that are also trying to lock, are active for whole this time (running constantly in loop).

Custom spin-lock could be implemented this way:

class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
    : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while(_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear();
    }
};

Mutex is a primitive, that is much more complicated. In particular, on Windows, we have two such primitives - Critical Section , that works in per-process basis and Mutex , which doesn't have such limitation.

Locking mutex (or critical section) is much more expensive, but OS has the ability to really put other waiting threads to "sleep", which improves performance and helps tasks scheduler in efficient resources management.

Why I write this? Because modern mutexes are often so-called "hybrid mutexes". When such mutex is locked, it behaves like a normal spin-lock - other waiting threads perform some number of "spins" and then heavy mutex is locked to prevent from wasting resources.

In your case, mutex is locked in each loop iteration to perform this instruction:

second_result[counter] = omp_get_thread_num();

It looks like a fast one, so "real" mutex may never be locked. That means, that in this case your "mutex" can be as fast as atomic-based solution (because it becomes an atomic-based solution itself).

Also, in the first solution you used some kind of spin-lock-like behaviour, but I am not sure if this behaviour is predictable in multi-threaded environment. I am pretty sure, that "locking" should have acquire semantics, while unlocking is a release op. Relaxed memory ordering may be too weak for this use case.

I edited the code to be more compact and correct. It uses the std::atomic_flag , which is the only type (unlike std::atomic<> specializations), that is guaranteed to be lock-free (even std::atomic<bool> does not give you that).

Also, referring to the comment below about "not yielding": it is a matter of specific case and requirements. Spin locks are very important part of multi-threaded programming and their performance can often be improved by slightly modifying its behavior. For example, Boost library implements spinlock::lock() as follows:

void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}

source: boost/smart_ptr/detail/spinlock_std_atomic.hpp

Where detail::yield() is (Win32 version):

inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}

[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]

First, thread spins for some fixed number of times (4 in this case). If mutex is still locked, pause instruction is used (if available) or Sleep(0) is called, which basically causes context-switch and allows scheduler to give another blocked thread a chance to do something useful. Then, Sleep(1) is called to perform actual (short) sleep. Very nice!

Also, this statement:

The purpose of a spinlock is busy waiting

is not entirely true. The purpose of spinlock is to serve as a fast, easy-to-implement lock primitive - but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's usage of _mm_pause() as a method of yielding inside lock() ):

In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain.

So, implementations like void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); } void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); } may not be as good as it seems.

Answer 2

There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves a store (such as exchange or test_and_set ). On typical modern architectures, these operations generate instructions that require the cache line with a lock memory location to be in the exclusive state, which is extremely time-consuming (especially, when multiple threads are spinning at the same time). Always spin on load/read only and try to acquire the lock only when there is a chance that this operation will succeed.

A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++

Why is std::mutex faster than std::atomic?

Question

2 answers

solution1
33 ACCPTED 2015-04-09 09:29:58

solution2
0 2021-05-13 10:36:18

Why is std::mutex faster than std::atomic?

Question

2 answers

solution1 33 ACCPTED 2015-04-09 09:29:58

solution2 0 2021-05-13 10:36:18

solution1
33 ACCPTED 2015-04-09 09:29:58

solution2
0 2021-05-13 10:36:18