
Why is std::mutex faster than std::atomic?

I want to put objects into a std::vector in multi-threaded mode, so I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?

I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.

#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <thread>   // std::this_thread::yield
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            while (sync.exchange(true)) {
                std::this_thread::yield();
            }
            first_result[counter] = counter;
            sync.store(false);
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex; 
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            std::unique_lock<std::mutex> lock(mutex);       
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    return 0;
}

I don't think your question can be answered by referring only to the standard: mutexes are as platform-dependent as they can be. However, there is one thing that should be mentioned.

Mutexes are not slow. You may have seen some articles that compare their performance against custom spin-locks and other "lightweight" primitives, but that's not the right approach: these are not interchangeable.

Spin locks are considerably fast when they are locked (acquired) for a relatively short amount of time: acquiring them is very cheap, but the other threads that are also trying to lock stay active the whole time, running constantly in a loop.

A custom spin-lock could be implemented this way:

class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
    : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while(_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear(std::memory_order_release);
    }
};

A mutex is a much more complicated primitive. In particular, on Windows we have two such primitives: the Critical Section, which works on a per-process basis, and the Mutex, which doesn't have that limitation.

Locking a mutex (or critical section) is much more expensive, but the OS has the ability to really put the other waiting threads to "sleep", which improves performance and helps the task scheduler manage resources efficiently.

Why do I write this? Because modern mutexes are often so-called "hybrid mutexes". When such a mutex is locked, it behaves like a normal spin-lock: the other waiting threads perform some number of "spins", and only then is the heavy mutex locked, to prevent wasting resources.

In your case, the mutex is locked in each loop iteration to perform this instruction:

second_result[counter] = counter;

It looks like a fast one, so the "real" mutex may never be locked. That means that in this case your "mutex" can be as fast as the atomic-based solution (because it becomes an atomic-based solution itself).

Also, in the first solution you used some kind of spin-lock-like behaviour, but I am not sure whether this behaviour is predictable in a multi-threaded environment. I am pretty sure that "locking" should have acquire semantics, while unlocking should be a release operation. Relaxed memory ordering may be too weak for this use case.


I edited the code to be more compact and correct. It uses std::atomic_flag, which is the only type (unlike the std::atomic<> specializations) that is guaranteed to be lock-free (even std::atomic<bool> does not give you that guarantee).

Also, referring to the comment below about "not yielding": it is a matter of the specific case and requirements. Spin locks are a very important part of multi-threaded programming, and their performance can often be improved by slightly modifying their behavior. For example, the Boost library implements spinlock::lock() as follows:

void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}

Source: boost/smart_ptr/detail/spinlock_std_atomic.hpp

where detail::yield() is (Win32 version):

inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}

[Source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]

First, the thread spins a fixed number of times (4 in this case). If the mutex is still locked, the pause instruction is used (if available) or Sleep(0) is called, which basically causes a context switch and allows the scheduler to give another blocked thread a chance to do something useful. Then Sleep(1) is called to perform an actual (short) sleep. Very nice!

Also, this statement:

The purpose of a spinlock is busy waiting

is not entirely true. The purpose of a spinlock is to serve as a fast, easy-to-implement lock primitive, but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's usage of _mm_pause() as a method of yielding inside lock()):

In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain.

So implementations like void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); } may not be as good as they seem.

There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves a store (such as exchange or test_and_set). On typical modern architectures, these operations generate instructions that require the cache line holding the lock's memory location to be in the exclusive state, which is extremely time-consuming (especially when multiple threads are spinning at the same time). Always spin on a load/read only, and try to acquire the lock only when there is a chance that the operation will succeed.

A nice relevant article is, for instance: Correctly implementing a spinlock in C++
