
Is synchronizing with `std::mutex` slower than with `std::atomic(memory_order_seq_cst)`?

The main reason for using atomics over mutexes is that mutexes are expensive, but with the default memory order for atomics being `memory_order_seq_cst`, isn't this just as expensive?

Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?

If so, it may not be worth the effort unless I want to use `memory_order_acq_rel` for atomics.


Edit: I may be missing something, but lock-based can't be faster than lock-free, because each lock will have to be a full memory barrier too. But with lock-free, it's possible to use techniques that are less restrictive than memory barriers.

So back to my question: is lock-free any faster than lock-based in the new C++11 standard with the default memory model?

Is "lock-free >= lock-based when measured in performance" true? Let's assume 2 hardware threads.


Edit 2: My question is not about progress guarantees, and maybe I'm using "lock-free" out of context.

Basically, when you have 2 threads with shared memory, and the only guarantee you need is that if one thread is writing then the other thread can't read or write, my assumption is that a simple atomic compare-and-swap operation would be much faster than locking a mutex.

Because if one thread never even touches the shared memory, you will end up locking and unlocking over and over for no reason, but with atomic operations you only use one CPU instruction each time.

Regarding the comments: a spin-lock and a mutex lock behave very differently when there is very little contention.

Lock-free programming is about progress guarantees: from strongest to weakest, these are wait-free, lock-free, obstruction-free, and blocking.

A guarantee is expensive and comes at a price. The more guarantees you want, the more you pay. Generally, a blocking algorithm or data structure (with a mutex, say) has the greatest liberties, and thus is potentially the fastest. A wait-free algorithm at the other extreme must use atomic operations at every step, which may be much slower.

Obtaining a lock is actually rather cheap, so you should never worry about that without a deep understanding of the subject. Moreover, blocking algorithms with mutexes are much easier to read, write, and reason about. By contrast, even the simplest lock-free data structures are the result of long, focused research, each of them worth one or more PhDs.

In a nutshell, lock-free or wait-free algorithms trade worst-case latency for mean latency and throughput. Everything is slower, but nothing is ever very slow. This is a very special characteristic that is only useful in specific situations (like real-time systems).

A lock tends to require more operations than a simple atomic operation does. In the simplest cases, `memory_order_seq_cst` will be about twice as fast as locking, because locking tends to require at minimum two atomic operations in its implementation (one to lock, one to unlock). In many cases it takes even more than that. However, once you start leveraging the memory orders, it can be much faster, because you are willing to accept less synchronization.

Also, you'll often see "locking algorithms are always as fast as lock-free algorithms." This is somewhat true. The basic idea is that if the fastest algorithm happens to be lock-free, then the fastest algorithm without the lock-free guarantee is ALSO the same algorithm! However, if the fastest algorithm requires locks, then those demanding lock-free guarantees have to go find a slower algorithm.

In general, you will see lock-free algorithms in a few low-level algorithms, where the performance gained by leveraging specialized opcodes helps. In almost all other code, locking gives more than satisfactory performance, and is much easier to read.

Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?

It can be faster: a lock-free algorithm must keep the global state consistent at all times, and must do calculations without knowing whether they will be productive, since the state might have changed by the time the calculation is done, making it irrelevant and wasting CPU cycles.

The lock-free strategy makes the serialization happen at the end of the process, when the calculation is done. In a pathological case many threads can each make an effort, only one effort will be productive, and the others will retry.

Lock-free can lead to starvation of some threads, whatever their priority is, and there is no way to avoid that. (Although it's unlikely for a thread to starve retrying for very long unless there is crazy contention.)

On the other hand, "serialized calculation and series of side effects" based (a.k.a. lock-based) algorithms will not start before they know they will not be prevented by other actors from operating on that specific locked resource (the guarantee is provided by the use of a mutex). Note that they might be prevented from finishing by the need to access another resource: if multiple locks are taken, this leads to possible deadlock in a badly designed program.

Note that this deadlock issue isn't in the scope of lock-free code, which can't even act on multiple entities: it usually can't do an atomic commit based on two unrelated objects (1).

So the absence of any chance of deadlock in lock-free code is a sign of the weakness of lock-free code: not being able to deadlock is a limit of your tool. A system that can only hold one lock at a time also wouldn't be able to deadlock.

The scope of lock-free algorithms is minuscule compared to the scope of lock-based algorithms. For a lot of problems, lock-free doesn't even make sense.

A lock-based algorithm is polite: the threads have to wait in line before doing what they need to do, which is maximally efficient in terms of computation steps per thread. But having to queue threads on a wait list is inefficient: they often can't use the end of their time slice, like someone trying to do serious work while being interrupted by the phone all the time; their concentration is gone and they can never reach maximum efficiency because their work time is cut into small pieces.

(1) You would at least need to be able to do a double CAS for that, that is, an operation atomic across two arbitrary addresses (not a double-word CAS, which is just a CAS on more bits and can trivially be implemented up to the natural CPU memory-access arbitration unit, the cache line).


