
Lock-free multi-threading is for real threading experts

I was reading through an answer that Jon Skeet gave to a question, and in it he mentioned this:

As far as I'm concerned, lock-free multi-threading is for real threading experts, of which I'm not one.

It's not the first time that I have heard this, but I find very few people talking about how you actually do it if you are interested in learning how to write lock-free multi-threaded code.

So my question is: besides learning all you can about threading in general, where do you start trying to learn to write lock-free multi-threaded code specifically, and what are some good resources?

Cheers

Current "lock-free" implementations follow the same pattern most of the time:

  • read some state and make a copy of it*
  • modify the copy*
  • do an interlocked operation
  • retry if it fails

(*optional: depends on the data structure/algorithm)
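A minimal Java sketch of those four steps (the `LockFreeCounter` name is illustrative; `AtomicInteger.compareAndSet` stands in for the interlocked operation, much like .NET's `Interlocked.CompareExchange`):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative lock-free counter following the read / modify-a-copy /
// interlocked-op / retry pattern described above.
class LockFreeCounter {
    private final AtomicInteger state = new AtomicInteger(0);

    int increment() {
        while (true) {
            int current = state.get();     // 1. read some state
            int updated = current + 1;     // 2. modify a (local) copy
            if (state.compareAndSet(current, updated)) { // 3. interlocked op
                return updated;            // success: our update won
            }
            // 4. CAS failed: another thread got in first - retry
        }
    }

    int get() { return state.get(); }
}
```

The retry loop only spins while other threads are actively making progress on the same memory location, which is exactly the fine-grained contention behavior discussed below.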

The last bit is eerily similar to a spinlock. In fact, it is a basic spinlock. :)
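Stripped to its essentials, such a basic spinlock can be sketched like this in Java (the `SpinLock` class is illustrative; a production lock would add back-off and fairness):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal test-and-set spinlock: the same CAS-and-retry shape as the
// "lock-free" pattern above, just with a dedicated flag as the state.
class SpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        // Spin until we flip the flag from false to true.
        while (!held.compareAndSet(false, true)) {
            Thread.onSpinWait(); // CPU hint (Java 9+); real locks also back off
        }
    }

    void unlock() {
        held.set(false); // volatile-style write: releases the critical section
    }
}
```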
I agree with @nobugz on this: the cost of the interlocked operations used in lock-free multi-threading is dominated by the cache and memory-coherency tasks they must carry out.

What you gain, however, with a data structure that is "lock-free" is that your "locks" are very fine-grained. This decreases the chance that two concurrent threads access the same "lock" (memory location).

The trick most of the time is that you do not have dedicated locks - instead you treat, e.g., all elements in an array or all nodes in a linked list as a "spin-lock". You read, modify, and try to update if there was no update since your last read. If there was, you retry.
This makes your "locking" (oh, sorry, non-locking :) very fine-grained, without introducing additional memory or resource requirements.
Making it more fine-grained decreases the probability of waits. Making it as fine-grained as possible without introducing additional resource requirements sounds great, doesn't it?

Most of the fun, however, can come from ensuring correct load/store ordering.
Contrary to one's intuitions, CPUs are free to reorder memory reads/writes - they are very smart, by the way: you will have a hard time observing this from a single thread. You will, however, run into issues when you start to do multi-threading on multiple cores. Your intuitions will break down: just because an instruction is earlier in your code, it does not mean that it will actually happen earlier. CPUs can process instructions out of order, and they especially like to do this to instructions with memory accesses, to hide main memory latency and make better use of their cache.
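A minimal Java sketch of the classic publish/consume situation this affects (the `Publisher` name is illustrative; `volatile` supplies the load/store ordering here, playing the role a memory barrier plays in lower-level code):

```java
// Publishing data to another thread: the write to 'data' must not be
// reordered after the write to 'ready', and the reader's load of 'data'
// must not be hoisted before its load of 'ready'. Declaring 'ready'
// volatile gives exactly those ordering guarantees in Java; without it,
// the reader could legally see the flag set but stale data.
class Publisher {
    private int data;                 // plain field
    private volatile boolean ready;   // the release/acquire point

    void publish(int value) {
        data = value;   // 1. ordinary write
        ready = true;   // 2. volatile write: "releases" the data write
    }

    int consume() {
        while (!ready) { Thread.onSpinWait(); } // spin until the flag is seen
        return data;    // guaranteed to observe the published value
    }
}
```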

Now, it is surely against intuition that a sequence of code does not flow "top-down" but instead runs as if there was no sequence at all - it may well be called "the devil's playground". I believe it is infeasible to give an exact answer as to which load/store re-orderings will take place. Instead, one always speaks in terms of mays and mights and cans, and prepares for the worst. "Oh, the CPU might reorder this read to come before that write, so it is best to put a memory barrier right here, on this spot."

Matters are complicated by the fact that even these mays and mights can differ across CPU architectures. It might be the case, for example, that something that is guaranteed not to happen on one architecture might happen on another.


To get "lock-free" multi-threading right, you have to understand memory models.
Getting the memory model and its guarantees correct is not trivial, however, as demonstrated by this story, whereby Intel and AMD made some corrections to the documentation of MFENCE, causing some stir-up among JVM developers. As it turned out, the documentation that developers relied on from the beginning was not so precise in the first place.

Locks in .NET result in an implicit memory barrier, so you are safe using them (most of the time, that is... see for example this Joe Duffy - Brad Abrams - Vance Morrison greatness on lazy initialization, locks, volatiles, and memory barriers. :) (Be sure to follow the links on that page.)

As an added bonus, you will get introduced to the .NET memory model on a side quest. :)

There is also an "oldie but goldie" from Vance Morrison: What Every Dev Must Know About Multithreaded Apps.

...and of course, as @Eric mentioned, Joe Duffy is a definitive read on the subject.

A good STM can get as close to fine-grained locking as it gets, and will probably provide performance that is close to or on par with a hand-made implementation. One of them is STM.NET from Microsoft's DevLabs projects.

If you are not a .NET-only zealot, Doug Lea did some great work in JSR-166.
Cliff Click has an interesting take on hash tables that does not rely on lock-striping - as the Java and .NET concurrent hash tables do - and seems to scale well to 750 CPUs.

If you are not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can destroy performance: What every programmer should know about memory.

@Ben made many comments about MPI: I sincerely agree that MPI may shine in some areas. An MPI-based solution can be easier to reason about, easier to implement, and less error-prone than a half-baked locking implementation that tries to be smart. (It is, however - subjectively - also true for an STM-based solution.) I would also bet that it is light-years easier to correctly write a decent distributed application in, e.g., Erlang, as many successful examples suggest.

MPI, however, has its own costs and its own troubles when it is run on a single multi-core system. E.g., in Erlang, there are issues to be solved around the synchronization of process scheduling and message queues.
Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This, for example, means that there is an inevitable context switch between lightweight processes. It is true that it is not a "classic context switch" but mostly a user-space operation, and it can be made fast - however, I sincerely doubt that it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower, even in the Intel McRT library. N:M scheduling with lightweight processes is not new. LWPs were there in Solaris for a long time. They were abandoned. There were fibers in NT. They are mostly a relic now. There were "activations" in NetBSD. They were abandoned. Linux had its own take on the subject of N:M threading. It seems to be somewhat dead by now.
From time to time, there are new contenders: for example McRT from Intel, or most recently User-Mode Scheduling together with ConCRT from Microsoft.
At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system - might benefit greatly on SMP systems by exploiting the new UMS.

I guess the OP's question is not about the merits of, and subjective arguments for/against, any solution, but if I had to answer that, I guess it depends on the task: for building low-level, high-performance basic data structures that run on a single system with many cores, either low-lock/"lock-free" techniques or an STM will yield the best results in terms of performance, and would probably beat an MPI solution any time performance-wise, even if the above wrinkles are ironed out, e.g. in Erlang.
For building anything moderately more complex that runs on a single system, I would perhaps choose classic coarse-grained locking or, if performance is of great concern, an STM.
For building a distributed system, an MPI system would probably make a natural choice.
Note that there are MPI implementations for .NET as well (though they seem to be not as active).

Joe Duffy's book:

http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html

He also writes a blog on these topics.

The trick to getting low-lock programs right is to understand at a deep level precisely what the rules of the memory model are on your particular combination of hardware, operating system, and runtime environment.

I personally am not anywhere near smart enough to do correct low-lock programming beyond InterlockedIncrement, but if you are, great, go for it. Just make sure that you leave lots of documentation in the code, so that people who are not as smart as you don't accidentally break one of your memory-model invariants and introduce an impossible-to-find bug.

There is no such thing as "lock-free threading" these days. It was an interesting playground for academia and the like, back at the end of the last century, when computer hardware was slow and expensive. Dekker's algorithm was always my favorite; modern hardware has put it out to pasture. It doesn't work anymore.

Two developments have ended this: the growing disparity between the speed of RAM and the CPU, and the ability of chip manufacturers to put more than one CPU core on a chip.

The RAM speed problem required the chip designers to put a buffer on the CPU chip. The buffer stores code and data, quickly accessible by the CPU core, and can be read from and written to RAM at a much slower rate. This buffer is called the CPU cache; most CPUs have at least two of them. The 1st-level cache is small and fast, the 2nd is big and slower. As long as the CPU can read data and instructions from the 1st-level cache, it will run fast. A cache miss is really expensive: it puts the CPU to sleep for as many as 10 cycles if the data is not in the 1st cache, and as many as 200 cycles if it isn't in the 2nd cache either and needs to be read from RAM.

Every CPU core has its own cache; they each store their own "view" of RAM. When the CPU writes data, the write is made to the cache, which is then, slowly, flushed to RAM. Inevitably, each core will now have a different view of the RAM contents. In other words, one CPU doesn't know what another CPU has written until that RAM write cycle completes and the CPU refreshes its own view.

That is dramatically incompatible with threading. You always really care what the state of another thread is when you must read data that was written by another thread. To ensure this, you need to explicitly program a so-called memory barrier. It is a low-level CPU primitive that ensures that all CPU caches are in a consistent state and have an up-to-date view of RAM. All pending writes have to be flushed to RAM; the caches then need to be refreshed.

This is available in .NET: the Thread.MemoryBarrier() method implements one. Given that this is 90% of the job that the lock statement does (and 95+% of the execution time), you are simply not ahead by avoiding the tools that .NET gives you and trying to implement your own.

Google for lock-free data structures and software transactional memory.

I'll agree with Jon Skeet on this one; lock-free threading is the devil's playground, and best left to people who know that they know what they need to know.

Even though lock-free threading may be difficult in .NET, you can often make significant improvements when using a lock by studying exactly what needs to be locked and minimizing the locked section... this is also known as minimizing the lock granularity.

As an example, say you need to make a collection thread-safe. Don't just blindly throw a lock around a method iterating over the collection if it performs some CPU-intensive task on each item. You might only need to put a lock around creating a shallow copy of the collection. Iterating over the copy can then work without a lock. Of course this is highly dependent on the specifics of your code, but I have been able to fix a lock-convoy issue with this approach.
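A Java sketch of that copy-under-lock idea (the class and method names are illustrative; `synchronized` plays the role of .NET's `lock` statement):

```java
import java.util.ArrayList;
import java.util.List;

// Minimize lock granularity: hold the lock only long enough to take a
// shallow copy, then run the expensive per-item work outside the lock.
class Auditor {
    private final List<String> items = new ArrayList<>();
    private final Object gate = new Object();

    void add(String item) {
        synchronized (gate) { items.add(item); }
    }

    int processAll() {
        List<String> snapshot;
        synchronized (gate) {          // lock held only for the shallow copy
            snapshot = new ArrayList<>(items);
        }
        int work = 0;
        for (String s : snapshot) {    // CPU-intensive part: no lock held
            work += s.length();        // stand-in for real per-item work
        }
        return work;
    }
}
```

Writers can keep adding items while `processAll` churns through its snapshot, which is exactly how this breaks up a lock convoy.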

When it comes to multi-threading, you have to know exactly what you are doing. I mean: explore all the possible scenarios/cases that might occur when you are working in a multi-threaded environment. Lock-free multithreading is not a library or a class that we incorporate; it's knowledge/experience that we earn during our journey with threads.
