
Why are OS threads considered expensive?

There are many solutions geared toward implementing "user-space" threads, be it golang.org goroutines, Python's green threads, C#'s async, Erlang's processes, etc. The idea is to allow concurrent programming even with a single or limited number of threads.

What I don't understand is: why are OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.

The OS provides both of these functions for free. Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?

I want to amend Tudor's answer, which is a good starting point. There are two main overheads of threads:

  1. Starting and stopping them. Involves creating a stack and kernel objects. Involves kernel transitions and global kernel locks.
  2. Keeping their stack around.

(1) is only a problem if you are creating and stopping them all the time. This is commonly solved using thread pools. I consider this problem to be practically solved. Scheduling a task on a thread pool usually does not involve a trip to the kernel, which makes it very fast. The overhead is on the order of a few interlocked memory operations and a few allocations.
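
For illustration only, here is a minimal Go sketch of that pool pattern (goroutines stand in for the pooled workers; the worker count and task payload are invented for the example): the workers are started once, and handing them work is just a buffered-channel send, roughly a few atomic operations rather than a kernel call.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	tasks := make(chan int, 128) // enqueueing work = a cheap channel send
	var wg sync.WaitGroup

	// Start a small, fixed set of workers once; they are reused for every task.
	const numWorkers = 4
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for t := range tasks {
				fmt.Printf("worker %d handled task %d\n", id, t)
			}
		}(w)
	}

	// Scheduling 20 tasks involves no thread creation at all.
	for i := 0; i < 20; i++ {
		tasks <- i
	}
	close(tasks)
	wg.Wait()
}
```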

(2) becomes important only if you have many threads (> 100 or so). In this case async IO is a means to get rid of the threads. I found that if you don't have insane amounts of threads, synchronous IO, including blocking, is slightly faster than async IO (you read that right: sync IO is faster).

There are many solutions geared toward implementing "user-space" threads, be it golang.org goroutines, Python's green threads, C#'s async, Erlang's processes, etc. The idea is to allow concurrent programming even with a single or limited number of threads.

It's an abstraction layer. It's easier for many people to grasp this concept and use it more effectively in many scenarios. It's also easier for many machines (assuming a good abstraction), since the model moves from width to pull in many cases. With pthreads (as an example), you have all the control. With other threading models, the idea is to reuse threads, make the process of creating a concurrent task inexpensive, and use a completely different threading model. This model is far easier to digest; there's less to learn and measure, and the results are generally good.

What I don't understand is: why are OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.

Creating a thread is expensive, and the stack requires memory. Also, if your process is using many threads, then context switching can kill performance. So lightweight threading models became useful for a number of reasons. Creating an OS thread became a good solution for medium to large tasks, ideally in low numbers. That's restrictive, and quite time-consuming to maintain.
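
As a rough, hedged illustration of the stack cost (exact numbers depend on Go version, platform, and defaults): the sketch below spawns 100,000 goroutines, each starting with a stack of only a few kilobytes that grows on demand. Creating the same number of OS threads, each with a fixed stack reservation (commonly hundreds of kilobytes to several megabytes by default), would be far heavier or simply fail.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	const n = 100_000
	var wg sync.WaitGroup
	block := make(chan struct{})

	// Each goroutine is created with a small, growable stack and then parks.
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			<-block
		}()
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("goroutines: %d, approx heap+stack in use: %d MB\n",
		runtime.NumGoroutine(), (m.HeapInuse+m.StackInuse)/(1<<20))

	close(block) // release all goroutines
	wg.Wait()
}
```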

A task/thread pool/userland thread does not need to worry about much of the context switching or thread creation. It's often "reuse the resource when it becomes available, if it's not ready now -- also, determine the number of active threads for this machine".

More commonly (IMO), OS-level threads are expensive because they are not used correctly by the engineers -- either there are too many and there is a ton of context switching, there is competition for the same set of resources, or the tasks are too small. It takes much more time to understand how to use OS threads correctly, and how to apply that best to the context of a program's execution.

The OS provides both of these functions for free.

They're available, but they are not free. They are complex, and very important to good performance. When you create an OS thread, it's given time 'soon' -- all the process's time is divided among the threads. That's not the common case with user threads. The task is often enqueued when the resource is not available. This reduces context switching, memory, and the total number of threads which must be created. When the task exits, the thread is given another.

Consider this analogy of time distribution:

  • Assume you are at a casino. There are a number of people who want cards.
  • You have a fixed number of dealers. There are fewer dealers than people who want cards.
  • There are not always enough cards for every person at any given time.
  • People need all their cards to complete their game/hand. They return their cards to the dealer when their game/hand is complete.

How would you ask the dealers to distribute cards?

Under the OS scheduler, that would be based on (thread) priority. Every person would be given one card at a time (CPU time), and priority would be evaluated continually.

The people represent the task or thread's work. The cards represent time and resources. The dealers represent threads and resources.

How would you deal fastest if there were 2 dealers and 3 people? And if there were 5 dealers and 500 people? How could you minimize running out of cards to deal? With threads, adding cards and adding dealers is not a solution you can deliver 'on demand'. Adding CPUs is equivalent to adding dealers. Adding threads is equivalent to dealers dealing cards to more people at a time (which increases context switching). There are a number of strategies to deal cards more quickly, especially after you eliminate the people's need for cards within a certain amount of time. Would it not be faster to go to a table and deal to a person or people until their game is complete, if the dealer-to-people ratio were 1/50? Compare this to visiting every table based on priority and coordinating visitation among all dealers (the OS approach). That's not to imply the OS is stupid -- it implies that creating an OS thread is an engineer adding more people and more tables, potentially more than the dealers can reasonably handle. Fortunately, the constraints may be lifted in many cases by using other multithreading models and higher abstractions.

Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?

If you developed a performance-critical, low-level threading library (e.g. upon pthreads), you would recognize the importance of reuse (and implement it in your library as a model available to users). From that angle, the importance of higher-level multithreading models is a simple and obvious solution/optimization based on real-world usage, as well as the ideal that the bar for adopting and effectively utilizing multithreading can be lowered.

It's not that they are expensive -- the lightweight-threads model and pool is a better solution for many problems, and a more appropriate abstraction for engineers who do not understand threads well. The complexity of multithreading is greatly simplified (and often more performant in real-world usage) under this model. With OS threads, you do have more control, but several more considerations must be made to use them as effectively as possible -- heeding these considerations can dramatically reflow a program's execution/implementation. With higher-level abstractions, many of these complexities are minimized by completely altering the flow of task execution (width vs. pull).

Saving the stack is trivial, no matter what its size -- the stack pointer needs to be saved in the Thread Info Block in the kernel (and so, usually, most of the registers as well, since they will have been pushed by whatever soft/hard interrupt caused the OS to be entered).

One issue is that a protection-level ring-cycle is required to enter the kernel from user space. This is an essential, but annoying, overhead. Then the driver or system call has to do whatever was requested by the interrupt, and then comes the scheduling/dispatching of threads onto processors. If this results in the preemption of a thread from one process by a thread from another, a load of extra process context has to be swapped as well. Even more overhead is added if the OS decides that a thread running on a different processor core than the one handling the interrupt must be preempted -- the other core must be hardware-interrupted (this is on top of the hard/soft interrupt that entered the OS in the first place).

So, a scheduling run may be quite a complex operation.

'Green threads' or 'fibers' are (usually) scheduled from user code. A context change is much easier and cheaper than an OS interrupt etc. because no Wagnerian ring-cycle is required on every context change, the process context does not change, and the OS thread running the green-thread group does not change.

Since something-for-nothing does not exist, there are problems with green threads. They are run by 'real' OS threads. This means that if one 'green' thread in a group run by one OS thread makes an OS call that blocks, all green threads in the group are blocked. This means that simple calls like sleep() have to be 'emulated' by a state machine that yields to other green threads (yes, just like re-implementing the OS). Similarly for any inter-thread signalling.
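
As a hedged sketch of what that emulation buys you, using Go's runtime as the example (its scheduler implements sleep as a timer plus a yield rather than a blocking system call): even when restricted to one OS thread running Go code via GOMAXPROCS(1), a thousand goroutines 'sleeping' concurrently finish in roughly one second rather than a thousand.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // only one OS thread executes Go code at a time

	start := time.Now()
	done := make(chan struct{})
	for i := 0; i < 1000; i++ {
		go func() {
			// Parks the goroutine in the runtime's timer machinery and
			// yields the OS thread; it does not block the thread itself.
			time.Sleep(1 * time.Second)
			done <- struct{}{}
		}()
	}
	for i := 0; i < 1000; i++ {
		<-done
	}
	// If each sleep had blocked the underlying OS thread, this would take ~1000 s.
	fmt.Println("elapsed:", time.Since(start).Round(time.Millisecond))
}
```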

Also, of course, green threads cannot directly respond to IO signaling, somewhat defeating the point of having any threads in the first place.

The problem with starting kernel threads for each small task is that it incurs a non-negligible overhead to start and stop, coupled with the stack size it needs.

This is the first important point: thread pools exist so that you can recycle threads, in order to avoid wasting time starting them as well as wasting memory for their stacks.

Secondly, if you fire off threads to do asynchronous I/O, they will spend most of their time blocked waiting for the I/O to complete, thus effectively not doing any work and wasting memory. A much better option is to have a single worker handle multiple async calls (through some under-the-hood scheduling technique, such as multiplexing), thus again saving memory and time.
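
A hedged Go sketch of that single-worker multiplexing idea: the 'async calls' are modelled here as channels that deliver a result later (a real implementation would sit on epoll/kqueue/IOCP, as Go's netpoller does), and one worker selects on whichever operation completes first instead of parking one blocked thread per operation. The fakeIO helper and its timings are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// fakeIO stands in for an asynchronous operation: it returns immediately and
// delivers its result on a channel once the simulated round-trip completes.
func fakeIO(name string, d time.Duration) <-chan string {
	ch := make(chan string, 1)
	go func() {
		time.Sleep(d)
		ch <- name + " done"
	}()
	return ch
}

func main() {
	a := fakeIO("request A", 300*time.Millisecond)
	b := fakeIO("request B", 100*time.Millisecond)
	c := fakeIO("request C", 200*time.Millisecond)

	// A single worker handles all three in-flight operations, in completion order.
	for i := 0; i < 3; i++ {
		select {
		case msg := <-a:
			fmt.Println(msg)
		case msg := <-b:
			fmt.Println(msg)
		case msg := <-c:
			fmt.Println(msg)
		}
	}
}
```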

One thing that makes "green" threads faster than kernel threads is that they are user-space objects, managed by a virtual machine. Starting them is a user-space call, while starting a thread is a kernel-space call that is much slower.
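
As a rough illustration rather than a claim about exact numbers, a Go micro-benchmark sketch: spawning and joining a goroutine needs no per-goroutine thread-creation system call, so it can be done in a tight loop (save it as a `_test.go` file and run `go test -bench=.`).

```go
package spawn

import (
	"sync"
	"testing"
)

// BenchmarkGoroutineSpawn measures the cost of starting one goroutine and
// waiting for it to finish -- a user-space operation in the Go runtime.
func BenchmarkGoroutineSpawn(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			wg.Done()
		}()
		wg.Wait()
	}
}
```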

A person at Google shows an interesting approach.

According to him, kernel-mode switching itself is not the bottleneck; the core cost happens in the SMP scheduler. He also claims that kernel-assisted M:N scheduling wouldn't be expensive, which makes me expect general M:N threading to be available in every language.

Because of the OS. Imagine that instead of asking you to clean the house, your grandmother has to call the social service, which does some paperwork and, a week later, assigns a social worker to help her. The worker can be called off at any time and replaced with another one, which again takes several days.

That's pretty ineffective and slow, huh?

In this metaphor, you are a userland coroutine scheduler, the social service is an OS with its kernel-level thread scheduler, and the social worker is a fully-fledged thread.

I think the two things are at different levels.

A thread or process is an instance of the program being executed. A process/thread has much more in it: an execution stack, open files, signals, processor status, and many other things.

A greenlet is different: it runs in a VM. It supplies a lightweight thread. Many such systems supply pseudo-concurrency (typically on a single OS-level thread, or a few of them). And often they supply a lock-free approach via data transmission instead of data sharing.
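
A minimal Go sketch of that 'data transmission instead of data sharing' idea (the counter and the task count are invented for illustration): many lightweight tasks send values to a single owner goroutine over a channel, so user code needs no mutex around the shared state.

```go
package main

import "fmt"

func main() {
	increments := make(chan int)
	total := make(chan int)

	// Owner goroutine: the only code that ever touches the counter.
	go func() {
		sum := 0
		for v := range increments {
			sum += v
		}
		total <- sum
	}()

	// 100 lightweight tasks transmit data instead of sharing it.
	done := make(chan struct{})
	for i := 0; i < 100; i++ {
		go func() {
			increments <- 1
			done <- struct{}{}
		}()
	}
	for i := 0; i < 100; i++ {
		<-done
	}
	close(increments)

	fmt.Println("total:", <-total) // prints: total: 100
}
```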

So, the two things have different focuses, and so their weights are different.

And in my mind, the greenlet should be handled in the VM, not the OS.
