
How many threads to create and when?

I have a networking Linux application which receives RTP streams from multiple sources, performs very simple packet modification, and then forwards the streams to their final destination.

How do I decide how many threads I should use to process the data? I assume I cannot open a thread for each RTP stream, as there could be thousands. Should I take the number of CPU cores into account? What else matters? Thanks.

It is important to understand the purpose of using multiple threads on a server: many threads in a server serve to decrease latency rather than to increase throughput. More threads do not make the CPU faster; they make it more likely that a thread will be available within a given period to handle a request.

Having a bunch of threads which just move data in parallel is a rather inefficient shotgun approach (and creating one thread per request simply fails to scale). Using the thread pool pattern is a more effective, focused way to decrease latency.
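As a minimal sketch of the thread pool pattern (the `handle_request` function and sample data are illustrative stand-ins, not from the question): a fixed set of workers drains a queue of requests instead of each request spawning its own thread.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(packet):
    # Illustrative stand-in for the "simple packet modification" step.
    return packet.upper()

# A fixed pool of workers; incoming requests queue up behind them
# instead of each one creating a fresh thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, ["rtp1", "rtp2", "rtp3"]))

print(results)  # ['RTP1', 'RTP2', 'RTP3']
```

The pool size (`max_workers`) is the knob the following paragraphs discuss.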

Now, in the thread pool, you want at least as many threads as you have CPUs/cores. You can have more than this, but the extra threads will again only decrease latency, not increase throughput.

Think of organizing server threads as akin to organizing a checkout line in a supermarket. Would you rather have many cashiers who each work slowly, or one cashier who works super fast? The problem with the single fast cashier isn't speed, but that one customer with a lot of groceries can still hold everyone else up. The need for many threads comes from the possibility that a few requests will take a long time and block all your threads. By this reasoning, whether you benefit from many slower cashiers depends on whether customers arrive with similar amounts of groceries or wildly different amounts. Coming back to the basic model, this means you have to experiment with the thread count to figure out what is optimal for the particular characteristics of your traffic, looking at the time taken to process each request.

Classically, the number of reasonable threads depends on the number of execution units, the ratio of IO to computation, and the available memory.

Number of Execution Units (XU)

This counts how many threads can be active at the same time. Depending on your computations, this might or might not include things like hyperthreads -- mixed instruction workloads benefit from them more.

Ratio of IO to Computation (%IO)

If the threads never wait for IO but always compute (%IO = 0), using more threads than XUs only increases the overhead of memory pressure and context switching. If the threads always wait for IO and never compute (%IO = 1), then using a variant of poll() or select() might be a better idea.

For all other situations, XU / (1 - %IO) gives an approximation of how many threads are needed to fully use the available XUs.
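As a sanity check of that approximation (pure arithmetic, consistent with the two boundary cases above; the example numbers are made up):

```python
def estimated_threads(xu, io_fraction):
    """Approximate thread count needed to keep `xu` execution units
    busy when each thread spends `io_fraction` of its time waiting
    on IO (0 <= io_fraction < 1)."""
    return xu / (1.0 - io_fraction)

# Pure compute (%IO = 0): one thread per execution unit.
print(estimated_threads(4, 0.0))  # 4.0
# Threads waiting on IO half the time: twice as many threads.
print(estimated_threads(4, 0.5))  # 8.0
```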

Available Memory (Mem)

This is more of an upper limit. Each thread uses a certain amount of system resources (MemUse). Mem / MemUse gives you an approximation of how many threads the system can support.
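Putting the two rules of thumb together (a sketch; the memory figures are invented for illustration): size the pool for the IO ratio, but cap it at what memory allows.

```python
def pool_size(xu, io_fraction, mem_bytes, mem_per_thread):
    """Aim for enough threads to keep the XUs busy, but never
    more than available memory can support."""
    io_bound = xu / (1.0 - io_fraction)
    mem_bound = mem_bytes // mem_per_thread
    return int(min(io_bound, mem_bound))

# 8 cores at 75% IO wait suggests 32 threads; 2 GiB of memory
# at 64 MiB per thread happens to allow exactly 32 as well.
print(pool_size(8, 0.75, 2 * 1024**3, 64 * 1024**2))  # 32
```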

Other Factors

The performance of the whole system can still be constrained by other factors, even if you can guess or (better) measure the numbers above. For example, another service might be running on the system, using some of the XUs and memory. Another limit is the overall available IO bandwidth (IOCap). If you need less computing power per transferred byte than your XUs provide, you obviously need to worry less about using the XUs fully and more about increasing IO throughput.

For more about this latter problem, see this Google Talk about the Roofline Model.

I'd say, try using just ONE thread; it makes programming much easier. Although you'll need to use something like libevent to multiplex the connections, you won't have any unexpected synchronisation issues.
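A minimal sketch of the single-threaded multiplexing idea, using Python's standard `selectors` module in place of libevent (the echo handler is an illustrative stand-in for the real packet-forwarding logic):

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server):
    conn, _addr = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)  # real code would modify and forward here
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))  # ephemeral port for the sketch
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

def serve_once(timeout=None):
    # One thread handles every connection: wait for readiness,
    # then dispatch to the callback registered for that socket.
    for key, _mask in sel.select(timeout):
        key.data(key.fileobj)
```

A real server would call `serve_once` in a loop; `selectors` picks epoll/kqueue under the hood where available.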

Once you've got a working single-threaded implementation, you can do performance testing and decide whether a multi-threaded one is actually necessary.

Even if a multithreaded implementation is necessary, it may be easier to break the work into several processes instead of threads (i.e. not sharing an address space; either fork() or exec multiple copies of the process from a parent), provided they don't have a lot of shared data.

You could also consider using something like Python's "Twisted" to make the implementation easier (this is exactly what it's designed for).

Really, there's probably not a strong case for using threads over processes, but maybe there is in yours; it's difficult to say. It depends on how much data you need to share between threads.

I would look into a thread pool for this application.

http://threadpool.sourceforge.net/

Let the thread pool manage your threads and the work queue.

You can tweak the maximum and minimum number of threads later, based on performance profiling.

Listen to the people advising you to use libevent (or OS-specific utilities such as epoll/kqueue). With many connections this is an absolute must because, as you said, creating a thread per stream would be an enormous performance hit, and select() doesn't quite cut it either.

Let your program decide. Add code that measures throughput and dynamically increases or decreases the number of threads to maximize it.
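One way to sketch that feedback loop (a toy hill-climbing controller; the `measure` callback stands in for a real throughput measurement of the running server):

```python
def tune_threads(measure, start=2, lo=1, hi=64, steps=20):
    """Toy hill-climber: nudge the thread count up or down and keep
    moving in whichever direction improves measured throughput."""
    n, best = start, measure(start)
    step = 1
    for _ in range(steps):
        cand = max(lo, min(hi, n + step))
        t = measure(cand)
        if t > best:
            n, best = cand, t
        else:
            step = -step  # that direction got worse; try the other

    return n

# Stand-in throughput curve that peaks at 8 threads.
print(tune_threads(lambda n: -(n - 8) ** 2))  # 8
```

A production version would re-measure periodically, since the optimum shifts with traffic; but the principle is the same: treat the thread count as a tunable, not a constant.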

This way, your application will always perform well, regardless of the number of execution cores and other factors.

It is a good idea to avoid creating one (or even N) threads per client request. This approach is classically non-scalable, and you will definitely run into problems with memory usage or context switching. You should look at using a thread pool instead, treating incoming requests as tasks for any thread in the pool to handle. The scalability of this approach is then limited by the ideal number of threads in the pool, which is usually related to the number of CPU cores. You want each thread to use exactly 100% of a single core, so in the ideal case you would have one thread per core; this reduces context switching to zero. Depending on the nature of the tasks this might not be possible: maybe the threads have to wait for external data, or read from disk, so you may find that the thread count needs to be increased by some scaling factor.
