How do waiting threads affect performance?

I am writing a program that has a relatively expensive calculation when the worst-case scenario is reached. I have tried creating threads dynamically, and this works most of the time, but in the worst case the execution time exceeds the time I am allotted to complete these calculations, largely because of the cost of creating and destroying the threads. This has led me to an idea I have used in the past: create the threads before execution and have them wait on a condition before performing the calculation, instead of creating and destroying them dynamically.

Normally I wouldn't think twice about doing this, but because I will be creating a lot of threads while the system is initializing, I am concerned about how this will affect the performance of the system. This raises a question: how do threads that are waiting on a condition affect the system, if at all? Is creating the threads during the program's initialization and only notifying them when I need to perform a calculation the correct way to approach this problem, or is there a better solution that I am unaware of? I have also thought about using a thread pool for this. Would a thread pool be best for this situation?

Some information that you may find helpful to better answer this question:

--I am using the Boost library (version 1_54_0) to multithread the program.

--I am using Windows 7 and Visual Studio.

--If I create the threads when the program initializes, I will be creating 200-1000 threads (this number is predetermined as a #define, and I won't necessarily be using all of the threads every time I need to do the calculation).

--The number of threads needed varies each time I need to perform this calculation; it depends on the number of inputs received, which changes every time the calculation is performed but can never exceed a maximum value (the maximum is determined at compile time as a #define).

--The computer I am using has 32 cores.

I am sorry if this question isn't up to par; I am a new Stack Overflow user, so feel free to ask for more information and to critique how I can better explain the situation and the problem. Thank you in advance for your help!


Here is the source code (some variables have been renamed in compliance with my company's terms and conditions):

for(int i = curBlob.boundingBoxStartY; i < curBlob.boundingBoxStartY + curBlob.boundingBoxHeight; ++i)
    for(int j = curBlob.boundingBoxStartX; j < curBlob.boundingBoxStartX + curBlob.boundingBoxWidth; ++j)
        for(int k = 0; k < NUM_FILTERS; ++k)
            if((int)arrayOfBinaryValues[channel][k].at<uchar>(i,j) == 1)
                for(int p = 0; p < NUM_FILTERS; ++p)
                    if(p != k)
                        if((curBlob.boundingBoxStartX + 1 < (curBlob.boundingBoxStartX + curBlob.boundingBoxWidth)) && ((int)arrayOfBinaryValues[channel][k].at<uchar>(i + 1,j) == 1))

                        if((curBlob.boundingBoxStartY + 1 < (curBlob.boundingBoxStartY + curBlob.boundingBoxHeight)) && ((int)arrayOfBinaryValues[channel][k].at<uchar>(i,j + 1) == 1))

The source code is provided strictly to show the complexity of the algorithm.

If the threads are REALLY waiting, they won't consume much in the way of resources at all: just a bit of memory and a few slots in the scheduler's wait list. There will be a small amount of extra overhead to wake a thread or put it to sleep, as there is a little more data to process, but these queues are usually fairly efficient, so I doubt you'll be able to measure it in an application where the threads do some meaningful work.

Of course, if they periodically wake up, even if it's only once a second, 1000 threads that each wake up once a second means one context switch every millisecond on average, and that could affect performance.
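To make "really waiting" concrete, here is a minimal sketch of threads that block on a condition variable with a predicate and no timeout, so they do not wake up periodically. It uses the C++11 standard library rather than Boost 1.54 (an assumption on my part; Boost.Thread has analogous `boost::thread`, `boost::mutex` and `boost::condition_variable` types). The class name `ParkedThreads` and its members are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: starts `n` threads that block on a condition variable
// until release() is called; each thread then increments a counter once.
class ParkedThreads {
public:
    explicit ParkedThreads(int n) {
        assert(n > 0);
        for (int i = 0; i < n; ++i)
            threads_.emplace_back([this] { run(); });
    }

    // Wakes every parked thread, joins them, and returns how many ran.
    int release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            go_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
        threads_.clear();
        return ran_;
    }

    ~ParkedThreads() {
        if (!threads_.empty()) release();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        // No timeout: the thread stays blocked until it is notified and
        // go_ is true, so it is not scheduled while it waits.
        cv_.wait(lock, [this] { return go_; });
        ++ran_;  // still holding m_, so this is not a data race
    }

    std::mutex m_;
    std::condition_variable cv_;
    bool go_ = false;
    int ran_ = 0;
    std::vector<std::thread> threads_;
};
```

Constructing `ParkedThreads parked(8);` and later calling `parked.release()` wakes all eight threads at once. The predicate form of `wait` also guards against spurious wake-ups, which would otherwise let a thread proceed before it was released.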

I do, however, think that creating MANY threads is the wrong solution in nearly all cases. It may be the correct approach only when the logic in the threads is complex, there is a huge amount of state/context to track for each thread, and this state or context is not easy to store somewhere. In most cases, I'd say that a small number of worker threads plus a queue of work items (each including [some type of reference to] its respective state or context) will be a better way to achieve this.
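The worker-threads-plus-queue idea could be sketched roughly as follows. This is an illustration under stated assumptions and not a tuned implementation: it uses the C++11 standard library instead of Boost, and `WorkerPool`, `submit` and `sum_with_pool` are hypothetical names. Each work item is a `std::function<void()>` that captures whatever state or context it needs.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool: a few workers pull items from a shared queue.
class WorkerPool {
public:
    explicit WorkerPool(unsigned workers) {
        assert(workers > 0);
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { loop(); });
    }

    // Queues one work item and wakes one idle worker.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

    // Lets the workers drain the queue, then joins them.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the item outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Usage sketch: sum 1..n with `workers` threads, one work item per number.
long sum_with_pool(unsigned workers, int n) {
    std::atomic<long> total(0);
    {
        WorkerPool pool(workers);
        for (int i = 1; i <= n; ++i)
            pool.submit([&total, i] { total += i; });
    }  // the destructor drains the queue and joins the workers
    return total.load();
}
```

With this shape, the number of inputs (up to the compile-time maximum in the question) only changes how many items are queued, not how many threads exist.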

Edit based on the edit to the question:

Since (as far as I can tell) each thread is completely bound by CPU (or memory bandwidth), with no I/O or other "waiting around", the maximum performance will be achieved by running one thread per core in the system (possibly "minus one" for other stuff that needs doing, such as network communication, disk I/O, and general OS/system work).
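As a small sketch of the "one thread per core, possibly minus one" rule, the worker count can be derived from what the runtime reports. The function names below are hypothetical; `std::thread::hardware_concurrency()` is the C++11 standard call and is allowed to return 0 when the core count cannot be determined, so the sketch falls back to a single worker in that case.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: one worker per core, keeping `reserved` cores free
// for the OS and other work, but never fewer than one worker.
unsigned pick_worker_count(unsigned cores, unsigned reserved = 1) {
    if (cores == 0) cores = 1;  // core count unknown
    unsigned workers = cores > reserved ? cores - reserved : 1;
    assert(workers >= 1);
    return workers;
}

// Same rule applied to the machine the program is running on.
unsigned default_worker_count() {
    return pick_worker_count(std::thread::hardware_concurrency());
}
```

On the 32-core machine described in the question, `pick_worker_count(32)` gives 31 workers, far fewer than the 200-1000 threads being considered.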

Having more threads ready to run than there are cores on the CPU may even cause the processing to be SLOWER, because the OS now has multiple threads "fighting" for time, which causes extra thread-scheduling effort on the part of the OS. On top of that, when one thread runs, it loads the cache with the content it needs. When another thread gets to run on that same CPU core, it forces other data into the cache, and when the "old" thread gets to run again, even if it's on the same core, it has to reload the data it was using.

I will do a quick experiment and come back with some numbers from one of my projects...

So, I have a small project that calculates "weird numbers". I use it here as a comparison of the time it takes to run one vs. more threads. Each thread here uses fairly little memory, a few hundred bytes, so the cache will probably have no effect at all. The only variables here are therefore the "startup cost" and the marginal overhead due to competition between threads. The number of threads is set by the -t option. The -e option is "what number to stop at".

$ time ./weird -t 1 -e 50000 > /dev/null

real    0m6.393s
user    0m6.359s
sys     0m0.003s
$ time ./weird -t 2 -e 50000 > /dev/null

real    0m3.210s
user    0m6.376s
sys     0m0.013s
$ time ./weird -t 4 -e 50000 > /dev/null

real    0m1.643s
user    0m6.397s
sys     0m0.024s
$ time ./weird -t 8 -e 50000 > /dev/null

real    0m1.641s
user    0m6.397s
sys     0m0.028s
$ time ./weird -t 16 -e 50000 > /dev/null

real    0m1.644s
user    0m6.385s
sys     0m0.047s
$ time ./weird -t 256 -e 50000 > /dev/null

real    0m1.790s
user    0m6.420s
sys     0m0.342s
$ time ./weird -t 512 -e 50000 > /dev/null

real    0m1.779s
user    0m6.439s
sys     0m0.502s

As you can see, the time to "run" the whole project improves from 1 to 2 and from 2 to 4 threads. But running more than 4 threads gives almost identical results until we get to the hundreds (I skipped a few steps in doubling the number of threads).
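Expressed as speedup relative to the single-thread run, using only the `real` times listed above (the plateau at 4 threads suggests the test machine has 4 cores, which is an inference and not something stated outright):

```latex
\begin{align*}
S(n) &= \frac{T_{\text{real}}(1)}{T_{\text{real}}(n)} \\
S(2) &= \frac{6.393}{3.210} \approx 1.99 \\
S(4) &= \frac{6.393}{1.643} \approx 3.89 \\
S(8) &= \frac{6.393}{1.641} \approx 3.90 \\
S(512) &= \frac{6.393}{1.779} \approx 3.59
\end{align*}
```

So going from 4 to 8 or 16 threads buys essentially nothing, and at 512 threads the speedup drops slightly.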

Now, to show the scheduling overhead, I upped the number of "numbers to find" by passing a bigger number after -e (this also makes the process run for longer, as the bigger numbers are more complex to calculate).

$ time ./weird -t 512 -e 100000 > /dev/null

real    0m7.100s
user    0m26.195s
sys     0m1.542s
$ time ./weird -t 4 -e 100000 > /dev/null

real    0m6.663s
user    0m26.143s
sys     0m0.049s

Now, if it were ONLY the startup time that cost, we should see similar overhead (in sys) between the 512 threads going to 50000 and the 512 threads going to 100000, but we are seeing a number three times higher. So, out of 6-7 seconds, running 512 threads (at full speed) vs. running 4 threads wastes nearly 1.5s of CPU time (or about 0.4s per CPU). Sure, it's only about 5%, but 5% of wasted effort is still wasted. There are a lot of cases where a 5% improvement in an algorithm is "worth having".
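The arithmetic behind those figures, taken from the `sys` and `real` values in the runs above and assuming the 4 CPUs implied by "about 0.4s per CPU":

```latex
\begin{align*}
\frac{\text{sys}_{512,\,100000}}{\text{sys}_{512,\,50000}} &= \frac{1.542}{0.502} \approx 3.1 \\
\Delta\text{sys} &= 1.542 - 0.049 = 1.493\ \text{s} \\
\frac{\Delta\text{sys}}{4\ \text{CPUs}} &\approx 0.37\ \text{s per CPU} \\
\frac{0.37}{7.100} &\approx 5\%\ \text{of the wall-clock time}
\end{align*}
```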

Yes, this is an extreme case, and it could be argued that as long as most threads are waiting, it doesn't really matter.
