
How do waiting threads affect performance?

I am writing a program that performs a relatively expensive calculation when the worst-case scenario is reached. I have tried creating threads dynamically, and this works most of the time, but when the worst case comes around, the execution time exceeds the time I am allotted for these calculations, largely due to the cost of creating and destroying the threads. This has led me back to an idea I have used in the past: create the threads before execution rather than dynamically, and have them wait on a condition before performing the calculation.

Normally I wouldn't think twice about doing this, but because I will be creating lots of threads when the system initializes, I am concerned about how this will affect the performance of the system. This raises a question: how do threads that are waiting on a condition affect the system, if at all? Is creating the threads during program initialization and only notifying them when I need to perform a calculation the right way to approach this problem, or is there a better solution that I am unaware of? I have also thought about using a thread pool for this. Would a thread pool be best for this situation?

Some information that you may find helpful to better answer this question:

--I am using the boost library (version 1_54_0) in order to multithread the program.

--I am using Windows 7 and Visual Studio.

--If I create the threads when the program initializes, I will be creating 200-1000 threads (this number is predetermined as a #define and I won't necessarily be using all threads every time I need to do the calculation).

--The number of threads needed varies each time I need to perform this calculation; it is dependent on the number of inputs received which changes every time the calculation is performed, but can never exceed a maximum value (the maximum number being determined at compile time as a #define).

--The computer I am using has 32 cores.

I am sorry if this question isn't up to par; I am a new Stack Overflow user, so feel free to ask for more information and critique how I can better explain the situation and problem. Thank you in advance for your help!

UPDATE

Here is the source code (some variables have been renamed in compliance with my company's terms and conditions)

for(int i = curBlob.boundingBoxStartY; i < curBlob.boundingBoxStartY + curBlob.boundingBoxHeight; ++i)
{
    for(int j = curBlob.boundingBoxStartX; j < curBlob.boundingBoxStartX + curBlob.boundingBoxWidth; ++j)
    {
        for(int k = 0; k < NUM_FILTERS; ++k)
        {
            if((int)arrayOfBinaryValues[channel][k].at<uchar>(i,j) == 1)
            {
                for(int p = 0; p < NUM_FILTERS; ++p)
                {
                    if(p != k)
                    {
                        // Bounds checks use the running indices i/j, not the
                        // loop-invariant start values, and match the offset
                        // being read (i+1 is a row step, j+1 a column step).
                        if((i + 1 < (curBlob.boundingBoxStartY + curBlob.boundingBoxHeight)) && ((int)arrayOfBinaryValues[channel][k].at<uchar>(i + 1,j) == 1))
                            ++count;

                        if((j + 1 < (curBlob.boundingBoxStartX + curBlob.boundingBoxWidth)) && ((int)arrayOfBinaryValues[channel][k].at<uchar>(i,j + 1) == 1))
                            ++count;
                    }
                }
            }
        }
    }
}

Source code provided is strictly to show the complexity of the algorithm.

If the threads are REALLY waiting, they won't consume much resource at all: just a bit of memory and a few slots of "space" on the scheduler's wait list. (There is a small amount of extra overhead to "wake" or "wait" a thread, since there is a little more data to process, but these queues are usually fairly efficient, so I doubt you'd be able to measure it in an application where the threads do meaningful work.)

Of course, if they periodically wake up, even if it's once a second, 1000 threads that wake up once a second means one context switch every millisecond, and that would potentially affect performance.

I do, however, think that creating MANY threads is the wrong solution in nearly all cases. It may be justified when the logic in each thread is complex and there is a huge amount of per-thread state or context that is not easy to store elsewhere. In most cases, though, a small number of worker threads plus a queue of work items (each including some kind of reference to its state or context) is a better way to achieve this.

Edit based on edit in question:

Since (as far as I can tell) the work is completely bound by CPU (or memory bandwidth) and there is no I/O or other "waiting around", the maximum performance will be achieved by running one thread per core in the system, possibly "minus one" to leave room for other things that need doing, such as network communication, disk I/O, and general OS/system work.

Having more threads than cores may even make the processing SLOWER. If more threads are ready to run than there are cores, the OS has multiple threads "fighting" for time, which costs extra scheduling effort. On top of that, when one thread runs it fills the cache with its data; when another thread is scheduled onto the same core, it evicts that data, and when the "old" thread gets to run again, even on the same core, it has to reload the data it was using.

I will do a quick experiment and come back with some numbers for one of my projects...

So, I have a small project that calculates "weird numbers". I use it here to compare the time it takes to run one thread vs. several. Each thread uses fairly little memory (a few hundred bytes), so cache effects should be negligible; the only variables are the startup cost and the marginal overhead from competition between threads. The number of threads is set by the -t option; -e sets the number to stop at.

$ time ./weird -t 1 -e 50000 > /dev/null

real    0m6.393s
user    0m6.359s
sys 0m0.003s
$ time ./weird -t 2 -e 50000 > /dev/null

real    0m3.210s
user    0m6.376s
sys 0m0.013s
$ time ./weird -t 4 -e 50000 > /dev/null

real    0m1.643s
user    0m6.397s
sys 0m0.024s
$ time ./weird -t 8 -e 50000 > /dev/null

real    0m1.641s
user    0m6.397s
sys 0m0.028s
$ time ./weird -t 16 -e 50000 > /dev/null

real    0m1.644s
user    0m6.385s
sys 0m0.047s
$ time ./weird -t 256 -e 50000 > /dev/null

real    0m1.790s
user    0m6.420s
sys 0m0.342s
$ time ./weird -t 512 -e 50000 > /dev/null

real    0m1.779s
user    0m6.439s
sys 0m0.502s

As you can see, the time to run the whole project improves from 1 to 2 and from 2 to 4 threads, but running more than 4 threads gives almost identical results until we get into the hundreds (I skipped a few steps in doubling the thread count).

Now, to show the scheduling overhead, I increased the count of "numbers to find" with a larger value after -e (this also makes the process run longer, since larger numbers are more expensive to calculate).

$ time ./weird -t 512 -e 100000 > /dev/null

real    0m7.100s
user    0m26.195s
sys 0m1.542s
$ time ./weird -t 4 -e 100000 > /dev/null

real    0m6.663s
user    0m26.143s
sys 0m0.049s

Now, if it were ONLY the startup cost, we should see similar overhead (in sys) between 512 threads going to 50000 and 512 threads going to 100000, but we see roughly three times the number. So, out of 6-7 seconds of runtime, running 512 threads (at full speed) instead of 4 wastes nearly 1.5s of CPU time (about 0.4s per CPU). Sure, it's only about 5%, but 5% of wasted effort is still waste, and there are plenty of cases where a 5% algorithmic improvement is "worth having".

Yes, this is an extreme case, and it could be argued that as long as most threads are waiting, it doesn't really matter.
