

When performing a calculation - how many threads should I open?

I am writing a program that performs some long computation, which I can break into as many tasks as I want. For the sake of discussion, let's suppose I am writing an algorithm for finding whether or not a number p is prime by trying to divide it by all numbers between 2 and p-1. This task can obviously be broken down across many threads.

I actually wrote a sample app that does just that. As parameters, I give the number I want to check and the number of threads to use (each thread is given an equally sized range of numbers to try to divide p by - together they cover the entire range).

My machine has 8 cores. I started running the program with a large number that I know is prime (2971215073), with 1, 2, 3 threads and so on up to 8 threads - each time the program ran faster than before, which was what I expected. However, when I tried more than 8 threads, the computation time actually kept getting smaller (even if only by a little)!

There's no I/O or anything like that in my threads, just pure CPU computation. I was expecting the run time to get worse past 8 threads, as there would be more context switching while the number of threads actually running in parallel stays at 8. It is hard to say where the peak is, since the differences are very small and change from one run to another, but it is clear that, e.g., 50 threads somehow run faster than 8 (by ~300 ms)...

My guess is that since I have so many threads, I get more running time: I take up a larger portion of the system's thread pool, so my threads get selected more often. However, it doesn't seem to make sense that the more threads I create, the faster the program runs (otherwise why wouldn't everyone create 1000 threads?).

Can anyone offer an explanation, and perhaps a best practice for how many threads to create relative to the number of cores on the machine?

Thanks.


My code, for whoever's interested (compiled on Windows, VS2012):

#include <Windows.h>
#include <conio.h>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

typedef struct
{
    unsigned int primeCandidate;
    unsigned int rangeStart;
    unsigned int rangeEnd;
} param_t;


DWORD WINAPI isDivisible(LPVOID p)
{
    param_t* param = reinterpret_cast<param_t*>(p);

    for (unsigned int d = param->rangeStart; d < param->rangeEnd; ++d)
    {
        if (param->primeCandidate % d == 0)
        {
            cout << param->primeCandidate << " is divisible by " << d << endl;
            return 1;
        }
    }

    return 0;
}

bool isPrime(unsigned int primeCandidate, unsigned int numOfCores)
{
    vector<HANDLE> handles(numOfCores);
    vector<param_t> params(numOfCores);
    for (unsigned int i = 0; i < numOfCores; ++i)
    {
        params[i].primeCandidate = primeCandidate;
        params[i].rangeStart = (primeCandidate - 2) * (static_cast<double>(i) / numOfCores) + 2;
        params[i].rangeEnd = (primeCandidate - 2) * (static_cast<double>(i+1) / numOfCores) + 2;
        // isDivisible already has the LPTHREAD_START_ROUTINE signature, so no cast is needed
        HANDLE h = CreateThread(nullptr, 0, isDivisible, &params[i], 0, nullptr);
        if (NULL == h)
        {
            cout << "ERROR creating thread: " << GetLastError() << endl;
            throw exception();
        }
        handles[i] = h;
    }

    DWORD ret = WaitForMultipleObjects(numOfCores, handles.data(), TRUE, INFINITE);
    if (ret > WAIT_OBJECT_0 + numOfCores - 1)
    {
        cout << "ERROR waiting on threads: " << ret << endl;
        throw exception();
    }

    bool prime = true;
    for (unsigned int i = 0; i < numOfCores; ++i)
    {
        DWORD exitCode = -1;
        if (0 == GetExitCodeThread(handles[i], &exitCode))
        {
            cout << "Failed to get thread's exit code: " << GetLastError() << endl;
            throw exception();
        }

        if (1 == exitCode)
        {
            prime = false;
        }

        CloseHandle(handles[i]); // don't leak the thread handles
    }

    return prime;
}
}

int main()
{
    unsigned int primeCandidate = 1;
    unsigned int numOfCores = 1;

    cout << "Enter prime candidate: ";
    cin >> primeCandidate;
    cout << "Enter # of cores (0 means all): ";
    cin >> numOfCores;
    while (primeCandidate > 0)
    {
        if (0 == numOfCores) numOfCores = thread::hardware_concurrency();

        DWORD start = GetTickCount();
        bool res = isPrime(primeCandidate, numOfCores);
        DWORD end = GetTickCount();
        cout << "Time: " << end-start << endl;
        cout << primeCandidate << " is " << (res ? "" : "not ") << "prime!" << endl;

        cout << "Enter prime candidate: ";
        cin >> primeCandidate;
        cout << "Enter # of cores (0 means all): ";
        cin >> numOfCores;
    }

    return 0;
}

Yes. Here is a small extract of some tests I did on my i7/Vista 64 box (4 'real' cores plus hyperthreading):

8 tests,
400 tasks,
counting to 10000000,
using 8 threads:
Ticks: 2199
Ticks: 2184
Ticks: 2215
Ticks: 2153
Ticks: 2200
Ticks: 2215
Ticks: 2200
Ticks: 2230
Average: 2199 ms

8 tests,
400 tasks,
counting to 10000000,
using 32 threads:
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2138
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2137
Average: 2137 ms

.. showing that, as in your tests, an 'over-subscription' of threads does result in a marginal 2-3% improvement in overall execution time. My tests submitted simple 'count up an integer' CPU-intensive tasks to a threadpool with varying numbers of threads.
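The shape of that harness, reconstructed here in portable C++ (my original used a home-grown Windows threadpool; the names and the shared atomic task counter are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sketch of the benchmark: divide `tasks` identical 'count up an integer'
// busy-loops among `numThreads` workers pulling from a shared task counter.
// Returns a checksum (tasks * countTo) so the loops can't be optimized away;
// time the whole call from outside, as in the Ticks figures above.
long long runBatch(unsigned numThreads, unsigned tasks, unsigned countTo)
{
    std::atomic<unsigned> nextTask(0);
    std::atomic<long long> checksum(0);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
        workers.emplace_back([&]() {
            while (nextTask.fetch_add(1) < tasks)
            {
                volatile unsigned counter = 0;   // volatile: keep the busy-loop honest
                while (counter < countTo) ++counter;
                checksum.fetch_add(counter);
            }
        });

    for (auto& w : workers) w.join();
    return checksum.load();
}
```

With `tasks` fixed at 400 and `countTo` at 10000000, only `numThreads` was varied between runs.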

My conclusion at the time was that the minor improvement came about because the larger number of threads took up a larger percentage of the 'base load' on my box - the 1-4% of load from the few of the 1000-odd threads in the nearly-always-idle Firefox, uTorrent, Word, taskbar etc. that happened to run a bit during the tests.

It would appear that, in my test, the 'context switching overhead' from using, say, 64 threads instead of 8 is negligible and can be ignored.

This only applies when the data used by the tasks is very small, though. I later repeated a similar batch of tests where the tasks used an 8K array - the size of the L1 cache. In this 'worst case' scenario, using more threads than cores resulted in a very noticeable slowdown; by 16 threads and above, performance dropped by 40% as the threads swapped the whole cache in and out. Above about 20 threads, the slowdown got no worse, since no matter how many threads ran the tasks, the cache still got swapped out of every core at the same rate.
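For reference, the per-task kernel in that second batch looked roughly like this (the 8 KB buffer is assumed to match the L1 data cache; the name and sizes are illustrative):

```cpp
#include <numeric>
#include <vector>

// Each task repeatedly walks a private 8 KB buffer. With more threads than
// cores, these buffers evict each other from L1 on every context switch.
long long cacheTask(int passes)
{
    std::vector<unsigned char> buf(8 * 1024, 1);   // roughly L1-data-cache sized
    long long sum = 0;
    for (int i = 0; i < passes; ++i)
        sum += std::accumulate(buf.begin(), buf.end(), 0LL);
    return sum;   // passes * 8192 with this fill value
}
```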

Note also that I had plenty of RAM and so very few page faults.

You're assuming that each thread has an equal amount of work to perform, which may not actually be the case. What you should look at is the exit time of each of your threads. If one or more of them exits significantly earlier than the rest, it makes sense that adding more threads will speed things up. That is, if one stops early, a core would otherwise sit unused; having extra threads breaks up the load more fairly.

There are several reasons why each thread may take a different amount of execution time. I don't know the underlying instruction timings of your code, but perhaps they are variable. It's also likely that each thread sees a different set of CPU optimizations in effect, like branch prediction. One thread may simply lose its timeslice to the OS, or be momentarily stalled on its tiny amount of memory. Suffice it to say there are numerous factors which could make one slower than another.

Which count is best is hard to say. In general you'd like to keep the CPUs loaded, so you are generally correct about N threads for N cores. However, be aware of things like hyperthreading, where you don't actually have extra cores - unless you have a lot of memory use, which you don't, the hyperthreading will just get in the way. On AMD's newer chips there are half as many FPUs, so your integer instructions are fine, but floating point could stall.

If you wish to keep every CPU loaded, the only way to really do it is with a job-based framework. Break your calculation into smaller units (as you do), but still have only one thread per core. As a thread finishes its current job, it takes the next available job. This way it doesn't matter if some jobs are longer or shorter; the freed-up CPUs just move on to the next job.
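For the primality example above, a minimal job-based sketch in standard C++ might look like this: one worker per core, each repeatedly claiming the next fixed-size chunk of divisors from a shared atomic counter. The chunk size and names are illustrative, not a tuned implementation:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// One worker per core; each claims the next chunk of candidate divisors
// from an atomic counter, so no core idles while work remains.
bool isPrimeJobs(unsigned int p, unsigned int numThreads,
                 unsigned int chunkSize = 4096)
{
    if (p < 2) return false;

    std::atomic<unsigned int> next(2);      // next divisor to hand out
    std::atomic<bool> divisible(false);

    auto worker = [&]() {
        for (;;)
        {
            unsigned int start = next.fetch_add(chunkSize);
            if (start >= p || divisible.load()) return;   // no work left / answer known
            unsigned int end = (p - start > chunkSize) ? start + chunkSize : p;
            for (unsigned int d = start; d < end; ++d)
                if (p % d == 0) { divisible.store(true); return; }
        }
    };

    std::vector<std::thread> workers;
    for (unsigned int i = 0; i < numThreads; ++i)
        workers.emplace_back(worker);
    for (auto& w : workers) w.join();

    return !divisible.load();
}
```

Because a worker that finds a factor sets the flag and exits, and the others stop at their next chunk boundary, an early hit no longer strands a core the way a fixed per-thread range split can.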

This of course only makes sense if the calculation is long. If the total time is only a few seconds, the overhead of the jobs might cause a slight slowdown. But even starting at 4-5 seconds you should start seeing gains. Also, make sure you turn off CPU frequency scaling when doing small timing tests; otherwise the ramp-up/down times on each CPU will basically give you random results.

Note: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 