How to set number of threads in C++

I have written the following multi-threaded program for multi-threaded sorting using std::sort. In my program, grainSize is a parameter. Since grainSize, or the number of threads that can be spawned, is a system-dependent feature, I am not sure what the optimal value is that I should set grainSize to. I work on Linux.

#include <vector>
#include <algorithm>
#include <future>

using namespace std;

int compare(const char*, const char*)
{
   //some complex user defined logic
}
void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if(len < grainsize) 
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);

        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.

        future.wait();

        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv) {

    vector<unsigned> items;
    int grainSize=10;
    multThreadedSort(items.begin(),items.size(),grainSize);
    std::sort(items.begin(),items.end(),CompareSorter(compare));
    return 0;
}

I need to perform multi-threaded sorting, so that for sorting large vectors I can take advantage of the multiple cores present in today's processors. If anyone is aware of an efficient algorithm, then please do share.

I don't know why the value returned by multThreadedSort() is not sorted. Do you see some logical error in it? If so, please let me know.

This gives you the optimal number of threads (such as the number of cores):

unsigned int nThreads = std::thread::hardware_concurrency();

As you wrote it, your effective thread count is not equal to grainSize: it will depend on the list size, and may potentially be much larger than grainSize.

Just replace grainSize by:

std::size_t grainSize = std::max<std::size_t>(items.size()/nThreads, 40);

The 40 is arbitrary, but it is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting a thread would be larger than the time spent sorting the few items). It may be optimized by trial and error, and is potentially larger than 40.
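
For reference, here is a minimal sketch of how these two lines could be wired into the question's main() (items and multThreadedSort are the names from the question; the fallback value for a zero result of hardware_concurrency() is an arbitrary guess, and <thread> plus <algorithm> are assumed to be included):

unsigned int nThreads = std::thread::hardware_concurrency();
if (nThreads == 0)
    nThreads = 2;  // hardware_concurrency() may legally return 0; the fallback of 2 is an arbitrary guess

// Explicit <std::size_t> avoids the type mismatch between items.size() (size_t) and the literal 40 (int).
std::size_t grainSize = std::max<std::size_t>(items.size() / nThreads, 40);
multThreadedSort(items.begin(), items.size(), grainSize);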

You have at least one bug there:

multThreadedSort(data + len/2, len/2, grainsize);

If len is odd (for instance 9), you do not include the last item in the sort. Replace it with:

multThreadedSort(data + len/2, len-(len/2), grainsize);

Unless you use a compiler with a totally broken implementation (broken is the wrong word; a better match would be... shitty), several invocations of std::async should already do the job for you, without you having to worry.

Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a thread pool where the number of workers is set to something reasonable according to the system the code runs on.
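
As a small illustration of the "may, not must" point, std::async lets you request a policy explicitly. A sketch (work is just a placeholder task):

#include <future>
#include <iostream>

int work() { return 42; } // placeholder task

int main() {
    // std::launch::async requests a new thread; std::launch::deferred
    // runs the task lazily in the thread that calls get()/wait().
    auto eager = std::async(std::launch::async, work);
    auto lazy  = std::async(std::launch::deferred, work);
    std::cout << eager.get() << ' ' << lazy.get() << '\n';
}

Without an explicit policy, as in the question's code, the default is async|deferred and the implementation chooses either behaviour.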

Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you, because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is genuine or a bullshit value.
There is also no way of discriminating hyperthreaded cores, nor is there any such thing as NUMA awareness, or anything of the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
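
Because of that, a common defensive pattern is to treat the return value as a hint only. A sketch (pickThreadCount is a hypothetical helper, and the fallback of 4 is a made-up value, not anything the standard prescribes):

#include <thread>

unsigned int pickThreadCount() {
    unsigned int n = std::thread::hardware_concurrency();
    // The standard explicitly allows 0 ("not computable or well defined").
    return n == 0 ? 4 : n; // 4 is an arbitrary fallback; tune by benchmarking on the target machine
}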

On a more general note

The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). 问题“什么是正确的线程数”很难解决,如果有一个很好的通用答案(我相信没有)。 A couple of things to consider: 需要考虑的几件事:

  1. Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief, that is true for Linux too) and switching or synchronizing threads is expensive as well. Try "tens of thousands", not "tens".
  2. Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches but will not run any faster. For something like sorting (which is all about accessing memory!), you are probably good to go as far as that one goes.
  3. Memory bandwidth is usually saturated by one, sometimes two cores, rarely more (it depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten the pressure on shared cache levels (such as L3 if present, and often L2 as well) and on the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but need not be.
  4. Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation (for one rough, Linux-specific way to count real cores, see the sketch after this list).
  5. Accessing huge amounts of data with poor locality, as your approach does, will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
  6. Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may well be the best match. So... there is really no single "correct" rule.
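
Since the question mentions Linux, one rough way to get at the "number of real cores" (which the standard library does not expose) is to parse /proc/cpuinfo and count unique (physical id, core id) pairs. This is only a sketch under the assumption of the usual x86 layout of that file, not a portable or bulletproof solution:

#include <fstream>
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Sketch: count physical cores by collecting unique (physical id, core id)
// pairs from /proc/cpuinfo. Falls back to 1 if nothing could be parsed.
unsigned countPhysicalCores() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::set<std::pair<int, int>> cores;
    int physicalId = 0;
    std::string line;
    while (std::getline(cpuinfo, line)) {
        std::istringstream value(line.substr(line.find(':') + 1));
        if (line.compare(0, 11, "physical id") == 0) {
            value >> physicalId;
        } else if (line.compare(0, 7, "core id") == 0) {
            int coreId = 0;
            if (value >> coreId)
                cores.insert({physicalId, coreId});
        }
    }
    return cores.empty() ? 1u : static_cast<unsigned>(cores.size());
}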

What is the conclusion of the somewhat self-contradicting points above? After you have implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there is no way of knowing with certitude what is best without having measured.
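
A minimal way to do that measurement is to time both versions on copies of the same data. A sketch using std::chrono (timeIt is a hypothetical helper, and the commented-out usage assumes the names from the question):

#include <chrono>

// Hypothetical helper: run f once and return the elapsed wall-clock time in seconds.
template <class F>
double timeIt(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Usage sketch, assuming copy1 and copy2 hold identical data:
// double tSingle = timeIt([&]{ std::sort(copy1.begin(), copy1.end()); });
// double tMulti  = timeIt([&]{ multThreadedSort(copy2.begin(), copy2.size(), grainSize); });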

As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge, so you seem to be aware that it is not just "split subranges and sort them".

But think about it: what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you will have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That is a lot of traffic.

So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academia it probably is... but it is not at all certain that it runs any faster on a real machine (it might quite possibly be many times slower).

Ideally, you cannot assume more than n*2 threads running in your system, where n is the number of CPU cores.

Modern CPUs use the concept of hyperthreading, so a single CPU core can run 2 threads at a time.

As mentioned in another answer, in C++11 you can get the optimal number of threads using std::thread::hardware_concurrency();
