How to set number of threads in C++
I have written the following multi-threaded program for multi-threaded sorting using std::sort. In my program, grainSize is a parameter. Since grainSize (or the number of threads that can be spawned) is a system-dependent feature, I am not sure what the optimal value is that I should set grainSize to. I work on Linux.
#include <algorithm>
#include <future>
#include <vector>
using namespace std;

bool compare(unsigned a, unsigned b)
{
    // some complex user-defined logic
}

void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if (len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);
        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread, which would do nothing.
        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv)
{
    vector<unsigned> items;
    int grainSize = 10;
    multThreadedSort(items.begin(), items.size(), grainSize);
    std::sort(items.begin(), items.end(), CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting, so that for sorting large vectors I can take advantage of the multiple cores present in today's processors. If anyone is aware of an efficient algorithm, then please do share.

I don't know why the value returned by multThreadedSort() is not sorted. If you see some logical error in it, then please let me know about the same.
This gives you the optimal number of threads (such as the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
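One caveat worth handling up front: the standard allows hardware_concurrency() to return 0 when the value is not computable, so a small guard is worth adding. A minimal sketch; the fallback value of 2 here is an arbitrary assumption, not anything the standard prescribes:

```cpp
#include <thread>

unsigned int workerCount()
{
    unsigned int n = std::thread::hardware_concurrency();
    return n == 0 ? 2 : n; // 0 means "could not determine"; 2 is an arbitrary fallback
}
```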
As you wrote it, your effective thread number is not equal to grainSize: it will depend on the list size, and will potentially be much more than grainSize.
Just replace grainSize by:

unsigned int grainSize = std::max<size_t>(items.size()/nThreads, 40);
The 40 is arbitrary, but is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting the thread would be larger than the time spent sorting the few items). It may be optimized by trial-and-error, and is potentially larger than 40.
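Putting the two pieces together, the computation might be wrapped up as follows. This is only a sketch; pickGrainSize is a hypothetical helper name, and the lower bound of 40 is the arbitrary constant discussed above:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: derive the chunk size from the input size and core count.
std::size_t pickGrainSize(std::size_t itemCount, unsigned int nThreads)
{
    if (nThreads == 0)      // hardware_concurrency() may legally return 0
        nThreads = 1;
    // 40 is the arbitrary lower bound; tune it by benchmarking.
    return std::max<std::size_t>(itemCount / nThreads, 40);
}
```

It would be called as pickGrainSize(items.size(), std::thread::hardware_concurrency()) before invoking multThreadedSort.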
You have at least a bug there:

multThreadedSort(data + len/2, len/2, grainsize);

If len is odd (for instance 9), you do not include the last item in the sort. Replace by:

multThreadedSort(data + len/2, len-(len/2), grainsize);
Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::future should already do the job for you, without having to worry.
Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you, because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
The problem "What is the correct number of threads?" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge, so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging, which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).
Ideally, you cannot assume more than n*2 threads running in your system, where n is the number of CPU cores. Modern CPUs support Hyperthreading, so one core can run 2 threads at a time.
As mentioned in another answer, in C++11 you can get the optimal number of threads using std::thread::hardware_concurrency();