How do the C++ STL (ExecutionPolicy) algorithms determine how many parallel threads to use?
C++17 upgraded 69 STL algorithms to support parallelism through an optional ExecutionPolicy parameter (passed as the first argument), e.g.:
std::sort(std::execution::par, begin(v), end(v));
I suspect the C++17 standard deliberately says nothing about how to implement the multi-threaded algorithms, leaving it up to the library writers to decide what is best (and allowing them to change their minds later). Still, I'm keen to understand at a high level what issues are being considered in the implementation of the parallel STL algorithms.
Some questions on my mind include (but are not limited to!):
I realise the point of these parallel algorithms is to shield the programmer from having to worry about these details. However, any info that gives me a high-level mental picture of what's going on inside the library calls would be appreciated.
Most of these questions cannot be answered by the standard as of today. However, your question, as I understand it, mixes two concepts:
C1. Constraints on parallel algorithms
C2. Execution of algorithms
The whole C++17 parallel STL is about C1: it sets constraints on how instructions and/or threads may be interleaved or transformed in a parallel computation. C2, on the other hand, is still in the process of being standardized; the keyword is executor (more on this later).
For C1, there are 3 standard policies (std::execution::seq, par and par_unseq) that correspond to the combinations of task (thread-level) and instruction (vector-level) parallelism: none, threads only, and threads plus vectorization. For example, when performing an integer accumulation, par_unseq could be used, since the order is not important. However, for floating-point arithmetic, where addition is not associative, a better fit would be seq to, at least, get a deterministic result. In short: policies set constraints on parallel computation, and these constraints could potentially be exploited by a smart compiler.
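To make the floating-point point concrete, here is a minimal single-threaded sketch. It does not use any execution policy; the helper chunked_sum is hypothetical and only mimics the regrouping that a parallel reduction is allowed to perform, so the difference in results comes purely from the order of the additions.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Plain left-to-right sum: the grouping is fixed, so the result is reproducible.
double sequential_sum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Hypothetical helper (not a library API): sums `chunks` contiguous pieces
// independently and then adds the partial sums together. This imitates the
// grouping a parallel reduction might choose, but runs on a single thread.
double chunked_sum(const std::vector<double>& v, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = v.size();
    double total = 0.0;
    for (std::size_t c = 0; c < chunks; ++c) {
        const auto first = static_cast<std::ptrdiff_t>(n * c / chunks);
        const auto last = static_cast<std::ptrdiff_t>(n * (c + 1) / chunks);
        total += std::accumulate(v.begin() + first, v.begin() + last, 0.0);
    }
    return total;
}
```

With the input {1e16, 1.0, -1e16, 1.0}, sequential_sum gives 1.0 while chunked_sum with 2 chunks gives 0.0, because each 1.0 is rounded away when it is added to a value of magnitude 1e16. With integers the two groupings would always agree, which is why the order does not matter there.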
On the other hand, once you have a parallel algorithm and its constraints (and possibly after some optimization/transformation), the executor will find a way to execute it. There are default executors (for the CPU, for example), or you can create your own; all the configuration regarding number of threads, workload, processing unit, etc. can then be set there.
As of today, C1 is in the standard but C2 is not, so if you use C1 with a compliant compiler you will not be able to specify which execution profile you want; the library implementation will decide for you (or perhaps offer control through extensions).
So, to address your questions:
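Since the thread count is left to the implementation, it may help to picture the kind of heuristic a library could apply. The sketch below is purely illustrative and is not taken from any real standard library: the function pick_thread_count, its min_per_thread parameter and the default of 1000 elements per thread are all assumptions. The only standard facility it relies on is std::thread::hardware_concurrency(), which returns a hint and may return 0 when the value is unknown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical heuristic: use at most `hw` threads, and never so many that a
// thread would get fewer than `min_per_thread` elements of the `n` to process.
std::size_t pick_thread_count(std::size_t n, std::size_t min_per_thread,
                              std::size_t hw) {
    assert(min_per_thread > 0);
    if (hw == 0) hw = 1;  // hardware_concurrency() may report 0 (unknown)
    const std::size_t by_work = std::max<std::size_t>(1, n / min_per_thread);
    return std::min(hw, by_work);
}

// Convenience overload that queries the hardware hint; the 1000-element
// threshold is an arbitrary illustrative value, not a recommendation.
std::size_t pick_thread_count(std::size_t n) {
    return pick_thread_count(n, 1000, std::thread::hardware_concurrency());
}
```

For example, pick_thread_count(5000, 1000, 8) yields 5, whereas a small input such as pick_thread_count(100, 1000, 8) yields 1, i.e. not worth parallelising under this rule. A real implementation may use a thread pool, work stealing or something else entirely, so its documentation remains the authority.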
(Regarding your first 5 questions) By definition, the C++17 parallel STL does not define any computation, just data dependencies, in order to allow for possible data-flow transformations. All these questions will (hopefully) be answered by executor; you can see the current proposal here. It will look something like:
executor = get_executor();
sort( std::execution::par.on(executor), vec.begin(), vec.end());
Some of your questions are already addressed in that proposal.
(For the 6th) There are a number of libraries out there that already implement similar concepts (the C++ executor was indeed inspired by some of them); AFAIK: hpx, Thrust or Boost.Compute. I do not know how the last two are actually implemented, but hpx uses lightweight threads and lets you configure the execution profile. Also, the expected (not yet standardized) syntax of the code above is essentially the same as in (and was heavily inspired by) hpx.
The pre-final C++17 draft says nothing about "how to implement the multi-threaded algorithms", that's true. Implementation owners decide on their own how to do that. E.g. Parallel STL uses TBB as a threading back-end and OpenMP as a vectorization back-end. I guess that to find out how this implementation maps onto your machine, you need to read the implementation-specific documentation.