
How do the C++ STL (ExecutionPolicy) algorithms determine how many parallel threads to use?

C++17 upgraded 69 STL algorithms to support parallelism, by the use of an optional ExecutionPolicy parameter (as the first argument), e.g.

std::sort(std::execution::par, begin(v), end(v));
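
For reference, a self-contained sketch of such a call is below (the data and sizes are placeholders; with GCC/libstdc++ the parallel policies are typically backed by TBB, so linking with -ltbb may be required):

#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main() {
    // Fill a vector with pseudo-random data, then sort it in parallel.
    std::vector<double> v(1'000'000);
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (auto& x : v) x = dist(gen);

    std::sort(std::execution::par, v.begin(), v.end());
}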

I suspect the C++17 standard deliberately says nothing about how to implement the multi-threaded algorithms, leaving it up to the library writers to decide what is best (and allowing them to change their minds later). Still, I'm keen to understand at a high level what issues are being considered in the implementation of the parallel STL algorithms.

Some questions on my mind include (but are not limited to!):

  • How is the maximum number of threads used (by the C++ application) related to the number of CPU and/or GPU cores on the machine? (See the small probe sketched after this list.)
  • What differences are there in the number of threads each algorithm uses? (Will each algorithm always use the same number of threads in every circumstance?)
  • Is any consideration given to other parallel STL calls on other threads (within the same app)? (E.g. if a thread invokes std::for_each(par, ...), will it use more/fewer/the same number of threads depending on whether a std::sort(par, ...) is already running on some other thread(s)? Is there a thread pool, perhaps?)
  • Is any consideration given to how busy the cores are due to external factors? (E.g. if one core is very busy, say analysing SETI signals, will the C++ application reduce the number of threads it uses?)
  • Do some algorithms use only CPU cores, or only GPU cores?
  • I suspect implementations will vary from library to library (compiler to compiler?); even details about this would be interesting.
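
Regarding the first question, the only portable, standard-level hint about the machine is std::thread::hardware_concurrency(); whether a given library implementation actually consults it (or asks its threading back-end instead) is up to that implementation. A minimal probe:

#include <iostream>
#include <thread>

int main() {
    // Number of concurrent threads the hardware supports (0 if unknown).
    // Implementations of the parallel algorithms are free to use this hint,
    // a back-end scheduler's own detection, or something else entirely.
    std::cout << std::thread::hardware_concurrency() << '\n';
}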

I realise the point of these parallel algorithms is to shield the programmer from having to worry about these details. However, any info that gives me a high-level mental picture of what's going on inside the library calls would be appreciated.

Most of these questions cannot be answered by the standard as of today. However, your question, as I understand it, mixes two concepts:

C1. Constraints on parallel algorithms

C2. Execution of algorithms

The whole C++17 parallel STL effort is about C1: it sets constraints on how instructions and/or threads may be interleaved/transformed in a parallel computation. C2, on the other hand, is in the process of being standardized; the keyword is executor (more on this later).

For C1, there are 3 standard policies (std::execution::seq, par and par_unseq) that correspond to every combination of task and instruction parallelism. For example, when performing an integer accumulation, par_unseq could be used, since the order is not important. However, for floating-point arithmetic, where addition is not associative, seq would be a better fit to, at least, get a deterministic result. In short: policies set constraints on parallel computation, and these constraints could potentially be exploited by a smart compiler.
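
A minimal sketch of the kind of choice described above (the values are placeholders; for a strictly left-to-right, reproducible floating-point sum the classic sequential std::accumulate is shown here):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<int>    vi(1'000'000, 1);
    std::vector<double> vd(1'000'000, 0.1);

    // Integer addition is associative and commutative, so the most
    // permissive policy (parallel + vectorized) is safe.
    long si = std::reduce(std::execution::par_unseq, vi.begin(), vi.end(), 0L);

    // Floating-point addition is not associative; a plain sequential
    // left-to-right fold gives a reproducible result.
    double sd = std::accumulate(vd.begin(), vd.end(), 0.0);

    // Sorting only needs a strict weak ordering, so par is typically fine.
    std::sort(std::execution::par, vd.begin(), vd.end());

    return (si > 0 && sd > 0) ? 0 : 1;
}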

On the other hand, once you have a parallel algorithm and its constraints (and possibly after some optimization/transformation), the executor will find a way to execute it. There are default executors (for the CPU, for example), or you can create your own; that is where all the configuration regarding the number of threads, workload, processing unit, etc. can be set.

As of today, C1 is in the standard but C2 is not, so if you use C1 with a compliant compiler you will not be able to specify which execution profile you want, and the library implementation will decide for you (maybe through extensions).

So, to address your questions:

(Regarding your first 5 questions) By definition, the C++17 parallel STL does not define any computation, just data dependencies, in order to allow for possible data-flow transformations. All these questions will (hopefully) be answered by executors; you can see the current proposal here. It will look something like:

auto executor = get_executor();
std::sort(std::execution::par.on(executor), vec.begin(), vec.end());

Some of your questions are already addressed in that proposal.

(For the 6th) There are a number of libraries out there that already implement similar concepts (the C++ executor proposal was indeed inspired by some of them), AFAIK: hpx, Thrust and Boost.Compute. I do not know how the last two are actually implemented, but hpx uses lightweight threads and lets you configure the execution profile. Also, the expected (not yet standardized) C++17 syntax of the code above is essentially the same as (and was heavily inspired by) that of hpx.

References:

  1. C++17 Parallel Algorithms and Beyond, by Bryce Adelstein Lelbach
  2. The future of ISO C++ Heterogeneous Computing, by Michael Wong
  3. Keynote: C++ executors to enable heterogeneous computing in tomorrow's C++ today, by Michael Wong
  4. Executors for C++ - A Long Story, by Detlef Vollmann

The pre-final C++17 draft says nothing about "how to implement the multi-threaded algorithms", that's true. Implementation owners decide on their own how to do that. E.g. Parallel STL uses TBB as a threading back-end and OpenMP as a vectorization back-end. I guess that to find out how this implementation matches your machine, you need to read the implementation-specific documentation.
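
As a concrete, implementation-specific (and non-standard) illustration: if your standard library's parallel algorithms are backed by TBB, TBB's own process-wide controls are one way to cap how many worker threads they can use. A hedged sketch, assuming oneTBB is installed:

#include <algorithm>
#include <execution>
#include <vector>
#include <tbb/global_control.h>   // oneTBB; not part of the C++ standard

int main() {
    std::vector<int> v(1'000'000);
    // ... fill v ...

    // Cap the TBB scheduler at 4 worker threads for the lifetime of `gc`.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 4);

    // If (and only if) the library forwards std::execution::par to TBB,
    // this sort will now use at most 4 threads.
    std::sort(std::execution::par, v.begin(), v.end());
}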
