
tbb::parallel_reduce vs tbb::parallel_deterministic_reduce

The Threading Building Blocks (TBB) library provides two functions for performing a reduction over a range: tbb::parallel_reduce and tbb::parallel_deterministic_reduce.

Which one of the two should I select if I want to perform the reduction as fast as possible, but still get exactly the same answer independently of hardware concurrency and the load from other processes or threads? I am basically interested in two scenarios:

  1. Computing the sum of elements in an integer-valued vector.
  2. Computing the sum of elements in a floating-point-valued vector.

And a side question. On the page about parallel_deterministic_reduce there is one warning:

Since simple_partitioner does not automatically coarsen ranges, make sure to specify an appropriate grain size.

Does it mean that calling parallel_deterministic_reduce with a range that has no explicitly specified grain size will lead to poor performance? How should the grain size be set then?

parallel_reduce does not make any guarantees regarding the summation order. If used with floating-point numbers, the result is not deterministic, since the summation of floating-point numbers is not associative. In contrast, parallel_deterministic_reduce guarantees that the summation order is always the same, regardless of the number of threads used. But note that there is still no guarantee of a specific summation order, just that the order is deterministic (for example, the result can differ from that of std::accumulate).
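
For instance, the classic 0.1 + 0.2 + 0.3 case illustrates the non-associativity (a generic C++ snippet, not tied to TBB):

```cpp
#include <cstdio>

int main() {
    double a = 0.1, b = 0.2, c = 0.3;
    // Regrouping the same three addends changes how intermediate results are rounded.
    std::printf("%.17g\n", (a + b) + c);  // prints 0.60000000000000009
    std::printf("%.17g\n", a + (b + c));  // prints 0.59999999999999998
    return 0;
}
```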

Thus:

  • In the case of integers, you should use parallel_reduce for best performance.
  • For floating-point numbers, you should use parallel_deterministic_reduce if you need deterministic behavior; a sketch of both cases follows after this list.
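
A minimal sketch of both recommendations, assuming oneTBB headers and taking the input as a std::vector (the function names and the parameter v are placeholders for illustration):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <cstddef>
#include <functional>
#include <vector>

// Integer sum: the order of additions does not affect the result,
// so plain parallel_reduce gives the best performance.
long long sum_ints(const std::vector<int>& v) {
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, v.size()),
        0LL,
        [&](const tbb::blocked_range<std::size_t>& r, long long acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                acc += v[i];
            return acc;
        },
        std::plus<long long>());
}

// Floating-point sum: parallel_deterministic_reduce reproduces the same
// result on every run (which may still differ from std::accumulate).
double sum_doubles(const std::vector<double>& v) {
    return tbb::parallel_deterministic_reduce(
        tbb::blocked_range<std::size_t>(0, v.size()),
        0.0,
        [&](const tbb::blocked_range<std::size_t>& r, double acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                acc += v[i];
            return acc;
        },
        std::plus<double>());
}
```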

Regarding the note that simple_partitioner does not automatically coarsen ranges: I am not entirely sure why they mention this specifically in the documentation of parallel_deterministic_reduce. In all cases where you use simple_partitioner, you should think about appropriate grain sizes, not just in the case of parallel_deterministic_reduce. A grain size that is too small can lead to a large overhead; see for example here. In practice this especially means measuring the performance for typical workloads and playing with the grain size or the partitioner so that performance is maximized. I guess they just wanted to highlight the general issue again for parallel_deterministic_reduce because it supports only simple_partitioner and static_partitioner, but neither affinity_partitioner nor auto_partitioner (the latter usually being the default).
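
As a sketch of the grain-size point: blocked_range takes an optional grain-size argument, and parallel_deterministic_reduce can also be passed a simple_partitioner (its default) or static_partitioner explicitly. The concrete grain value below is only a placeholder to be tuned by measurement:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <tbb/partitioner.h>
#include <cstddef>
#include <functional>
#include <vector>

double sum_doubles_coarsened(const std::vector<double>& v) {
    // simple_partitioner splits the range all the way down to the grain size,
    // so the default grain size of 1 would create one tiny task per element.
    const std::size_t grain = 10000;  // placeholder value; tune by measuring
    return tbb::parallel_deterministic_reduce(
        tbb::blocked_range<std::size_t>(0, v.size(), grain),
        0.0,
        [&](const tbb::blocked_range<std::size_t>& r, double acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                acc += v[i];
            return acc;
        },
        std::plus<double>(),
        tbb::simple_partitioner());  // the default partitioner for this algorithm
}
```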
