tbb::parallel_reduce vs tbb::parallel_deterministic_reduce
The Threading Building Blocks (TBB) library provides two functions for performing a reduction over a range: tbb::parallel_reduce and tbb::parallel_deterministic_reduce.

Which of the two should be selected if I want to perform the reduction as fast as possible, but still get exactly the same answer independently of hardware concurrency and the load from other processes or threads? I am basically interested in two scenarios:
And a side question.还有一个附带问题。 On the page about
parallel_deterministic_reduce
there is one warning:在关于
parallel_deterministic_reduce
的页面上有一个警告:
Since
simple_partitioner
does not automatically coarsen ranges, make sure to specify an appropriate grain size由于
simple_partitioner
不会自动粗化范围,请确保指定适当的粒度
Does it mean that the call to parallel_deterministic_reduce
with a range having no explicitly specified grain size will lead to poor performance?这是否意味着在没有明确指定粒度的范围内调用
parallel_deterministic_reduce
会导致性能不佳? How grain size shall be set then?那么粒度应该怎么设置呢?
parallel_reduce does not make any guarantees regarding the summation order. If used with floating-point numbers, the result is not deterministic, since summation of floating-point numbers is not associative. In contrast, parallel_deterministic_reduce guarantees that the summation order is always the same, regardless of the number of threads used. But note that there is still no guarantee of a specific summation order, just that the order is deterministic (for example, the result can differ compared to std::accumulate).
Thus:

- parallel_reduce for best performance.
- parallel_deterministic_reduce if you need deterministic behavior.
Regarding the note that simple_partitioner does not automatically coarsen ranges: I am not entirely sure why they mention this specifically in the documentation of parallel_deterministic_reduce. In all cases where you use simple_partitioner, you should think about an appropriate grain size, not just in the case of parallel_deterministic_reduce. A grain size that is too small can lead to large overhead. See for example here. In practice this especially means measuring the performance for typical workloads, and playing with the grain size or partitioner such that performance is maximized.
I guess they just wanted to highlight the general issue again for parallel_deterministic_reduce, because parallel_deterministic_reduce supports only simple_partitioner and static_partitioner, but neither affinity_partitioner nor auto_partitioner (where the latter is usually the default).