Parallel reduction of an array on CPU

Question

Is there a way to do a parallel reduction of an array on the CPU in C/C++? I recently learnt that it's not possible using OpenMP. Any other alternatives?

Answer 1

Added: Note that you can implement a "custom" reduction with OpenMP, in the way described here.
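The linked description is not reproduced here, so the following is only a minimal sketch of one common way to hand-write an array reduction in OpenMP: each thread accumulates into a private array, and the private arrays are merged inside a critical section. The function name weighted_sum_omp and the flat row-major layout of C are assumptions made for illustration, not part of the original answer. (Compilers implementing OpenMP 4.5 or later also accept array sections in the reduction clause for C/C++, e.g. reduction(+:y[:M]), which may remove the need for this pattern.)

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original answer): computes
// y[j] = sum over i of C[i*M + j] * x[i] with a hand-written array reduction.
// C is assumed to be an N x M matrix stored row-major in a flat vector.
std::vector<double> weighted_sum_omp(const std::vector<double>& C,
                                     const std::vector<double>& x,
                                     std::size_t M) {
  const long N = static_cast<long>(x.size());
  std::vector<double> y(M, 0.0);

  #pragma omp parallel
  {
    // Each thread accumulates into its own private copy of the result.
    std::vector<double> local(M, 0.0);

    #pragma omp for nowait
    for (long i = 0; i < N; ++i)
      for (std::size_t j = 0; j < M; ++j)
        local[j] += C[i * M + j] * x[i];

    // Merge the private copies into the shared result, one thread at a time.
    #pragma omp critical
    for (std::size_t j = 0; j < M; ++j)
      y[j] += local[j];
  }
  return y;
}
```

If the code is compiled without OpenMP support, the pragmas are ignored and the function simply runs serially with the same result.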

For C++: with parallel_reduce in Intel's TBB (SO tag: tbb), you can perform reductions on complex types such as arrays and structs, though the amount of required code can be significantly larger than with OpenMP's reduction clause.

As an example, let's parallelize a naive implementation of matrix-vector multiplication, y=Cx (with the arrays as declared below, the loops strictly compute y = Cᵀx, which does not affect the argument). The serial code consists of two loops:

double x[N], y[M], C[N][M];
// assume x and C are initialized, and y consists of zeros
for(int i=0; i<N; ++i)
  for(int j=0; j<M; ++j) 
    y[j] += C[i][j]*x[i];

Usually, to parallelize it, the loops are exchanged so that the outer loop iterations become independent and can be processed in parallel:

#pragma omp parallel for
for(int j=0; j<M; ++j) 
  for(int i=0; i<N; ++i)
    y[j] += C[i][j]*x[i];

However, this is not always a good idea. If M is small and N is large, swapping the loops won't give enough parallelism (for example, think of calculating a weighted centroid of N points in M-dimensional space, with C being the array of points and x being the array of weights). So a reduction over an array (i.e. a point) would be helpful. Here is how it can be done with TBB (sorry, the code was not tested; errors are possible):

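#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// M and N are assumed to be compile-time constants
struct reduce_body {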
  double y_[M]; // accumulating vector
  double (& C_)[N][M]; // reference to a matrix
  double (& x_)[N];    // reference to a vector

  reduce_body( double (&C)[N][M], double (&x)[N] )  : C_(C), x_(x) {
    for (int j=0; j<M; ++j) y_[j] = 0.0; // prepare for accumulation
  }
  // splitting constructor required by TBB
  reduce_body( reduce_body& rb, tbb::split ) : C_(rb.C_), x_(rb.x_) { 
    for (int j=0; j<M; ++j) y_[j] = 0.0;
  }
  // the main computation method
  void operator()(const tbb::blocked_range<int>& r) {
    // closely resembles the original serial loop
    for (int i=r.begin(); i<r.end(); ++i) // iterates over a subrange in [0,N)
      for (int j=0; j<M; ++j)
        y_[j] += C_[i][j]*x_[i];
  }
  // the method to reduce computations accumulated in two bodies
  void join( reduce_body& rb ) {
    for (int j=0; j<M; ++j) y_[j] += rb.y_[j];
  }
};
double x[N], y[M], C[N][M];
...
reduce_body body(C, x);
tbb::parallel_reduce(tbb::blocked_range<int>(0,N), body);
for (int j=0; j<M; ++j)
  y[j] = body.y_[j]; // copy to the destination array
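For readers without TBB, the same split / accumulate / join structure can be sketched with the standard library alone. This is not what TBB does internally (TBB splits ranges recursively and balances work dynamically); it is a simplified illustration under stated assumptions: a fixed number of threads, equal-sized chunks, and C stored row-major in a flat vector. The name weighted_sum_threads is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not part of TBB): computes
// y[j] = sum over i of C[i*M + j] * x[i] using one private accumulator
// per thread, then folds the accumulators together.
std::vector<double> weighted_sum_threads(const std::vector<double>& C,
                                         const std::vector<double>& x,
                                         std::size_t M,
                                         unsigned num_threads) {
  const std::size_t N = x.size();
  num_threads = std::max(1u, num_threads);

  // One private accumulator per worker, playing the role of reduce_body::y_.
  std::vector<std::vector<double>> partial(num_threads,
                                           std::vector<double>(M, 0.0));
  std::vector<std::thread> workers;
  const std::size_t chunk = (N + num_threads - 1) / num_threads;

  for (unsigned t = 0; t < num_threads; ++t) {
    // Static split of [0, N) into contiguous subranges (may be empty).
    const std::size_t begin = std::min(N, t * chunk);
    const std::size_t end = std::min(N, begin + chunk);
    workers.emplace_back([&C, &x, &partial, M, t, begin, end] {
      std::vector<double>& acc = partial[t];
      for (std::size_t i = begin; i < end; ++i)
        for (std::size_t j = 0; j < M; ++j)
          acc[j] += C[i * M + j] * x[i];
    });
  }
  for (std::thread& w : workers) w.join();

  // Sequential counterpart of reduce_body::join: fold the partial arrays.
  std::vector<double> y(M, 0.0);
  for (const std::vector<double>& acc : partial)
    for (std::size_t j = 0; j < M; ++j)
      y[j] += acc[j];
  return y;
}
```

A call such as weighted_sum_threads(C, x, M, std::thread::hardware_concurrency()) would use one worker per hardware thread; the final fold costs O(M × threads), which is negligible when M is small and N is large.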

Disclaimer: I am affiliated with TBB.

Answer 1 (7, accepted, 2012-02-22 19:32:57)
