Using OpenMP "for simd" in matrix-vector multiplication?

Question

I'm trying to make my matrix-vector multiplication function competitive with BLAS by combining #pragma omp for with #pragma omp simd, but I get no speedup over using the for construct alone. How do I properly vectorize the inner loop with OpenMP's SIMD construct?

vector dot(const matrix& A, const vector& x)
{
  assert(A.shape(1) == x.size());

  vector y = xt::zeros<double>({A.shape(0)});

  int i, j;
#pragma omp parallel shared(A, x, y) private(i, j)
  {
#pragma omp for // schedule(static)
    for (i = 0; i < y.size(); i++) { // row major
#pragma omp simd
      for (j = 0; j < x.size(); j++) {
        y(i) += A(i, j) * x(j);
      }
    }
  }

  return y;
}

Answer 1

Your directive is incorrect because it introduces a race condition on y(i): the simd construct lets several iterations of the inner loop run concurrently in SIMD lanes, and all of them update the same element. You should use a reduction in this case. Here is an example:

vector dot(const matrix& A, const vector& x)
{
  assert(A.shape(1) == x.size());

  vector y = xt::zeros<double>({A.shape(0)});

  int i, j;

  #pragma omp parallel shared(A, x, y) private(i, j)
  {
    #pragma omp for // schedule(static)
    for (i = 0; i < y.size(); i++) { // row major
      double sum = 0.0; // plain accumulator; decltype(y(0)) would be a reference type if y(i) returns double&

      #pragma omp simd reduction(+:sum)
      for (j = 0; j < x.size(); j++) {
        sum += A(i, j) * x(j);
      }

      y(i) += sum;
    }
  }

  return y;
}

Note that this may not necessarily be faster, because some compilers (ICC, for example) are able to vectorize the code automatically. GCC and Clang often fail to perform (advanced) SIMD reductions automatically, so such a directive helps them a bit. You can inspect the assembly code to see how the loop is vectorized, or enable the compiler's vectorization reports (available in GCC, for example).
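To inspect the generated assembly or a vectorization report without pulling in xtensor, the same pattern can be reduced to a small standalone function. The following is only a sketch under stated assumptions: the name matvec, the flat row-major std::vector<double> storage, and the explicit rows parameter are hypothetical and not part of the original post. With GCC it should build with -O3 -fopenmp, and adding -fopt-info-vec prints which loops were vectorized. Without -fopenmp the pragmas are ignored and the function still computes the same product serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): computes y = A * x for a
// dense matrix A stored row-major in a flat std::vector<double>.
std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x,
                           std::size_t rows)
{
  const std::ptrdiff_t n_rows = static_cast<std::ptrdiff_t>(rows);
  const std::ptrdiff_t n_cols = static_cast<std::ptrdiff_t>(x.size());
  assert(A.size() == rows * x.size());

  std::vector<double> y(rows, 0.0);
  const double* a = A.data();
  const double* xv = x.data();
  double* yv = y.data();

  // Rows are independent of each other, so they are divided among threads.
  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0; i < n_rows; ++i) {
    const double* row = a + i * n_cols;  // start of row i
    double sum = 0.0;                    // private per-row accumulator

    // Each SIMD lane keeps a partial sum; the lanes are combined at the end.
    #pragma omp simd reduction(+:sum)
    for (std::ptrdiff_t j = 0; j < n_cols; ++j) {
      sum += row[j] * xv[j];
    }

    yv[i] = sum;  // exactly one write per row, so threads never collide
  }

  return y;
}
```

Because the reduction clause allows the partial sums to be combined in a different order than the serial loop, results for general floating-point input may differ from a serial sum in the last few bits.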

Answer 1: 2021-05-02 17:28:51
