Best way to parallelize matrix-vector multiplication using OpenMP

I have the following code, which I have parallelized using OpenMP:

#pragma omp parallel shared(matrix, result, vector) private(i, j)
  {
#pragma omp for schedule(static)
    for (i = 0; i < n; i++)
    {
      /* j stops at i because matrix[i][j] == 0 for j > i */
      for (j = 0; j <= i && j < n; j++)
      {
        result[i] += matrix[i * n + j] * vector[j];
      }
    }
  }

I added the pragma directives above to the for loop that computes the product of a matrix and a column vector. This does speed things up, but is there a more efficient way to do it with OpenMP? I tried the different schedule types: static, dynamic, runtime, guided, and auto. Static and auto seem to give nearly the best results for matrices as large as 30000 x 30000. The matrix has the property that matrix[i][j] = 0 if j > i.
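
One way to compare the schedule types mentioned above without recompiling is schedule(runtime), which reads the schedule from the OMP_SCHEDULE environment variable. The following is only a minimal sketch: the function name tri_matvec_runtime is hypothetical, and it assumes a row-major n x n matrix of doubles and a compiler that accepts an unsigned loop index (OpenMP 3.0 or later).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post):
   result += L * vector, where L is the lower triangle of a
   row-major n x n matrix. The loop schedule is picked at run
   time from the OMP_SCHEDULE environment variable. */
void tri_matvec_runtime(const double *matrix, const double *vector,
                        double *result, size_t n)
{
    assert(matrix && vector && result);

    #pragma omp parallel for schedule(runtime)
    for (size_t i = 0; i < n; i++) {
        /* Row i touches only columns 0..i, so it costs i + 1
           multiply-adds: later rows are more expensive. */
        for (size_t j = 0; j <= i; j++) {
            result[i] += matrix[i * n + j] * vector[j];
        }
    }
}
```

The program can then be run with, for example, OMP_SCHEDULE="static", OMP_SCHEDULE="static,64", OMP_SCHEDULE="dynamic,16" or OMP_SCHEDULE="guided". Because row i costs i + 1 multiply-adds, a plain static schedule hands the thread that owns the last block of rows much more work than the thread that owns the first block. A static schedule with a small chunk size interleaves cheap and expensive rows and may balance the load better. Whether it actually wins depends on the machine, so it is worth measuring.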

  1. It may help the compiler optimize the code if you accumulate the result of the j loop in a local temporary variable. Your compiler may already do this for you, but if it does not, the change can make the code much faster.

  2. Always declare your variables in the minimum required scope; this also helps the compiler optimize.

  3. Make sure your compiler can effectively vectorize your code: use the appropriate compiler flags, and if you use pointers, tell the compiler that there is no loop-carried dependency by using the restrict keyword or by adding #pragma ivdep (Intel compiler), #pragma GCC ivdep (GCC), #pragma loop(ivdep) (MSVC), or #pragma clang loop vectorize(assume_safety) (clang) before the inner loop.

So your code should look something like this:

#pragma omp parallel for shared(matrix, result, vector) schedule(static)
    for (size_t i = 0; i < n; i++)
    {
      double sum = 0; // local accumulator, private to each iteration
      #pragma GCC ivdep
      for (size_t j = 0; j <= i; j++) // as suggested by @tstanisl
      {
        sum += matrix[i * n + j] * vector[j];
      }
      result[i] += sum;
    }
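
Point 3 also mentions the restrict keyword. As a rough sketch of how that might look, the loop can be wrapped in a function whose pointer parameters are restrict-qualified. The function name tri_matvec_restrict is hypothetical, and the sketch assumes the three arrays never overlap. It also swaps the vendor-specific ivdep pragmas for the portable #pragma omp simd with a reduction clause (OpenMP 4.0 or later), which is a different technique from the one shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original answer).
   Assumption: matrix, vector and result do not overlap; restrict
   is a promise the caller must keep, otherwise behavior is undefined. */
void tri_matvec_restrict(const double *restrict matrix,
                         const double *restrict vector,
                         double *restrict result, size_t n)
{
    assert(matrix && vector && result);

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        const double *row = matrix + i * n; /* start of row i */
        double sum = 0.0;                   /* local accumulator */

        /* The reduction clause lets the compiler reorder the
           floating-point additions so the loop can be vectorized. */
        #pragma omp simd reduction(+:sum)
        for (size_t j = 0; j <= i; j++) {
            sum += row[j] * vector[j];
        }
        result[i] += sum;
    }
}
```

With GCC this could be built with something like gcc -O3 -march=native -fopenmp. Note that a vectorized reduction adds the terms in a different order than the scalar loop, so results may differ in the last few bits.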
