
Parallelizing matrix times a vector by columns and by rows with OpenMP

For some homework I have, I need to implement the multiplication of a matrix by a vector, parallelizing it by rows and by columns. I do understand the row version, but I am a little confused by the column version.

Let's say we have the following data:

[Image: matrix times vector]

And the code for the row version:

#pragma omp parallel default(none) shared(i,v2,v1,matrix,tam) private(j)
  {
#pragma omp for
    for (i = 0; i < tam; i++)
      for (j = 0; j < tam; j++){
//        printf("Thread %d did %d,%d\n", omp_get_thread_num(), i, j);
        v2[i] += matrix[i][j] * v1[j];
      }
  }

Here the calculations are done right and the result is correct.

The column version:

#pragma omp parallel default(none) shared(j,v2,v1,matrix,tam) private(i)
  {
    for (i = 0; i < tam; i++)
#pragma omp for
      for (j = 0; j < tam; j++) {
//            printf("Thread %d did %d,%d\n", omp_get_thread_num(), i, j);
        v2[i] += matrix[i][j] * v1[j];
      }
  }

Here, due to how the parallelization is done, the result varies from run to run depending on which thread executes each column. But something interesting happens (I would guess because of compiler optimizations): if I uncomment the printf, then the results are all the same as in the row version and therefore correct, for example:

Thread 0 did 0,0
Thread 2 did 0,2
Thread 1 did 0,1
Thread 2 did 1,2
Thread 1 did 1,1
Thread 0 did 1,0
Thread 2 did 2,2
Thread 1 did 2,1
Thread 0 did 2,0

 2.000000  3.000000  4.000000 
 3.000000  4.000000  5.000000 
 4.000000  5.000000  6.000000 


V2:
20.000000, 26.000000, 32.000000,

That is correct, but if I remove the printf:

V2:
18.000000, 11.000000, 28.000000,

What kind of mechanism should I use to get the column version right?

Note: I care more about the explanation than about any code you may post as an answer, because what I really want is to understand what is going wrong in the column version.
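
For reference, one minimal mechanism that removes the conflict is to serialize the updates to v2[i] with an atomic update. This is only a sketch, reusing matrix, v1, v2 and tam from above, and it is likely slow because every single addition is synchronized:

#pragma omp parallel default(none) shared(v2, v1, matrix, tam)
  {
    int i, j;                          // declared inside the region, so private to each thread
    for (i = 0; i < tam; i++) {
#pragma omp for
      for (j = 0; j < tam; j++) {
#pragma omp atomic
        v2[i] += matrix[i][j] * v1[j]; // each concurrent update to v2[i] is now atomic
      }
    }
  }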

EDIT

I've found a way to get rid of the private vector proposed by Z boson in his answer. I've replaced that vector with a scalar variable; here is the code:

#pragma omp parallel
  {
    double sLocal = 0;                 // each thread's private partial sum for the current row
    int i, j;
    for (i = 0; i < tam; i++) {
#pragma omp for
      for (j = 0; j < tam; j++) {
        sLocal += matrix[i][j] * v1[j];
      }
#pragma omp critical
      {
        v2[i] += sLocal;               // merge this thread's partial sum into row i's result
        sLocal = 0;                    // reset for the next row
      }
    }
  }
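
A variant of the same idea, sketched here under the same assumptions about matrix, v1, v2 and tam, lets OpenMP's built-in reduction do the combining instead of a manual critical section, at the cost of opening a parallel region per row:

for (int i = 0; i < tam; i++) {
  double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
  for (int j = 0; j < tam; j++)
    sum += matrix[i][j] * v1[j];       // each thread sums its share of row i
  v2[i] += sum;                        // combined result is written once, no race
}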

I don't know exactly what your homework means by parallelizing along rows and columns, but I know why your code is not working: you have a race condition when you write to v2[i]. You can fix it by making private versions of v2[i], filling them in parallel, and then merging them in a critical section.

#pragma omp parallel
{
    float v2_private[tam] = {};  // per-thread private copy of v2, zero-initialized
    int i,j;
    for (i = 0; i < tam; i++) {
        #pragma omp for
        for (j = 0; j < tam; j++) {
            v2_private[i] += matrix[i][j] * v1[j];  // no race: each thread writes only its own copy
        }
    }
    #pragma omp critical
    {
        // merge the per-thread partial results into the shared v2
        for(i=0; i<tam; i++) v2[i] += v2_private[i];
    }
}

I tested this. You can see the results here: http://coliru.stacked-crooked.com/a/5ad4153f9579304d
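
For completeness, a minimal self-contained driver along these lines would look roughly as follows. tam is fixed as TAM = 3 here, and v1 = {1, 2, 3} is an assumption chosen so that the matrix from the question yields the V2 = 20, 26, 32 shown above; compile with something like gcc -fopenmp.

#include <stdio.h>

#define TAM 3   /* assumed fixed problem size, matching the 3x3 example */

int main(void)
{
    float matrix[TAM][TAM] = {{2, 3, 4}, {3, 4, 5}, {4, 5, 6}};
    float v1[TAM] = {1, 2, 3};   /* assumed input vector consistent with the printed V2 */
    float v2[TAM] = {0};

    #pragma omp parallel
    {
        float v2_private[TAM] = {0};   /* per-thread partial result */
        int i, j;
        for (i = 0; i < TAM; i++) {
            #pragma omp for
            for (j = 0; j < TAM; j++)
                v2_private[i] += matrix[i][j] * v1[j];
        }
        #pragma omp critical
        {
            for (i = 0; i < TAM; i++) v2[i] += v2_private[i];
        }
    }

    printf("V2:\n");
    for (int i = 0; i < TAM; i++) printf("%f, ", v2[i]);
    printf("\n");
    return 0;
}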

Note that I did not explicitly define anything as shared or private. It's not necessary. Some people think you should explicitly define everything; I personally think the opposite. By defining i and j (and v2_private) inside the parallel region, they are made private.
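
As an aside, compilers that implement OpenMP 4.5 array-section reductions (e.g. recent GCC) can express the same private-copy-then-merge pattern as a reduction over v2 itself. This is only a sketch, assuming such compiler support and the same variables as above:

#pragma omp parallel for reduction(+: v2[0:tam])
for (int j = 0; j < tam; j++)          // threads take different columns
  for (int i = 0; i < tam; i++)
    v2[i] += matrix[i][j] * v1[j];     // each thread accumulates into its own private copy of v2

The runtime gives each thread a zero-initialized private copy of v2[0:tam] and sums the copies back into the shared v2 at the end of the loop.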

I'd say the row version is more efficient because it needs no private storage per thread and no critical section or mutex for partial sums. The code is also much simpler:

#pragma omp parallel for
for (int i = 0; i < tam; i++) {
    for (int j = 0; j < tam; j++) {
        v2[i] += matrix[i][j] * v1[j];
    }
}
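
A small variation on this, sketched under the same assumptions, accumulates each row into a private scalar so the shared v2[i] is touched only once per row, which may help the compiler keep the running sum in a register:

#pragma omp parallel for
for (int i = 0; i < tam; i++) {
    float sum = 0.0f;                  // private per-iteration accumulator
    for (int j = 0; j < tam; j++)
        sum += matrix[i][j] * v1[j];
    v2[i] += sum;                      // one write to the shared result per row
}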
