Optimizing loops for better performance

Question

I parallelized my code like this:

    for (int i=0; i<size; ++i) {
    
        #pragma omp parallel for
        for (int j=i; j<size; ++j) {
            int l = j+1;
            float sum = a[i*size+j];
            float sum2 = a[l*size+i];
            for (int k=0; k<i; ++k) {
                sum -= a[i*size+k] * a[k*size+j];
                sum2 -= a[l*size+k] * a[k*size+i];
            }
            a[i*size+j]=sum;
            a[l*size+i]=sum2;
        }
        
        #pragma omp parallel for
        for (int j=i+1; j<size; ++j) {
            a[j*size+i]/=a[i*size+i];
        }
    }

But I want it to be like this:

    for (int i=0; i<size; ++i) {
    
        #pragma omp parallel for
        for (int j=i; j<size; ++j) {
            int l = j+1;
            float sum = a[i*size+j];
            float sum2 = a[l*size+i];
            for (int k=0; k<i; ++k) {
                sum -= a[i*size+k] * a[k*size+j];
                sum2 -= a[l*size+k] * a[k*size+i];
            }
            a[i*size+j]=sum;
            a[l*size+i]=sum2;
            a[l*size+i]/=a[i*size+i];
        }
    }

So I can get better performance. However, if I put a[l*size+i]/=a[i*size+i]; into the same loop as the rest, I get a different result than I should. I guess it's because of the OpenMP directives because, without them, both versions give the same result.

I would be happy if someone could give me some tips on how to make this possible or how to improve the performance in general.

Answer 1

Without redesigning the code you can try something like:

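    #pragma omp parallel
    {
        // One team of threads is created here and reused for every i.
        for (int i=0; i<size; ++i)
        {
            // Worksharing loop; the implicit barrier at its end makes
            // a[i*size+i] and column i final before the division below.
            #pragma omp for
            for (int j=i; j<size; ++j) {
                // Note: as in the question, l reaches size when j == size-1.
                int l = j+1;
                float sum = a[i*size+j];
                float sum2 = a[l*size+i];
                for (int k=0; k<i; ++k) {
                    sum -= a[i*size+k] * a[k*size+j];
                    sum2 -= a[l*size+k] * a[k*size+i];
                }
                a[i*size+j]=sum;
                a[l*size+i]=sum2;
            }
            #pragma omp for
            for (int j=i+1; j<size; ++j)
                a[j*size+i]/=a[i*size+i];
        }
    }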

Instead of creating two parallel regions per iteration of the i loop (2 * size parallel regions in total), you can create a single one. Nonetheless, in an efficient implementation of the OpenMP standard, a new parallel region does not introduce as much overhead as one might think, because typically the threads are created the first time and reused in subsequent parallel regions.

Notwithstanding, one of the overheads of having multiple parallel regions is the implicit barrier at the end of each one. Unfortunately, that overhead is still present in the version that I am presenting, because each omp for also ends with an implicit barrier. To avoid that you would need to redesign the algorithm.
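
As for why the fused loop in the question gives a different result: the division reads a[i*size+i], which is written by the j=i iteration of that same loop. In a serial run that iteration always comes first, but under a parallel for another thread may divide by the old value. One possible redesign, sketched here under the assumption that a is a row-major size x size array and that no pivoting is needed, is to finish the diagonal element on one thread and then do everything else in a single parallel loop. That leaves one parallel loop (and one implicit barrier) per i instead of two. The name lu_fused is made up for this sketch, and the loop runs over j = i+1 .. size-1 with l = j, so no index goes past the last row.

```c
#include <assert.h>
#include <stddef.h>

/* In-place LU factorization (no pivoting) of a row-major size x size
 * matrix: U ends up on and above the diagonal, L (unit diagonal, not
 * stored) below it. lu_fused is a hypothetical name for this sketch. */
void lu_fused(float *a, int size)
{
    assert(a != NULL && size > 0);

    for (int i = 0; i < size; ++i) {
        /* Finish the diagonal element first, on one thread, so that no
         * iteration of the parallel loop can read a stale value. */
        float pivot = a[i*size + i];
        for (int k = 0; k < i; ++k)
            pivot -= a[i*size + k] * a[k*size + i];
        a[i*size + i] = pivot;

        /* Iteration j writes only a[i][j] and a[j][i]; it reads only
         * entries finished in earlier steps (index k < i) and the pivot,
         * so the iterations are independent of each other. */
        #pragma omp parallel for
        for (int j = i + 1; j < size; ++j) {
            float sum  = a[i*size + j];   /* element of row i of U    */
            float sum2 = a[j*size + i];   /* element of column i of L */
            for (int k = 0; k < i; ++k) {
                sum  -= a[i*size + k] * a[k*size + j];
                sum2 -= a[j*size + k] * a[k*size + i];
            }
            a[i*size + j] = sum;
            a[j*size + i] = sum2 / pivot;
        }
    }
}
```

With GCC or Clang this needs -fopenmp to run in parallel; without that flag the pragma is ignored and the loop runs serially with the same result. Whether it is actually faster depends on size and the number of threads, so it is worth timing against the two-loop version.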

Answer 1: accepted, 2021-02-06 14:55:19
