
Optimising: why is OpenMP much slower than the sequential way?

I am a newbie in programming with OpenMP. I wrote a simple C program to multiply a matrix with a vector. Unfortunately, by comparing execution times I found that the OpenMP version is much slower than the sequential way.

Here is my code (the matrix is N*N int, the vector is N int, the result is N long long):

#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size)
for(i=0;i<m_size;i++)
{  
  for(j=0;j<m_size;j++)
  {  
    result[i]+=matrix[i][j]*vector[j];
  }
}

And this is the code for the sequential way:

for (i=0;i<m_size;i++)
        for(j=0;j<m_size;j++)
            result[i] += matrix[i][j] * vector[j];

When I tried these two implementations with a 999x999 matrix and a 999-element vector, the execution times were:

Sequential: 5439 ms, Parallel: 11120 ms

I really cannot understand why OpenMP is so much slower than the sequential algorithm (over 2 times slower!). Can anyone solve my problem?

Your code partially suffers from so-called false sharing, typical for all cache-coherent systems. In short, many elements of the result[] array fit in the same cache line. When thread i writes to result[i] as a result of the += operator, the cache line holding that part of result[] becomes dirty. The cache coherency protocol then invalidates all copies of that cache line in the other cores, and they have to refresh their copy from the upper-level cache or from main memory. Since result is an array of long long, one cache line (64 bytes on x86) holds 8 elements, so besides result[i] there are 7 other array elements in the same cache line. Therefore it is possible that two "neighbouring" threads will constantly fight for ownership of the cache line (assuming that each thread runs on a separate core).

To mitigate false sharing in your case, the easiest thing to do is to ensure that each thread gets an iteration block whose size is divisible by the number of elements in a cache line. For example, you can apply schedule(static,something*8), where something should be big enough that the iteration space is not fragmented into too many pieces, but at the same time small enough that each thread still gets a block. E.g. for m_size equal to 999 and 4 threads, you would apply the schedule(static,256) clause to the parallel for construct.
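A minimal sketch of that clause applied to the question's loop (256 is chosen for 4 threads as explained above):

/* Each thread now gets a contiguous block of 256 rows, so concurrent writes to
   result[] land on distinct cache lines except at the block boundaries. */
#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size) schedule(static,256)
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        result[i] += matrix[i][j] * vector[j];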

Another partial reason for the code running slower might be that, when OpenMP is enabled, the compiler becomes reluctant to apply some code optimisations when shared variables are being assigned to. OpenMP provides the so-called relaxed memory model, where the local memory view of a shared variable is allowed to differ between threads, and the flush construct is provided in order to synchronise the views. But compilers usually treat shared variables as implicitly volatile if they cannot prove that other threads would not need to access desynchronised shared variables. Your case is one of those, since result[i] is only assigned to and its value is never used by other threads. In the serial case the compiler would most likely create a temporary variable to hold the result of the inner loop and would only assign to result[i] once the inner loop has finished. In the parallel case it might decide that this would create a temporarily desynchronised view of result[i] in the other threads and hence decide not to apply the optimisation. Just for the record, GCC 4.7.1 with -O3 -ftree-vectorize does the temporary-variable trick both with OpenMP enabled and without.
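A sketch of performing that accumulation by hand, so it no longer depends on what the compiler decides to do (the local name sum is only illustrative):

#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size)
for (i = 0; i < m_size; i++)
{
    long long sum = 0;                 /* thread-private accumulator, can stay in a register */
    for (j = 0; j < m_size; j++)
        sum += matrix[i][j] * vector[j];
    result[i] = sum;                   /* a single write to the shared array per row */
}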

Because when OpenMP distributes the work among threads, there is a lot of administration/synchronisation going on to ensure the values in your shared matrix and vector are not corrupted somehow. Even though they are read-only, humans see that easily; your compiler may not.

Things to try out for pedagogic reasons:

0) What happens if matrix and vector are not shared?

1) Parallelize the inner "j-loop" first and keep the outer "i-loop" serial. See what happens (a sketch of this variant follows the list).

2) Do not collect the sum in result[i], but in a variable temp, and assign its contents to result[i] only after the inner loop is finished, to avoid repeated index lookups. Don't forget to initialise temp to 0 before the inner loop starts.
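For item 1, a sketch of what parallelising only the inner loop could look like, assuming a reduction clause is used to combine the partial sums (illustrative only):

for (i = 0; i < m_size; i++)
{
    long long sum = 0;
    /* a team of threads works on a single row; the fork/join and reduction
       overhead is paid m_size times, which is the point of the experiment */
    #pragma omp parallel for reduction(+:sum)
    for (j = 0; j < m_size; j++)
        sum += matrix[i][j] * vector[j];
    result[i] = sum;
}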

I did this in reference to Hristo's comment. I tried using schedule(static, 256). For me, changing the default chunk size does not help; maybe it even makes things worse. I printed out the thread number and its index with and without setting the schedule, and it's clear that OpenMP already chooses the thread indices to be far from one another, so false sharing does not seem to be an issue. For me this code already gives a good boost with OpenMP.

#include <stdio.h>
#include <omp.h>

// Row i of the matrix starts at matrix[i*ld]; each row's dot product is
// accumulated in a local variable before the single write to result[i].
void loop_parallel(const int *matrix, const int ld, const int *vector, long long *result, const int m_size) {
    #pragma omp parallel for schedule(static, 250)
    //#pragma omp parallel for
    for (int i = 0; i < m_size; i++) {
        //printf("%d %d\n", omp_get_thread_num(), i);
        long long sum = 0;
        for (int j = 0; j < m_size; j++) {
            sum += matrix[i*ld + j] * vector[j];
        }
        result[i] = sum;
    }
}

void loop(const int *matrix, const int ld, const int *vector, long long *result, const int m_size) {
    for (int i = 0; i < m_size; i++) {
        long long sum = 0;
        for (int j = 0; j < m_size; j++) {
            sum += matrix[i*ld + j] * vector[j];
        }
        result[i] = sum;
    }
}

int main() {
    const int m_size = 1000;
    int *matrix = new int[m_size*m_size];
    int *vector = new int[m_size];
    long long *result = new long long[m_size];
    double dtime;

    // Fill the inputs so the kernels do not read uninitialised memory.
    for (int i = 0; i < m_size; i++) {
        vector[i] = i;
        for (int j = 0; j < m_size; j++)
            matrix[i*m_size + j] = i + j;
    }

    dtime = omp_get_wtime();
    loop(matrix, m_size, vector, result, m_size);
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    dtime = omp_get_wtime();
    loop_parallel(matrix, m_size, vector, result, m_size);
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    delete[] matrix;
    delete[] vector;
    delete[] result;
}
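For reference, assuming the file is saved as matvec.cpp, a typical way to build and run it with GCC is:

g++ -O3 -fopenmp matvec.cpp -o matvec
./matvec

The -fopenmp flag is needed both to honour the pragmas and to link against the OpenMP runtime that provides omp_get_wtime.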
