

Performance of matrix multiplications remains unchanged with OpenMP in C++

auto t1 = chrono::steady_clock::now();
#pragma omp parallel
{
    for(int i=0;i<n;i++)
    {
        #pragma omp for collapse(2)
        for(int j=0;j<n;j++)
        {
            for(int k=0;k<n;k++)
            {
                C[i][j]+=A[i][k]*B[k][j];
            }
        }
    }
}
auto t2 = chrono::steady_clock::now();

auto t = std::chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();

With and without the parallelization, the variable t remains fairly constant. I am not sure why this is happening. Also, once in a while t is output as 0. One more problem I am facing is that if I increase the value of n to something like 500, the program fails to run (here I have taken n=100). I am using Code::Blocks with the GNU GCC compiler.
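For reference, a minimal complete program around this snippet might look like the sketch below (reconstructed; the array declarations, the int element type, and the g++ -fopenmp build flag are assumptions, since the question only shows the timed fragment):

#include <chrono>
#include <iostream>
using namespace std;

int main() {
    const int n = 100;                       // the question uses n = 100
    // Local (stack) arrays are an assumption; they would also explain the
    // failure at n = 500 discussed in the answer below.
    int A[n][n], B[n][n], C[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) { A[i][j] = 1; B[i][j] = 1; C[i][j] = 0; }

    auto t1 = chrono::steady_clock::now();
    #pragma omp parallel
    {
        // Same parallelization as in the question, kept here only to reproduce it.
        for (int i = 0; i < n; i++) {
            #pragma omp for collapse(2)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
        }
    }
    auto t2 = chrono::steady_clock::now();

    auto t = chrono::duration_cast<chrono::microseconds>(t2 - t1).count();
    cout << t << " us, C[0][0] = " << C[0][0] << endl;   // build with: g++ -O2 -fopenmp
    return 0;
}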

The proposed OpenMP parallelization is not correct and may lead to wrong results. When specifying collapse(2), threads execute "simultaneously" the (j,k) iterations. If two (or more) threads work on the same j but different k, they accumulate the result of A[i][k]*B[k][j] into the same array location C[i][j]. This is a so-called race condition, i.e. "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?). Even though the code is not valid OpenMP, the data race does not necessarily produce wrong results; whether it does depends on several factors (scheduling, compiler implementation, number of threads, ...). To fix the problem in the code above, OpenMP offers the reduction clause:

#pragma omp parallel
{
    for(int i=0;i<n;i++) {
        #pragma omp for collapse(2) reduction(+:C)
        for(int j=0;j<n;j++) {
            for(int k=0;k<n;k++) {
                C[i][j]+=A[i][k]*B[k][j];
            }
        }
    }
}
so that "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf). Note that reduction on arrays in C has been directly supported by the standard since OpenMP 4.5 (check whether your compiler supports it; otherwise there are older manual ways to achieve it, see Reducing on array in OpenMp).
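A sketch of one such manual approach (not from the original answer; the function name, the double element type, and the pointer-based matrix arguments are assumptions) accumulates into a per-thread buffer and merges it into C under a critical section:

#include <vector>

// Manual reduction for compilers without OpenMP 4.5 array reductions:
// each thread accumulates its partial sums into a private buffer, and the
// critical section serializes the final merge into the shared C.
void matmul_manual_reduction(int n, double **A, double **B, double **C)
{
    #pragma omp parallel
    {
        std::vector<double> local(static_cast<size_t>(n) * n, 0.0); // per-thread copy of C

        for (int i = 0; i < n; i++) {
            #pragma omp for collapse(2)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    local[static_cast<size_t>(i) * n + j] += A[i][k] * B[k][j];
        }

        #pragma omp critical
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                C[i][j] += local[static_cast<size_t>(i) * n + j];
    }
}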

However, for the given code, it is probably more appropriate to avoid parallelizing the innermost loop, so that the reduction is not needed at all:

#pragma omp parallel
{
    #pragma omp for collapse(2)
    for(int i=0;i<n;i++) {
        for(int j=0;j<n;j++) {
            for(int k=0;k<n;k++) {
                C[i][j]+=A[i][k]*B[k][j];
            }
        }
    }
}
The serial version can be faster than the OpenMP version for small matrix sizes and/or small numbers of threads. On my Intel machine, using up to 16 cores, n=1000 and GNU compiler v6.1, the break-even point is around 4 cores when -O3 optimization is enabled, and around 2 cores when compiling with -O0. For clarity, I report the performance figures I measured (times in microseconds):

Serial      418020
----------- WRONG ORIG -- +REDUCTION -- OUTER.COLLAPSE -- OUTER.NOCOLLAPSE -
OpenMP-1   1924950        2841993        1450686          1455989
OpenMP-2    988743        2446098         747333           745830
OpenMP-4    515266        3182262         396524           387671
OpenMP-8    280285        5510023         219506           211913  
OpenMP-16  2227567       10807828         150277           123368

Using the reduction, the performance loss is dramatic (reversed speed-up). The outer parallelization (with or without collapse) is the best option.

As concerns your failure with large matrices, a possible reason is the size of the available stack. Try to enlarge both the system and the OpenMP stack sizes, i.e.

ulimit -s unlimited
export OMP_STACKSIZE=10000000
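
Independently of the stack settings (and purely as an illustration, not part of the original answer), allocating the matrices on the heap sidesteps the stack limit entirely; a minimal sketch using std::vector:

#include <vector>

int main() {
    const int n = 500;
    // std::vector keeps the element storage on the heap, so a large n no
    // longer overflows the (thread) stack.
    std::vector<std::vector<double>> A(n, std::vector<double>(n, 1.0));
    std::vector<std::vector<double>> B(n, std::vector<double>(n, 1.0));
    std::vector<std::vector<double>> C(n, std::vector<double>(n, 0.0));

    // Outer parallelization as recommended above: each (i,j) is written by one thread.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];

    return 0;
}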

The collapse directive may actually be responsible for this, because the index j is recreated using divide/mod operations.

Did you try without collapse?
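For reference, the no-collapse variant (presumably what the OUTER.NOCOLLAPSE column above measures) simply parallelizes the outermost loop; a minimal sketch:

// Outer-loop-only parallelization (no collapse): each thread gets a block of i values,
// so no divide/mod index reconstruction is needed.
#pragma omp parallel for
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];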
