
Performance of matrix multiplications remains unchanged with OpenMP in C++

auto t1 = chrono::steady_clock::now();
#pragma omp parallel
{
    for (int i = 0; i < n; i++)
    {
        #pragma omp for collapse(2)
        for (int j = 0; j < n; j++)
        {
            for (int k = 0; k < n; k++)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
auto t2 = chrono::steady_clock::now();

auto t = std::chrono::duration_cast<chrono::microseconds>(t2 - t1).count();

With and without the parallelization, the value of t remains fairly constant, and I am not sure why this is happening. Occasionally t is even output as 0. Another problem I am facing: if I increase n to something like 500, the program fails to run. (Here I've taken n = 100.) I am using Code::Blocks with the GNU GCC compiler.

The proposed OpenMP parallelization is not correct and may lead to wrong results. When collapse(2) is specified, threads execute the (j,k) iterations "simultaneously". If two (or more) threads work on the same j but different k, they accumulate the result of A[i][k]*B[k][j] into the same array location C[i][j]. This is a so-called race condition, i.e. "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?). Although the code is not valid OpenMP, the data race does not necessarily produce wrong results: whether it does depends on several factors (scheduling, compiler implementation, number of threads, ...). To fix the problem in the code above, OpenMP offers the reduction clause:

#pragma omp parallel
{
    for (int i = 0; i < n; i++)
    {
        #pragma omp for collapse(2) reduction(+:C)
        for (int j = 0; j < n; j++)
        {
            for (int k = 0; k < n; k++)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
so that "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" ( http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf ). Note that the reduction on arrays in C is directly supported by the standard since OpenMP 4.5 (check if the compiler support it, otherwise there are old manual ways to achieve it, Reducing on array in OpenMp ).

However, for the given code, it is probably more appropriate to avoid parallelizing the innermost loops, so that the reduction is not needed at all:

#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            for (int k = 0; k < n; k++)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
The serial version can be faster than the OpenMP one for small matrix sizes and/or a small number of threads. On my Intel machine, using up to 16 cores, with n=1000 and GNU compiler v6.1, the break-even point is around 4 cores when -O3 optimization is activated, while it is around 2 cores when compiling with -O0. For clarity, these are the timings I measured (in microseconds):

Serial        418020

           WRONG ORIG    +REDUCTION    OUTER.COLLAPSE    OUTER.NOCOLLAPSE
OpenMP-1      1924950       2841993           1450686             1455989
OpenMP-2       988743       2446098            747333              745830
OpenMP-4       515266       3182262            396524              387671
OpenMP-8       280285       5510023            219506              211913
OpenMP-16     2227567      10807828            150277              123368

With the reduction, the performance loss is dramatic (a reversed speed-up). Parallelizing the outer loops (with or without collapse) is the best option.

As for the failure with large matrices, a possible reason is the size of the available stack. Try enlarging both the system and OpenMP stack sizes, i.e.

ulimit -s unlimited
export OMP_STACKSIZE=10000000
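Alternatively, if the matrices are declared as fixed-size local (stack) arrays, as seems likely here, allocating them on the heap sidesteps the stack limit entirely. A sketch, assuming double elements and <vector>:

// Heap-allocated n x n matrices: large n no longer overflows the stack
std::vector<std::vector<double>> A(n, std::vector<double>(n)),
                                 B(n, std::vector<double>(n)),
                                 C(n, std::vector<double>(n, 0.0));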

The collapse clause may actually be responsible for this, because the index j is recreated using divide/mod operations.
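Conceptually (a simplified sketch of what the compiler generates, not actual GCC output), collapse(2) flattens the two loops into a single iteration space and recovers the original indices on every iteration:

// Each thread receives a chunk [lo, hi) of the flattened (i, j) space
for (long long idx = lo; idx < hi; ++idx)
{
    int i = static_cast<int>(idx / n);  // integer division per iteration
    int j = static_cast<int>(idx % n);  // modulo per iteration
    for (int k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
}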

Did you try without collapse?
