
Incomprehensible performance improvement with OpenMP even when num_threads(1)

The following lines of code

int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );

unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

take 11130 usecs to run on my i5-3230M when compiled with

g++ -o main main.cpp -std=c++0x -O3

That is, when the OpenMP pragmas are ignored.

On the other hand, it only takes 1496 usecs when compiled with

g++ -o main main.cpp -std=c++0x -O3 -fopenmp

This is more than 6 times faster, which is quite surprising taking into account that it runs on a 2-core machine. In fact, I have also tested it with num_threads(1) and the performance improvement is still substantial (more than 3 times faster).

Can anybody help me understand this behaviour?

EDIT: following the suggestions, here is the full piece of code:

#include <stdlib.h>
#include <iostream>

#include <chrono>
#include <cassert>


int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;


void func()
{
    unsigned char *pbuff = buff;
    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}


int main()
{
    // allocation & initialization
    buff = (unsigned char *) malloc( numel );
    assert(buff != NULL);
    for (size_t k=0; k<numel; k++)
        buff[k] = 0;

    //
    std::chrono::high_resolution_clock::time_point begin;
    std::chrono::high_resolution_clock::time_point end;
    begin = std::chrono::high_resolution_clock::now();
    //
    for(int k=0; k<100; k++)
        func();
    //
    end = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
    std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;

    return 0;
}

The answer, as it turns out, is that firstprivate(pbuff, nrows, ncols) effectively declares pbuff, nrows and ncols as local variables within the scope of the for loop. That in turn means the compiler can treat nrows and ncols as constants. It cannot make the same assumption about global variables: since pbuff is an unsigned char *, the store *pbuff += 1 is allowed to alias almost any object, including the globals nrows and ncols, so without OpenMP the compiler must reload them from memory on every iteration.

Consequently, with -fopenmp you end up with the huge speedup because you aren't accessing a global variable on each iteration. (Plus, with a constant ncols value, the compiler gets to do a bit of loop unrolling.)
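
To make the mechanism concrete, here is a rough sketch (an illustration only, not the code GCC actually generates when it outlines the parallel region) of what the firstprivate copies give the optimizer: the loop bounds become locals whose addresses never escape, so the byte store through pbuff cannot touch them and they can stay in registers.

static void func_outlined(unsigned char *pbuff, int nrows, int ncols)
{
    // nrows and ncols are now private copies; the compiler knows the
    // store through pbuff cannot modify them, so no reloads are needed
    // and the inner loop bound is effectively a constant.
    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}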

By changing

int nrows = 4096;
int ncols = 4096;

to

const int nrows = 4096;
const int ncols = 4096;

or by changing

for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

to

int _nrows = nrows;
int _ncols = ncols;
for (int i=0; i<_nrows; i++)
{
    for (int j=0; j<_ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

the anomalous speedup vanishes - the non-OpenMP code is now just as fast as the OpenMP code.

The moral of the story? Avoid accessing mutable global variables inside performance-critical loops.
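
For instance, the same idea applied at the interface level (a hypothetical refactoring, not from the original post) is to pass the buffer and its dimensions in as parameters, so they are local values by construction and need no OpenMP clause to become optimizable:

// Hypothetical refactoring: parameters, like locals whose addresses
// never escape, cannot be aliased by the byte store through pbuff,
// so the optimizer can keep the loop bounds in registers.
void increment_all(unsigned char *buff, int nrows, int ncols)
{
    unsigned char *pbuff = buff;
    for (int i = 0; i < nrows; i++)
        for (int j = 0; j < ncols; j++)
            *pbuff++ += 1;
}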
