About the combination of OpenMP and -Ofast

I implemented OpenMP parallelization in a for loop containing a sum that is the main thing slowing down my code. When I did so, I found that the final results were not the same as those I obtained from the non-parallelized code (which is written in C). At first one might think "well, I just didn't implement the parallelization correctly", but the curious thing is that when I run the parallelized code with the -Ofast optimization, the results are suddenly correct.

That would be:

  • -O0 correct
  • -Ofast correct
  • OMP -O0 wrong
  • OMP -O1 wrong
  • OMP -O2 wrong
  • OMP -O3 wrong
  • OMP -Ofast correct!

What could -Ofast be doing that fixes an error which only appears when I use OpenMP? Any recommendation on what I could check or test? Thanks!

EDIT: Here I include the smallest version of my code that still reproduces the problem.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

#define LENGTH 100
#define R 50.0
#define URD 1.0/sqrt(2.0)
#define PI (4.0*atan(1.0)) //pi

const gsl_rng_type * Type;
gsl_rng * item;

double CalcDeltaEnergy(double **M,int sx,int sy){
    double DEnergy,r,zz;
    int k,j;
    double rrx,rry;
    int rx,ry;
    double Energy, Cpm, Cmm, Cmp, Cpp;
    DEnergy = 0;

    //OpenMP parallelization:
    #pragma omp parallel for reduction (+:DEnergy)
    for (int index = 0; index < LENGTH*LENGTH; index++){
        k = index % LENGTH;
        j = index / LENGTH;

        zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));
        for (rx = -1; rx <= 1; rx++){
            for (ry = -1; ry <= 1; ry++){
                rrx = (sx - k - rx*LENGTH)*URD;
                rry = (sy - j - ry*LENGTH)*URD;

                r = sqrt(rrx*rrx + rry*rry + zz);
                if(r != 0 && r <= R){
                    Cpm = sqrt((rrx+0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy])))*(rrx+0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy]))) + (rry+0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy])))*(rry+0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy]))) + zz);
                    Cmm = sqrt((rrx-0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy])))*(rrx-0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy]))) + (rry-0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy])))*(rry-0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy]))) + zz);
                    Cpp = sqrt((rrx+0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy])))*(rrx+0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy]))) + (rry+0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy])))*(rry+0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy]))) + zz);
                    Cmp = sqrt((rrx-0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy])))*(rrx-0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy]))) + (rry-0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy])))*(rry-0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy]))) + zz);
                    Cpm = 1.0/Cpm;
                    Cmm = 1.0/Cmm;
                    Cpp = 1.0/Cpp;
                    Cmp = 1.0/Cmp;
                    Energy = (Cpm + Cmm - Cpp - Cmp)/(0.702*0.702); // S=cte=1

                    DEnergy -= 2.0*Energy;
                }
            }
        }
    }
    return DEnergy;
}

void Initialize(double **M){
    double random;
    for(int i=0;i<(LENGTH-1);i=i+2){
        for(int j=0;j<(LENGTH-1);j=j+2) {
            random=gsl_rng_uniform(item);
            if (random<0.5) M[i][j]=PI/4.0;
            else M[i][j]=5.0*PI/4.0;

            random=gsl_rng_uniform(item);
            if (random<0.5) M[i][j+1]=3.0*PI/4.0;
            else M[i][j+1]=7.0*PI/4.0;

            random=gsl_rng_uniform(item);
            if (random<0.5) M[i+1][j]=3.0*PI/4.0;
            else M[i+1][j]=7.0*PI/4.0;

            random=gsl_rng_uniform(item);
            if (random<0.5) M[i+1][j+1]=PI/4.0;
            else M[i+1][j+1]=5.0*PI/4.0;
        }
    }
}

int main(){
    //Choose and initialize the random number generator
    gsl_rng_env_setup();
    Type = gsl_rng_default; //default=mt19937, ran2, lxs0
    item = gsl_rng_alloc (Type);

    double **S; //site matrix
    S = (double **) malloc(LENGTH*sizeof(double *));
    for (int i = 0; i < LENGTH; i++)
        S[i] = (double *) malloc(LENGTH*sizeof(double ));

    //Initialization
    Initialize(S);

    int l,m;
    for (int cl = 0; cl < LENGTH*LENGTH; cl++) {
        l = gsl_rng_uniform_int(item, LENGTH); // RNG[0, LENGTH-1]
        m = gsl_rng_uniform_int(item, LENGTH); // RNG[0, LENGTH-1]
        printf("%lf\n", CalcDeltaEnergy(S, l, m));
    }


    //Free memory
    for (int i = 0; i < LENGTH; i++)
        free(S[i]);
    free(S);
    gsl_rng_free(item);
    return 0;
} 

I compile with:

g++ [optimization] -lm test.c -o test.x -lgsl -lgslcblas -fopenmp

and run with:

GSL_RNG_SEED=123 ./test.x > test.dat

Comparing the outputs for the different optimization levels, one can see the behaviour I described above.

Disclaimer: I have little to no experience with OpenMP.

It's probably a race condition you run into when using OpenMP.

You'll need to declare all those work variables in the OpenMP loop private. One thread may calculate their values for a certain value of index, only to have them promptly overwritten by another thread working on a different value of index: variables such as k, j, rrx, rry etc. are shared between the threads.

You could use a pragma like

#pragma omp parallel for private(k,j,zz,rx,ry,rrx,rry,r,Cpm,Cmm,Cpp,Cmp,Energy) reduction (+:DEnergy)

Alternatively (credit to a comment by Zulan), you can declare the variables inside the parallel region, as locally as possible. This makes them implicitly private, is less prone to initialization issues, and is easier to reason about; see the sketch below.
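Here is a minimal sketch of what CalcDeltaEnergy could look like in that style. The a, b, c, d temporaries are shorthand I'm introducing for readability; they are algebraically identical to the subexpressions repeated in the original Cpm/Cmm/Cpp/Cmp formulas:

double CalcDeltaEnergy(double **M, int sx, int sy){
    double DEnergy = 0.0;

    #pragma omp parallel for reduction(+:DEnergy)
    for (int index = 0; index < LENGTH*LENGTH; index++){
        // Everything declared inside the loop body is implicitly
        // private to the thread executing this iteration.
        int k = index % LENGTH;
        int j = index / LENGTH;
        double zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));

        // Hoisted common subexpressions (shorthand introduced here,
        // algebraically identical to the original formulas).
        double a = 0.702*cos(M[k][j]), b = 0.702*cos(M[sx][sy]);
        double c = 0.702*sin(M[k][j]), d = 0.702*sin(M[sx][sy]);

        for (int rx = -1; rx <= 1; rx++){
            for (int ry = -1; ry <= 1; ry++){
                double rrx = (sx - k - rx*LENGTH)*URD;
                double rry = (sy - j - ry*LENGTH)*URD;
                double r = sqrt(rrx*rrx + rry*rry + zz);
                if (r != 0 && r <= R){
                    double Cpm = sqrt((rrx+0.5*(a-b))*(rrx+0.5*(a-b)) + (rry+0.5*(c-d))*(rry+0.5*(c-d)) + zz);
                    double Cmm = sqrt((rrx-0.5*(a-b))*(rrx-0.5*(a-b)) + (rry-0.5*(c-d))*(rry-0.5*(c-d)) + zz);
                    double Cpp = sqrt((rrx+0.5*(a+b))*(rrx+0.5*(a+b)) + (rry+0.5*(c+d))*(rry+0.5*(c+d)) + zz);
                    double Cmp = sqrt((rrx-0.5*(a+b))*(rrx-0.5*(a+b)) + (rry-0.5*(c+d))*(rry-0.5*(c+d)) + zz);
                    double Energy = (1.0/Cpm + 1.0/Cmm - 1.0/Cpp - 1.0/Cmp)/(0.702*0.702); // S=cte=1
                    DEnergy -= 2.0*Energy; // safe: the reduction merges per-thread sums
                }
            }
        }
    }
    return DEnergy;
}

The reduction clause gives each thread its own private DEnergy initialized to zero and adds the copies together at the end, so the -= updates no longer race.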

(You could even consider putting everything inside the outer for-loop (over index) into a function: the function-call overhead is minimal compared to the calculations.)
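As a sketch of that idea (SiteDeltaEnergy is a hypothetical name; its body would be the zz/rx/ry computation from the question, with all temporaries as ordinary locals):

// Hypothetical helper: returns one site's contribution to DEnergy.
// All of its locals are automatically private to the calling thread.
double SiteDeltaEnergy(double **M, int sx, int sy, int k, int j);

double CalcDeltaEnergy(double **M, int sx, int sy){
    double DEnergy = 0.0;
    #pragma omp parallel for reduction(+:DEnergy)
    for (int index = 0; index < LENGTH*LENGTH; index++)
        DEnergy += SiteDeltaEnergy(M, sx, sy, index % LENGTH, index / LENGTH);
    return DEnergy;
}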

As to why -Ofast together with OpenMP actually produces correct output:

My guess is: mostly luck. Here's what -Ofast does (gcc manual):

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math [...]

Here's the section on -ffast-math:

This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.

Thus, the sqrt, cos and sin calls will likely be a lot speedier. My guess is that, in this case, the calculations of the variables inside the outer loop don't bite each other, since the individual threads are so fast that they happen not to conflict. But that is a very hand-wavy explanation and a guess.
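One more hedged remark on comparing outputs: IEEE floating-point addition is not associative, and -ffast-math (via -fassociative-math) permits the compiler to reassociate it; likewise, an OpenMP reduction adds the per-thread partial sums in whatever order the threads finish. So even a fully correct parallel version may differ from the serial one in the last digits. A tiny standalone demo of the non-associativity (independent of the code above):

#include <stdio.h>

int main(void){
    // volatile discourages compile-time constant folding, so the
    // additions below happen at run time with IEEE semantics.
    volatile double big = 1.0e16, small = 1.0;
    double left  = (big + small) + small; // each 1.0 is rounded away in turn
    double right = big + (small + small); // the 1.0s survive as a single 2.0
    printf("left  = %.1f\nright = %.1f\n", left, right); // they differ by 2.0
    return 0;
}

Under -ffast-math the compiler is free to rewrite one form into the other, which is exactly the kind of IEEE guarantee the manual says you give up.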
