
LU decomposition using openmp

I have a problem: the parallel version of my LU decomposition algorithm runs in the same time as the sequential one:

void lup_od_omp(double* a, int n){
    int i, j, k;

    for(k = 0; k < n - 1; ++k)
    {
        #pragma omp parallel for shared(a,n,k) private(i,j)
        for(i = k + 1; i < n; i++)
        {
            a[i*n + k] /= a[k*n + k];
            for(j = k + 1; j < n; j++)
            {
                a[i*n + j] -= a[i*n + k]*a[k*n + j];
            }
        }
    }
}

Am I doing something wrong?

Since you are only working on two cores, your parallelisation may actually get in the way of the vectoriser. Vectorisation on SSE2 gives you a throughput of 2 doubles per operation, 4 on AVX.

Running on two threads has a lot of synchronisation overhead, which may slow you down, especially if you lose vectorisation. Also, for some reason, your #pragma omp does not start any threads unless omp_set_num_threads is invoked to actually make it use threads.
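As a quick check of that last point, here is a minimal sketch (the function name `active_threads` and the request for 4 threads are my own choices) that requests a thread count with omp_set_num_threads and then counts how many threads actually execute the parallel region:

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sketch: request a thread count explicitly, then count how many threads
   actually run the parallel region. Without omp_set_num_threads (or the
   OMP_NUM_THREADS environment variable) some setups default to 1 thread. */
int active_threads(void) {
    int count = 0;
#ifdef _OPENMP
    omp_set_num_threads(4);   /* ask for 4 threads; the runtime may grant fewer */
#endif
    #pragma omp parallel
    {
        #pragma omp atomic
        count++;              /* each participating thread increments once */
    }
    return count;
}
```

Compiled without OpenMP support, the pragmas are ignored and the function simply returns 1, so the count tells you directly whether threads were started.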

Another thing, also related to vectorisation: not all compilers understand that a[i*n + j] is intended to address a two-dimensional array, so it is better to declare it as such in the first place.
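For instance, a flat heap buffer can be viewed as a two-dimensional array through a C99 variable-length-array pointer. This is a small sketch of the idiom (the helper names `make_matrix` and `fill_identity` are made up for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: allocate a flat n*n buffer, then access it through a C99 VLA
   pointer so the compiler sees a[i][j] instead of a[i*n + j]. */
double *make_matrix(int n) {
    return malloc((size_t)n * (size_t)n * sizeof(double));
}

void fill_identity(int n, double (*a)[n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i][j] = (i == j) ? 1.0 : 0.0;  /* 2-D indexing on a flat buffer */
}
```

A `double *` from malloc can then be passed to a function taking `double (*a)[n]` by casting it to `(double (*)[n])`, which is exactly the signature used in the rewritten code below.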

Here is a slight optimisation of your code that runs fairly well on my Xeon:

void lup_od_omp(int n, double (*a)[n]){
    int i,k;

    for(k = 0; k < n - 1; ++k) {
        // for the vectoriser
        for(i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];
        }

        #pragma omp parallel for shared(a,n,k) private(i) schedule(static, 64)
        for(i = k + 1; i < n; i++) {
            int j;
            const double aik = a[i][k]; // some compilers will do this automatically
            for(j = k + 1; j < n; j++) {
                a[i][j] -= aik * a[k][j];
            }
        }
    }
}

Runtimes for an array of 3000x3000 icc -O2 :

Your code sequential:  0:24.61 99%  CPU
Your code 8 threads :  0:05.21 753% CPU
My   code sequential:  0:18.53 99%  CPU
My   code 8 threads :  0:05.42 766% CPU

And on a different machine I tested it on AVX (256-bit vectors, 4 doubles per op):

My code on AVX sequential :  0:09.45 99%  CPU
My code on AVX 8 threads  :  0:03.92 766% CPU

As you can see I have improved the vectoriser a little, but didn't do much for the parallel section.

The main problem with your code is that the workload is decomposed poorly.

For a single LU decomposition, you invoke parallel for n-1 times. Each time, parallel for performs a thread fork and join, which introduces a lot of overhead. In particular, when k is large, the inner loops ( for(i){for(j){...}} ) contain very little work, and parallelising them is quite inefficient.
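One common way to cut that cost, sketched here under the assumption that a per-step barrier is cheaper than a per-step fork/join: hoist the parallel region out of the k loop so threads are created once, and let the implicit barrier at the end of the `omp for` work-sharing construct keep the k iterations ordered (the function name is made up):

```c
#include <assert.h>

/* Sketch: one parallel region for the whole factorisation. Every thread runs
   the k loop; the work-sharing `omp for` splits the i loop among threads at
   each step, paying only a barrier instead of a fork/join per k. */
void lup_od_single_region(int n, double *a) {
    #pragma omp parallel
    for (int k = 0; k < n - 1; ++k) {
        #pragma omp for schedule(static)
        for (int i = k + 1; i < n; i++) {
            a[i*n + k] /= a[k*n + k];
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
        }
        /* implicit barrier here: all row updates for this k finish
           before any thread starts step k+1 */
    }
}
```

This is still the same row-wise decomposition, so it does not fix the shrinking-work problem for large k; proper agglomeration (below) addresses that.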

You may consider using a proper agglomeration scheme to reduce the overhead. For more information, refer to these slides:

http://courses.engr.illinois.edu/cs554/notes/06_lu_8up.pdf

On the other hand, you could use an existing performance library, such as Intel MKL, to get maximum performance on LU factorisation:

http://software.intel.com/en-us/node/468682
