
Multithreaded Program for Sparse Matrices

I am a newbie to multithreading. I am trying to design a program that solves a sparse linear system. In my code I call a vector-vector dot product and a matrix-vector product as subroutines many times to arrive at the final solution. I am trying to parallelise the code using OpenMP (especially the above two subroutines). I also have sequential code in between which I do not intend to parallelise.

My question is: how do I handle the threads created when the subroutines are called? Should I put a barrier at the end of every subroutine call?

Also, where should I set the number of threads?

Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)  
    rm[i] = b[i] - rm[i];

#pragma omp barrier


#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)  
    xm[i] = x0[i];
#pragma omp barrier


double* pm = (double*) malloc(numcols*sizeof(double));

#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
    pm[i] = rm[i];
#pragma omp barrier

scalarProd(rm,rm,numcols);

Thanks

EDIT:

For the scalar dot product, I am using the following piece of code:

double scalarProd(double* vec1, double* vec2, int n){
    double prod = 0.0;
    int i;
    //double* c = (double*) malloc(n*sizeof(double));

    omp_set_num_threads(4);

    // #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
    #pragma omp parallel
    {
        double pprod = 0.0;   // per-thread partial sum
        #pragma omp for
        for(i=0;i<n;i++) {
            pprod += vec1[i]*vec2[i];
        }

        //#pragma omp for reduction (+:prod)
        #pragma omp critical
        prod += pprod;        // each thread adds its partial sum exactly once
    }

    return prod;
}

I have now added timing code to my ConjugateGradient function as below:

start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);

Observed results for the dot product: sequential version 0.000007 s, parallel version 0.002110 s.

I am doing a simple compile with gcc -fopenmp on Linux, on my Intel i7 laptop.

I am currently using a matrix of size n = 5000.

I am seeing a huge overall slowdown, since the same dot product gets called many times until convergence is achieved (around 80k times).

Please suggest some improvements. Any help is much appreciated!

Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallel regions you are using. Every time you split the work among your threads, you pay OpenMP fork/join overhead. Try to avoid this whenever possible.

So in your case at the very least I would try:

Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region

// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
    rm[i] = b[i] - rm[i]; // (1)
    xm[i] = x0[i];        // (2) does not require (1)
    pm[i] = rm[i];        // (3) requires (1) at this i, not (2)
}  
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel

scalarProd(rm,rm,numcols);

Notice that no barriers are actually necessary between your loops anyway.

If the majority of your time were spent in this computation stage, you would surely see considerable improvement. However, I'm reasonably confident that the majority of your time is spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll save here is probably minimal.
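
If you later have loops that cannot be fused into a single loop like this, you can still pay the fork/join cost only once by putting several worksharing loops inside one parallel region. Here is a minimal sketch (assuming, as in your code, that xm, x0, rm, pm and b are distinct, non-aliasing arrays); the nowait clause removes the implied barrier after the first loop, which is safe here because the two loops touch disjoint arrays:

#pragma omp parallel              // threads forked once for both loops
{
    #pragma omp for schedule(static) nowait  // no barrier: next loop uses different arrays
    for(int i = 0; i < numcols; i++)
        xm[i] = x0[i];

    #pragma omp for schedule(static)
    for(int i = 0; i < numcols; i++) {
        rm[i] = b[i] - rm[i];
        pm[i] = rm[i];
    }
}   // one implicit barrier and join here instead of several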

EDIT:

As per your edit, I see a few problems. (1) Always compile with -O3 when you are testing the performance of your algorithm. (2) You won't be able to improve the runtime of something that takes 0.000007 sec to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is an inherently sequential algorithm, but there are certainly research papers detailing parallel CG. (3) Your implementation of the scalar product is not optimal; indeed, I suspect your implementation of the matrix-vector product is not either. I would personally do the following:

double scalarProd(double* vec1, double* vec2, int n) {
    double prod = 0.0;
    int i;

    // omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
    #pragma omp parallel for private(i) reduction(+:prod)
    for (i = 0; i < n; ++i) {
        prod += vec1[i]*vec2[i];
    }
    return prod;
}
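
For reference, reduction(+:prod) gives each thread a private copy of prod initialized to zero, lets each thread accumulate into that copy inside the loop, and combines the per-thread copies into the shared prod at the end of the construct. It is the same pattern as your hand-written pprod/critical version, but expressed in a single clause and left to the runtime to implement efficiently.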

(4) There are entire libraries (BLAS, LAPACK, etc.) that provide highly optimized matrix-vector, vector-vector, and similar operations; any serious linear algebra library is built on top of them. I'd suggest using one of those libraries for your two operations before reinventing the wheel and implementing your own.
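
For example, the scalar product can be delegated to the CBLAS routine cblas_ddot. This is a sketch that assumes a CBLAS implementation such as OpenBLAS or ATLAS is installed; the linker flag (e.g. -lopenblas or -lcblas) depends on which one you use:

#include <cblas.h>   // CBLAS header, provided by OpenBLAS, ATLAS, etc.

// Drop-in replacement for the hand-written scalarProd() above.
double scalarProd(double* vec1, double* vec2, int n) {
    // ddot returns the sum of vec1[i]*vec2[i]; the 1s are the strides of contiguous arrays
    return cblas_ddot(n, vec1, 1, vec2, 1);
}

Good BLAS implementations are heavily vectorized, and some builds are threaded internally, so this also sidesteps the OpenMP tuning entirely.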
