
Multithreaded Program for Sparse Matrices

I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix. In my code I call a vector-vector dot product and a matrix-vector product as subroutines many times to arrive at the final solution. I am trying to parallelise the code using OpenMP (especially the above two subroutines). I also have sequential code in between which I do not intend to parallelise.

My question is how do I handle the threads created when the subroutine is called. Should I put a barrier at the end of every subroutine call?

Also, where should I set the number of threads?

Mat_Vec_Mult(MAT,x0,rm);

#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
    rm[i] = b[i] - rm[i];

#pragma omp barrier

#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
    xm[i] = x0[i];

#pragma omp barrier

double* pm = (double*) malloc(numcols*sizeof(double));

#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
    pm[i] = rm[i];

#pragma omp barrier

scalarProd(rm,rm,numcols);

Thanks

EDIT:

For the scalar dot product, I am using the following piece of code:

double scalarProd(double* vec1, double* vec2, int n){
    double prod = 0.0;
    int chunk = 10;
    int i;
    //double* c = (double*) malloc(n*sizeof(double));

    omp_set_num_threads(4);

    // #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
    #pragma omp parallel
    {
        double pprod = 0.0;
        #pragma omp for
        for(i=0;i<n;i++) {
            pprod += vec1[i]*vec2[i];
        }

        //#pragma omp for reduction (+:prod)
        #pragma omp critical
        for(i=0;i<n;i++) {
            prod += pprod;
        }
    }

    return prod;
}

I have now added the time calculation code in my ConjugateGradient function as below:

start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);

Observed results: time taken by the dot product is 0.000007 s for the sequential version and 0.002110 s for the parallel version.

I am doing a simple compile using the gcc -fopenmp command on Linux on my Intel i7 laptop.

I am currently using a matrix of size n = 5000.

Overall I am getting a huge slowdown, since the same dot product gets called many times until convergence is achieved (around 80k times).

Please suggest some improvements. Any help is much appreciated!

Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallel regions you are using. Every time you split up the work among your threads, there is OpenMP overhead. Try to avoid this whenever possible.
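To illustrate what I mean by fewer parallel regions (a generic sketch with made-up array names, not your actual variables): a single #pragma omp parallel region can contain several #pragma omp for work-sharing loops, so the threads are forked and joined only once:

/* Two independent loops share one parallel region: threads are created at the
   opening brace and joined at the closing brace, instead of once per loop. */
void update(double* a, const double* b, const double* c,
            double* d, const double* e, int n) {
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait  // nowait: the next loop does not depend on this one
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];

        #pragma omp for schedule(static)         // implicit barrier here, then the region joins
        for (int i = 0; i < n; i++)
            d[i] = 2.0*e[i];
    }
}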

So in your case, at the very least, I would try:

Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region

// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
    rm[i] = b[i] - rm[i]; // (1)
    xm[i] = x0[i];        // (2) does not require (1)
    pm[i] = rm[i];        // (3) requires (1) at this i, not (2)
}  
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel

scalarProd(rm,rm,numcols);

Notice how I show that no barriers are actually necessary between your loops anyway.

If the majority of your time were spent in this computation stage, you would surely see considerable improvement. However, I'm reasonably confident that the majority of your time is being spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll be saving is probably minimal.
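If that is where the time goes, those are the routines worth attacking. A rough sketch of a parallel sparse matrix-vector product, assuming a CSR layout with row_ptr / col_idx / val arrays (the question doesn't show how MAT is actually stored):

/* y = A*x for a CSR-stored sparse matrix with nrows rows.
   Each row is independent, so one parallel for over the rows is enough. */
void csr_mat_vec(const int* row_ptr, const int* col_idx, const double* val,
                 const double* x, double* y, int nrows) {
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r+1]; k++)
            sum += val[k]*x[col_idx[k]];
        y[r] = sum;
    }
}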

** EDIT **

And as per your edit, I am seeing a few problems. (1) Always compile with -O3 when you are testing the performance of your algorithm. (2) You won't be able to improve the runtime of something that takes 0.000007 sec to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is inherently a sequential algorithm, but there are certainly research papers detailing parallel CG. (3) Your implementation of the scalar product is not optimal. Indeed, I suspect your implementation of the matrix-vector product is not either. I would personally do the following:

double scalarProd(double* vec1, double* vec2, int n) {
    double prod = 0.0;
    int i;

    // omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
    #pragma omp parallel for private(i) reduction(+:prod)
    for (i = 0; i < n; ++i) {
        prod += vec1[i]*vec2[i];
    }
    return prod;
}
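The reduction(+:prod) clause gives every thread its own private copy of prod and adds the copies together once at the end of the loop, which is what your critical section was trying to do by hand; it also avoids the inner loop in your version that adds pprod n times instead of once.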

(4) There are entire libraries (LAPACK, BLAS, etc.) that have highly optimized matrix-vector, vector-vector, etc. operations. Any linear algebra library must be built upon them. Therefore, I'd suggest looking at using one of those libraries for your two operations before you start reinventing the wheel and trying to implement your own.
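For instance, the dot product maps directly onto cblas_ddot. A minimal sketch, assuming a CBLAS implementation such as OpenBLAS or ATLAS is installed and linked (e.g. with -lopenblas):

#include <cblas.h>

/* Dot product of two contiguous length-n vectors via BLAS (stride 1 for both). */
double scalarProd_blas(const double* vec1, const double* vec2, int n) {
    return cblas_ddot(n, vec1, 1, vec2, 1);
}

For the sparse matrix-vector product itself, dense BLAS is not the right tool; a sparse library (Intel MKL's sparse routines, for example) plays the equivalent role there.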
