
openmp - Parallel Vector Matrix Product

I am computing a vector-matrix product (not exactly a product; a slight variation that computes shortest distances) using the outer-product method. The matrix is sparse and stored in CSR. I am new to parallel programming and am essentially trying to understand the difference between using a parallel for with a critical section for the update versus using tasks and doing a reduction. Which is the better approach, and why?

Note: this function call is enclosed in an omp parallel and an omp single.
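
For reference, the assumed enclosing structure looks roughly like this (an illustrative sketch; the actual surrounding code is not shown in the question):

#pragma omp parallel
{
    #pragma omp single
    {
        // One thread creates the work; any tasks spawned inside are
        // executed by the other threads of the parallel team.
        tReq = matrixVectorHadamard(A, T, tB, tReq);
    }
}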

Using the parallel for approach, I am doing this:

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq) {
    initialize_T(tReq);
    int index;
    #pragma omp parallel for schedule(static, BLOCK_SIZE)
    for(int i=0;i<N;i++) {
        int num_edges = A->row_ptr[i+1] - A->row_ptr[i];
        index = 0;
        if(num_edges) {
            if(T[i] != INFINITY && tB[i] != INFINITY) {
                for(int j=0;j<num_edges;j++) {
                    index = A->col_ind[A->row_ptr[i] + j];
                    #pragma omp critical 
                    tReq[index] = min(tReq[index], T[i]+A->val[A->row_ptr[i]+j]);      
                }
            }
        }
    }
    return tReq;
}

Using the task-based approach with a reduction, this is essentially my idea:

int size = N/BLOCK_SIZE + 1;
double C[size][N];

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq, int size, double C[][N]) {

    for(int i=0;i<size;i++) {
        for(int j=0;j<N;j++) {
            C[i][j] = INFINITY;
            tReq[j] = INFINITY;
        }
    }

    for(int k=0; k<size; k++) {
        #pragma omp task firstprivate(k) depend(inout: C[k])
        {
            int index;
            // The last block may be shorter than BLOCK_SIZE
            int delim = (k == size-1) ? N - k*BLOCK_SIZE : BLOCK_SIZE;
            for(int i=0;i<delim; i++) {
                int num_edges = A->row_ptr[k*BLOCK_SIZE + i+1] - A->row_ptr[k*BLOCK_SIZE + i];
                index = 0;
                if(num_edges) {
                    if(T[k*BLOCK_SIZE + i] != INFINITY && tB[k*BLOCK_SIZE + i] != INFINITY) {           
                        for(int j=0;j<num_edges;j++) {
                            index = A->col_ind[A->row_ptr[k*BLOCK_SIZE + i] + j];
                            C[k][index] = min(C[k][index], T[k*BLOCK_SIZE + i] + A->val[A->row_ptr[k*BLOCK_SIZE + i] + j]);
                        }
                    }       
                }   
            }
        }
    }    

    #pragma omp taskwait

    // Reduce the per-block partial results in C into tReq
    for(int i=0; i<N; i++) {
        double minimum = INFINITY;
        for(int j=0; j<size; j++) {
            if(C[j][i] < minimum) {
                minimum = C[j][i];
            }
        }
        tReq[i] = minimum;
    }

    return tReq;
}

Essentially, are there any downsides to using parallel for compared to the task-based approach?

You are right that you have basically two options: protect the data update, or use thread-specific copies. However, you can do much better for each option:

When going with protected updates, you should protect only what is absolutely necessary, and only when necessary. You can use an initial atomic read to avoid entering the critical region most of the time, similar to a double-checked locking pattern.

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq) {
    initialize_T(tReq);
    #pragma omp parallel for schedule(static, BLOCK_SIZE)
    for(int i=0;i<N;i++) {
        int num_edges = A->row_ptr[i+1] - A->row_ptr[i];
        if (num_edges) {
            if(T[i] != INFINITY && tB[i] != INFINITY) {
                for(int j=0;j<num_edges;j++) {
                    // !WARNING! You MUST declare index within the parallel region
                    // or explicitly declare it private to avoid data races!
                    int index = A->col_ind[A->row_ptr[i] + j];
                    double tmp = T[i] + A->val[A->row_ptr[i]+j];
                    double old;
                    #pragma omp atomic read
                    old = tReq[index];
                    if (tmp < old) {
                        #pragma omp critical
                        {
                            tmp = min(tReq[index], tmp);
                            // Another atomic ensures that the earlier read
                            // outside of critical works correctly
                            #pragma omp atomic write
                            tReq[index] = tmp;
                        }
                    }
                }
            }
        }
    }
    return tReq;
}

Note: Unfortunately, OpenMP/C does not support a direct atomic minimum.
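
As a workaround (not part of the original answer), an atomic minimum on a double can be emulated with a C11 compare-and-swap loop. This sketch assumes the target array is declared with _Atomic elements:

#include <stdatomic.h>

// Hypothetical helper: retries until either the smaller value is installed
// or another thread has already stored something <= value.
static void atomic_min_double(_Atomic double *target, double value) {
    double old = atomic_load(target);
    while (value < old &&
           !atomic_compare_exchange_weak(target, &old, value)) {
        // On failure, old is refreshed with the current contents of target;
        // the loop condition then re-checks whether value is still smaller.
    }
}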

The alternative is a reduction, which is supported by the standard itself, so there is no need to reinvent the work-sharing. You can simply do the following:

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq) {
    initialize_T(tReq);
    // Array-section reductions (tReq[:N]) require OpenMP 4.5 or later
    #pragma omp parallel for schedule(static, BLOCK_SIZE) reduction(min:tReq[:N])
    for(int i=0;i<N;i++) {
        int num_edges = A->row_ptr[i+1] - A->row_ptr[i];
        if (num_edges) {
            if(T[i] != INFINITY && tB[i] != INFINITY) {
                for(int j=0;j<num_edges;j++) {
                    // !WARNING! You MUST declare index within the parallel region
                    // or explicitly declare it private to avoid data races!
                    int index = A->col_ind[A->row_ptr[i] + j];
                    tReq[index] = min(tReq[index], T[i]+A->val[A->row_ptr[i]+j]);
                }
            }
        }
    }
    return tReq;
}

OpenMP will magically create thread-local copies of tReq and merge (reduce) them at the end.
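
Conceptually, the reduction is roughly equivalent to the following hand-written sketch (what a compiler actually emits may differ; malloc/free need <stdlib.h>, and min, N, and BLOCK_SIZE are the same helpers as above): each thread fills a private copy initialized to the identity value for min, and the copies are combined at the end.

#pragma omp parallel
{
    // Private per-thread copy, initialized to the identity for min
    double *local = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) local[i] = INFINITY;

    #pragma omp for schedule(static, BLOCK_SIZE) nowait
    for (int i = 0; i < N; i++) {
        /* ... same loop body as above, but updating local[index] ... */
    }

    // Combine the private copies into the shared result
    #pragma omp critical
    for (int i = 0; i < N; i++)
        tReq[i] = min(tReq[i], local[i]);

    free(local);
}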

Which version is better for you depends on the size of the target array and the rate of writes. If you write often, the reduction will be beneficial because it is not slowed down by critical / atomic / bad caching. If you have a huge target array, or not that many update iterations, the first solution becomes more interesting because of the relative overhead of creating and reducing the thread-local arrays.
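
When in doubt, measure both variants on representative data. A minimal timing skeleton (illustrative; run_variant is a placeholder for either implementation) using the standard omp_get_wtime:

#include <omp.h>
#include <stdio.h>

extern double *run_variant(void);   // placeholder for either version

int main(void) {
    double t0 = omp_get_wtime();
    run_variant();
    double t1 = omp_get_wtime();
    printf("elapsed: %f s\n", t1 - t0);
    return 0;
}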
