
Cache management for sparse matrix multiplication using OpenMP

I am having issues with what I think is some false sharing: I am only getting a small speedup when using the following code compared to the non-parallel version.

matrix1 and matrix2 are sparse matrices stored in a struct with (row, col, val) format.
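The question does not show the struct itself; a hypothetical definition consistent with how the code uses it would be:

```c
/* Hypothetical definition, not from the original post: one non-zero
 * entry of a sparse matrix in coordinate (row, col, val) format. */
struct SparseRow {
    int row;   /* row index of the non-zero entry */
    int col;   /* column index of the non-zero entry */
    float val; /* the stored value */
};
```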

void pMultiply(struct SparseRow *matrix1, struct SparseRow *matrix2, int m1Rows, int m2Rows, struct SparseRow **result) {

*result = malloc(1 * sizeof(struct SparseRow));

int resultNonZeroEntries = 0;

#pragma omp parallel for atomic
for(int i = 0; i < m1Rows; i++)
{
    int curM1Row = matrix1[i].row;
    int curM1Col = matrix1[i].col;
    float curM1Value = matrix1[i].val;

    for(int j = 0; j < m2Rows; j++)
    {

        int curM2Row = matrix2[j].row;
        int curM2Col = matrix2[j].col;
        float curM2Value = matrix2[j].val;

        if(curM1Col == curM2Row)
        {
            *result = realloc(*result, 
            (sizeof(struct SparseRow)*(resultNonZeroEntries+1)));

            (*result)[resultNonZeroEntries].row = curM1Row;
            (*result)[resultNonZeroEntries].col = curM2Col;
            (*result)[resultNonZeroEntries].val += curM1Value*curM2Value;
            resultNonZeroEntries++;
            break;
        }

    }
}
}

Several issues there:

  • As mentioned by Brian Brochers, the #pragma omp atomic clause should be put just before the line that needs to be protected against a race condition.
  • Reallocating memory at each step is likely a performance killer. If the memory cannot be reallocated in place and needs to be copied elsewhere, this will be slow. It is also a source of errors, as the value of the pointer result is modified: other threads keep running while the reallocation takes place and may try to access memory at the "old" address, or several threads may try to reallocate result concurrently. Placing the whole realloc + addition part in a critical section would be safer, but would essentially serialize the function for everything except testing the equality of row/column indices, at the cost of a significant overhead. Threads should instead work on a local buffer, then merge their results at a later stage. Reallocation should be done in blocks of sufficient size.

     // Make sure this will compile even without OpenMP + include memcpy
     #include <string.h>
     #ifdef _OPENMP
     #include <omp.h>
     #define thisThread omp_get_thread_num()
     #define nThreads   omp_get_num_threads()
     #else
     #define thisThread 0
     #define nThreads   1
     #endif

     // shared variables
     int *copyIndex, *threadNonZero;

     #pragma omp parallel
     {
         // each thread initializes a local buffer and local counters
         int localNonZero = 0;
         int allocatedSize = 1024;
         struct SparseRow *localResult = malloc(allocatedSize * sizeof(*localResult));

         // one thread initializes the shared arrays
         #pragma omp single
         {
             threadNonZero = malloc(nThreads * sizeof(int));
             copyIndex = malloc((nThreads + 1) * sizeof(int));
         }

         #pragma omp for
         for (int i = 0; i < m1Rows; i++) {
             /*
              * do the same as your initial code but:
              * realloc an extra 1024 lines each time localNonZero exceeds allocatedSize
              * fill the local buffer and increment the localNonZero counter
              * this is safe, no need to use critical / atomic clauses
              */
         }
         threadNonZero[thisThread] = localNonZero; // publish this thread's count
         #pragma omp barrier

         // Wrap-up: check how many non-zero values each thread found, allocate
         // the output, and compute where each thread will copy its local buffer
         #pragma omp single
         {
             copyIndex[0] = 0;
             for (int i = 0; i < nThreads; i++)
                 copyIndex[i + 1] = threadNonZero[i] + copyIndex[i];
             *result = malloc(copyIndex[nThreads] * sizeof(**result));
         }

         // Copy the results from local to global result
         memcpy(&(*result)[copyIndex[thisThread]], localResult,
                localNonZero * sizeof(*localResult));
         free(localResult);

         // Wait until every thread has finished copying before freeing
         #pragma omp barrier
         #pragma omp single
         {
             free(threadNonZero);
             free(copyIndex);
         }
     } // end parallel
  • Please note that the algorithm will generate duplicates: e.g. if the first matrix contains values at positions (1,10) and (1,20) and the second one at (10,5) and (20,5), there will be two (1,5) lines in the result. At some point, a compaction function that merges duplicate lines will be needed.
