Cache management for sparse matrix multiplication using OpenMP

Question

I am running into what I believe is some false caching: compared to the non-parallel version, I only get a small speedup with the following code.

matrix1 and matrix2 are sparse matrices stored as arrays of structs in (row, col, val) format.

void pMultiply(struct SparseRow *matrix1, struct SparseRow *matrix2, int m1Rows, int m2Rows, struct SparseRow **result) {

*result = malloc(1 * sizeof(struct SparseRow));

int resultNonZeroEntries = 0;

#pragma omp parallel for atomic
for(int i = 0; i < m1Rows; i++)
{
    int curM1Row = matrix1[i].row;
    int curM1Col = matrix1[i].col;
    float curM1Value = matrix1[i].val;

    for(int j = 0; j < m2Rows; j++)
    {

        int curM2Row = matrix2[j].row;
        int curM2Col = matrix2[j].col;
        float curM2Value = matrix2[j].val;

        if(curM1Col == curM2Row)
        {
            *result = realloc(*result, 
            (sizeof(struct SparseRow)*(resultNonZeroEntries+1)));

            (*result)[resultNonZeroEntries].row = curM1Row;
            (*result)[resultNonZeroEntries].col = curM2Col;
            (*result)[resultNonZeroEntries].val += curM1Value*curM2Value;
            resultNonZeroEntries++;
            break;
        }

    }
}
}

Answer 1

There are several issues there:

As Brian Brochers mentioned, the #pragma omp atomic directive should be placed directly before the line that needs to be protected from a race condition.
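To illustrate the placement, here is a minimal sketch. The function count_matches and its arguments are hypothetical and not part of the question's code; it only counts matching index pairs. The atomic directive sits immediately before the single statement that updates the shared counter, rather than on the parallel for line.

```c
/* Hypothetical helper (not from the question): counts the (i, j) pairs
 * for which cols1[i] == rows2[j]. */
int count_matches(const int *cols1, int n1, const int *rows2, int n2)
{
    int matches = 0; /* shared between all threads */

    #pragma omp parallel for
    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++) {
            if (cols1[i] == rows2[j]) {
                /* atomic protects exactly the next statement */
                #pragma omp atomic
                matches++;
            }
        }
    }
    return matches;
}
```

An atomic update on every match still serializes the counter, so for a plain count a reduction(+:matches) clause would usually be the cheaper choice.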

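Reallocating memory at every step is likely to hurt performance. If the memory cannot be reallocated in place and has to be copied elsewhere, this will be slow. It is also a source of errors, because the value of the pointer result is modified. While the reallocation takes place, the other threads keep running and may try to access memory at the "old" address, or several threads may try to reallocate result at the same time. Placing the whole realloc + addition part in a critical section would be safer, but it would essentially serialize the function, apart from the test on row/column index equality, while adding significant overhead. Each thread should work on a local buffer and merge its results at a later stage. Reallocation should be done in blocks of sufficient size.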
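As a rough illustration of reallocating in blocks, the sketch below grows a buffer by a fixed number of entries whenever it is full. The struct RowBuffer, the function buffer_push and the block size CHUNK are assumptions made for this example, and the SparseRow field types are inferred from how the question's code uses them.

```c
#include <stdlib.h>

/* Field types assumed from the question's code. */
struct SparseRow { int row; int col; float val; };

/* Hypothetical block size: entries added per reallocation. */
#define CHUNK 1024

/* Hypothetical growable buffer, intended to be private to one thread. */
struct RowBuffer {
    struct SparseRow *data;
    int count;    /* entries in use */
    int capacity; /* entries allocated */
};

/* Appends one entry, growing the buffer by CHUNK entries when it is full.
 * Returns 0 on success, -1 if realloc fails (the buffer is left unchanged). */
int buffer_push(struct RowBuffer *buf, int row, int col, float val)
{
    if (buf->count == buf->capacity) {
        int newCapacity = buf->capacity + CHUNK;
        struct SparseRow *grown =
            realloc(buf->data, (size_t)newCapacity * sizeof *grown);
        if (grown == NULL)
            return -1;
        buf->data = grown;
        buf->capacity = newCapacity;
    }
    buf->data[buf->count].row = row;
    buf->data[buf->count].col = col;
    buf->data[buf->count].val = val;
    buf->count++;
    return 0;
}
```

A buffer starts out as {NULL, 0, 0}; realloc on a NULL pointer behaves like malloc, so the first push allocates the first block.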

// Make sure this compiles even without OpenMP; string.h provides memcpy
#include <stdlib.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#define thisThread omp_get_thread_num()
#define nThreads omp_get_num_threads()
#else
#define thisThread 0
#define nThreads 1
#endif

// shared variables
int *copyIndex, *threadNonZero;

#pragma omp parallel
{
    // each thread initializes a local buffer and local variables
    int localNonZero = 0;
    int allocatedSize = 1024;
    struct SparseRow *localResult = malloc(allocatedSize * sizeof(struct SparseRow));

    // one thread allocates the shared arrays (implicit barrier at the end of single)
    #pragma omp single
    {
        threadNonZero = malloc(nThreads * sizeof(int));
        copyIndex = malloc((nThreads + 1) * sizeof(int));
    }

    #pragma omp for
    for (int i = 0; i < m1Rows; i++) {
        /*
         * do the same as your initial code but:
         * realloc an extra 1024 entries each time localNonZero reaches allocatedSize
         * fill the local buffer and increment the localNonZero counter
         * this is safe, no need for critical / atomic directives
         */
    }

    // publish this thread's number of non-zero entries in a shared array
    threadNonZero[thisThread] = localNonZero;
    #pragma omp barrier

    // Wrap-up: count the non-zero values of each thread, allocate the output
    // and compute where each thread will copy its local buffer
    #pragma omp single
    {
        copyIndex[0] = 0;
        for (int i = 0; i < nThreads; i++)
            copyIndex[i + 1] = threadNonZero[i] + copyIndex[i];
        *result = malloc(copyIndex[nThreads] * sizeof(struct SparseRow));
    }

    // Copy the results from the local buffer to the global result
    memcpy(&(*result)[copyIndex[thisThread]], localResult, localNonZero * sizeof(struct SparseRow));

    // Free memory; the barrier ensures every thread has finished reading copyIndex
    free(localResult);
    #pragma omp barrier
    #pragma omp single
    {
        free(copyIndex);
        free(threadNonZero);
    }
} // end parallel
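For reference, here is one way the outline above could be filled in as a complete function. This is a sketch under several assumptions rather than a definitive implementation: the name pMultiplyLocal, the return convention and the error handling are invented for the example, the SparseRow field types are inferred from the question, and n1 and n2 are taken to be the numbers of stored entries. Unlike the question's code, it does not break after the first match and it assigns each product instead of using +=.

```c
#include <stdlib.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

struct SparseRow { int row; int col; float val; }; /* types assumed */

/* Hypothetical function: writes all partial products to *result and returns
 * their number, or -1 if an allocation fails. Duplicates are not merged. */
int pMultiplyLocal(const struct SparseRow *m1, int n1,
                   const struct SparseRow *m2, int n2,
                   struct SparseRow **result)
{
    int *copyIndex = NULL; /* shared: start offset of each thread's block */
    int failed = 0;        /* shared: set when any allocation fails */
    int total = 0;
    *result = NULL;

    #pragma omp parallel
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num(), nThreads = omp_get_num_threads();
#else
        int tid = 0, nThreads = 1;
#endif
        struct SparseRow *local = NULL; /* thread-private buffer */
        int localCount = 0, capacity = 0;

        #pragma omp single
        {
            copyIndex = calloc((size_t)nThreads + 1, sizeof *copyIndex);
            if (copyIndex == NULL) failed = 1;
        } /* implicit barrier */

        #pragma omp for
        for (int i = 0; i < n1; i++) {
            for (int j = 0; j < n2; j++) {
                if (m1[i].col != m2[j].row) continue;
                if (localCount == capacity) { /* grow in blocks of 1024 */
                    struct SparseRow *grown =
                        realloc(local, ((size_t)capacity + 1024) * sizeof *grown);
                    if (grown == NULL) {
                        #pragma omp atomic write
                        failed = 1;
                        break;
                    }
                    local = grown;
                    capacity += 1024;
                }
                local[localCount].row = m1[i].row;
                local[localCount].col = m2[j].col;
                local[localCount].val = m1[i].val * m2[j].val;
                localCount++;
            }
        } /* implicit barrier */

        if (!failed) copyIndex[tid + 1] = localCount;
        #pragma omp barrier

        #pragma omp single
        {
            if (!failed) {
                /* in-place prefix sum: copyIndex[t] becomes thread t's offset */
                for (int t = 0; t < nThreads; t++)
                    copyIndex[t + 1] += copyIndex[t];
                total = copyIndex[nThreads];
                *result = malloc((size_t)(total > 0 ? total : 1) * sizeof **result);
                if (*result == NULL) failed = 1;
            }
        } /* implicit barrier */

        if (!failed && localCount > 0)
            memcpy(*result + copyIndex[tid], local, (size_t)localCount * sizeof *local);
        free(local);
    } /* end parallel: all copies are complete here */

    free(copyIndex);
    if (failed) { free(*result); *result = NULL; return -1; }
    return total;
}
```

Freeing copyIndex after the parallel region avoids the extra barrier that would otherwise be needed before releasing it inside the region. Built without OpenMP support, the pragmas are ignored and the function runs serially.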

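Note that this algorithm will produce duplicates: for example, if the first matrix contains values at positions (1,10) and (1,20) and the second one at (10,5) and (20,5), the result will contain two (1,5) entries. At some point, a compaction function that merges duplicate entries will be needed.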
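One possible shape for such a compaction step is sketched below. It sorts the entries by (row, col) with qsort and then sums runs of equal positions in place. The names compareEntries and compactEntries are hypothetical, the SparseRow field types are inferred from the question, and sorting is only one option (a hash map keyed on the position would be another).

```c
#include <stdlib.h>

struct SparseRow { int row; int col; float val; }; /* types assumed */

/* qsort comparator: orders entries by row, then by column. */
static int compareEntries(const void *a, const void *b)
{
    const struct SparseRow *x = a;
    const struct SparseRow *y = b;
    if (x->row != y->row)
        return (x->row > y->row) - (x->row < y->row);
    return (x->col > y->col) - (x->col < y->col);
}

/* Hypothetical helper: merges entries that share (row, col) by summing
 * their values. Works in place and returns the new number of entries. */
int compactEntries(struct SparseRow *entries, int count)
{
    if (count <= 0)
        return 0;

    qsort(entries, (size_t)count, sizeof *entries, compareEntries);

    int last = 0; /* index of the last merged entry */
    for (int i = 1; i < count; i++) {
        if (entries[i].row == entries[last].row &&
            entries[i].col == entries[last].col) {
            entries[last].val += entries[i].val; /* duplicate: accumulate */
        } else {
            entries[++last] = entries[i];        /* new position: keep it */
        }
    }
    return last + 1;
}
```

After compaction the array can be shrunk with realloc to the returned count if memory matters.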

Answer 1 (2018-10-29 14:55:49)
