OpenMP: severe performance loss when calling shared references of dynamic arrays

Question

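I am writing a CFD simulation and want to parallelize a loop of ~10^5 iterations (the lattice size), which is part of a member function. The OpenMP implementation is straightforward: I read entries of shared arrays, do calculations with thread-private quantities, and finally write to a shared array again. In every array I only access the element at the loop index, so I don't expect a race condition and I don't see any reason to flush. Testing the speedup of the code (the parallel part), I find that all but one CPU run at only ~70%. Does anybody have an idea how to improve this?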

void class::funcPar(bool parallel){
#pragma omp parallel
{
    int one, two, three;
    double four, five;

    #pragma omp for
    for(int b=0; b<lenAr; b++){
        one = A[b]+B[b];
        C[b] = one;
        one += D[b];
        E[b] = one;
    }
}

}

Answer 1

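The main points first, then the test code, then a discussion:

10^5 is not much if each item is an int. The overhead of launching multiple threads may outweigh the benefit.
Compiler optimizations can get messed up when OMP is used.
When there are few operations per piece of memory touched, the loop may be memory bound (i.e. the CPU spends its time waiting for the requested memory to be delivered).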
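
For the first point, OpenMP's `if` clause offers one way to skip thread startup for small inputs. The following is a minimal sketch rather than the code measured in this answer: it assumes `std::vector<int>` arrays instead of the asker's class members, and `kParallelThreshold` is a hypothetical cutoff that would have to be tuned by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff below which threading overhead is assumed to dominate.
// The value is a placeholder and must be tuned by measurement.
constexpr std::ptrdiff_t kParallelThreshold = 1000000;

// Computes C = A + B and E = C + D element-wise.
// Without -fopenmp the pragma is ignored and the loop runs serially.
void addArrays(const std::vector<int>& A, const std::vector<int>& B,
               const std::vector<int>& D, std::vector<int>& C,
               std::vector<int>& E)
{
    assert(A.size() == B.size() && A.size() == D.size());
    C.resize(A.size());
    E.resize(A.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(A.size());

    // The parallel region is only active when n reaches the threshold.
    #pragma omp parallel for if(n >= kParallelThreshold)
    for (std::ptrdiff_t b = 0; b < n; b++) {
        const int one = A[b] + B[b];  // declared in the loop, so private per thread
        C[b] = one;
        E[b] = one + D[b];
    }
}
```

Declaring `one` inside the loop body also removes the need for a separate `parallel` block whose only purpose is to hold private variables.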

As promised, here is the code:

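#include <cstdlib>   // srand, atoi
#include <ctime>     // time
#include <iostream>
#include <chrono>
#include <omp.h>     // omp_get_thread_num, omp_get_num_threads
#include <Eigen/Core>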


Eigen::VectorXi A;
Eigen::VectorXi B;
Eigen::VectorXi D;
Eigen::VectorXi C;
Eigen::VectorXi E;
int size;

void regular()
{
    //#pragma omp parallel
    {
        int one;
//      #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

void parallel()
{
#pragma omp parallel
    {
        int one;
        #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

void vectorized()
{
    C = A+B;
    E = C+D;
}

void both()
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int vals = size / nthreads;
        int startInd = tid * vals;
        if(tid == nthreads - 1)
            vals += size - nthreads * vals;
        auto am = Eigen::Map<Eigen::VectorXi>(A.data() + startInd, vals);
        auto bm = Eigen::Map<Eigen::VectorXi>(B.data() + startInd, vals);
        auto cm = Eigen::Map<Eigen::VectorXi>(C.data() + startInd, vals);
        auto dm = Eigen::Map<Eigen::VectorXi>(D.data() + startInd, vals);
        auto em = Eigen::Map<Eigen::VectorXi>(E.data() + startInd, vals);
        cm = am+bm;
        em = cm+dm;
    }
}
int main(int argc, char* argv[])
{
    srand(time(NULL));
    size = 100000;
    int iterations = 10;
    if(argc > 1)
        size = atoi(argv[1]);
    if(argc > 2)
        iterations = atoi(argv[2]);
    std::cout << "Size: " << size << "\n";
    A = Eigen::VectorXi::Random(size);
    B = Eigen::VectorXi::Random(size);
    D = Eigen::VectorXi::Random(size);
    C = Eigen::VectorXi::Zero(size);
    E = Eigen::VectorXi::Zero(size);

    auto startReg = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        regular();
    auto endReg = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPar = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        parallel();
    auto endPar = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startVec = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        vectorized();
    auto endVec = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPVc = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        both();
    auto endPVc = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    std::cout << "Timings:\n";
    std::cout << "Regular:    " << std::chrono::duration_cast<std::chrono::microseconds>(endReg - startReg).count() / iterations << "\n";
    std::cout << "Parallel:   " << std::chrono::duration_cast<std::chrono::microseconds>(endPar - startPar).count() / iterations << "\n";
    std::cout << "Vectorized: " << std::chrono::duration_cast<std::chrono::microseconds>(endVec - startVec).count() / iterations << "\n";
    std::cout << "Both      : " << std::chrono::duration_cast<std::chrono::microseconds>(endPVc - startPVc).count() / iterations << "\n";

    return 0;
}
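
The index arithmetic that both() uses to hand each thread a contiguous slice can be checked in isolation. This is a sketch of that arithmetic only; `chunkFor` is a hypothetical helper name and is not part of the benchmark above.

```cpp
#include <cassert>
#include <utility>

// Returns {start, count} for thread `tid` out of `nthreads` when `size`
// elements are split into contiguous chunks. As in both(), every thread
// gets size / nthreads elements and the last thread also takes the remainder.
std::pair<int, int> chunkFor(int tid, int nthreads, int size)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads && size >= 0);
    int vals = size / nthreads;
    const int startInd = tid * vals;
    if (tid == nthreads - 1)
        vals += size - nthreads * vals;  // remainder goes to the last thread
    return std::make_pair(startInd, vals);
}
```

With 10 elements and 4 threads, threads 0 to 2 each get 2 elements and thread 3 gets the remaining 4, starting at index 6. The last thread can therefore carry noticeably more work when the size is small relative to the thread count.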

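I used Eigen as a vector library to help prove a point about optimizations, which I will get to shortly. The code was compiled in four different optimization modes: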

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -IC:\usr\include source.cpp -o a.exe

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -O1 -IC:\usr\include source.cpp -o aO1.exe

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -O2 -IC:\usr\include source.cpp -o aO2.exe

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -O3 -IC:\usr\include source.cpp -o aO3.exe

using g++ (x86_64-posix-sjlj, built by the strawberryperl.com project) 4.8.3 under Windows.

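Discussion

We'll start by looking at 10^5 vs 10^6 elements, averaged over 100 runs, without optimizations. All timings are in microseconds per call.

10^5 (no optimizations):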

Timings:
Regular:    9300
Parallel:   2620
Vectorized: 2170
Both      : 910

10^6 (no optimizations):

Timings:
Regular:    93535
Parallel:   27191
Vectorized: 21831
Both      : 8600

In terms of speedup, vectorization (SIMD) trumps OMP. Combining the two gives even better times.
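
Combining threads and SIMD does not strictly require Eigen. Assuming a compiler that implements OpenMP 4.0 or later (which, as far as I know, means something newer than the g++ 4.8.3 used for these measurements), the same loop could be sketched with the combined `parallel for simd` construct. This variant was not benchmarked here, and `addArraysSimd` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Sketch: threads split the index range and each thread's chunk is
// vectorized. Needs OpenMP 4.0+; without -fopenmp the pragma is ignored
// and the loop runs serially.
void addArraysSimd(const std::vector<int>& A, const std::vector<int>& B,
                   const std::vector<int>& D, std::vector<int>& C,
                   std::vector<int>& E)
{
    assert(A.size() == B.size() && A.size() == D.size());
    C.resize(A.size());
    E.resize(A.size());
    const int n = static_cast<int>(A.size());

    #pragma omp parallel for simd
    for (int b = 0; b < n; b++) {
        const int one = A[b] + B[b];
        C[b] = one;
        E[b] = one + D[b];
    }
}
```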

Moving to -O1:

10^5:

Timings:
Regular:    780
Parallel:   300
Vectorized: 80
Both      : 80

10^6:

Timings:
Regular:    7340
Parallel:   2220
Vectorized: 1830
Both      : 1670

Same as without optimizations, except that the timings are much better.

Skipping to -O3:

10^5:

Timings:
Regular:    380
Parallel:   130
Vectorized: 80
Both      : 70

10^6:

Timings:
Regular:    3080
Parallel:   1750
Vectorized: 1810
Both      : 1680

For 10^5, the vectorized versions still win. For 10^6, however, the OMP loop gives faster timings than the vectorized version.
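
At 10^6 with -O3, the three non-serial variants all land between 1680 and 1810 microseconds, which would be consistent with the loop being memory bound, although the measurements alone do not prove it. A rough way to check is to estimate the effective memory bandwidth. The sketch below assumes 4-byte ints and counts only the three arrays read (A, B, D) and the two written (C, E) in the fused loop; `effectiveBandwidthGBs` is a hypothetical helper.

```cpp
#include <cassert>

// Rough effective memory bandwidth in GB/s for one pass over the arrays.
// Assumes 4-byte ints and 5 array traversals per element (read A, B, D;
// write C, E). Cache effects and write-allocate traffic are ignored.
double effectiveBandwidthGBs(double elements, double microseconds)
{
    assert(microseconds > 0.0);
    const double bytesPerElement = 5 * 4.0;  // 5 arrays x 4 bytes each
    const double bytes = elements * bytesPerElement;
    const double seconds = microseconds * 1e-6;
    return bytes / seconds / 1e9;
}
```

For the "Both" run, `effectiveBandwidthGBs(1e6, 1680)` comes out at roughly 12 GB/s. If that is close to what the machine's memory can sustain, adding threads or wider SIMD would not be expected to help much at this size.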

In all tests, OMP gave a speedup of roughly x2-x4.
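
The speedup figures are simply the serial time divided by the variant's time, taken from the tables above. A trivial sketch, with `speedup` as a hypothetical helper name:

```cpp
#include <cassert>

// Speedup of a variant relative to the serial baseline.
// Both times must be in the same unit.
double speedup(double regularTime, double variantTime)
{
    assert(variantTime > 0.0);
    return regularTime / variantTime;
}
```

For the unoptimized 10^5 run, `speedup(9300, 2620)` is about 3.5. The smallest gain in the tables is 10^6 at -O3, where `speedup(3080, 1750)` is about 1.8.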

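Note: I originally ran the tests while another low-priority process was using all the cores. For some reason, this mainly affected the parallel tests and not the others. Make sure you time things correctly.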
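
One way to make timings less sensitive to this kind of interference is to do a warm-up call and report the fastest of several runs instead of the mean. This is a sketch of that idea, not the method used for the numbers above; `bestTimeMicros` is a hypothetical helper.

```cpp
#include <chrono>
#include <functional>

// Calls `fn` once as a warm-up, then `repeats` more times, and returns the
// fastest timed run in microseconds. Returns -1 if `repeats` is not positive.
long long bestTimeMicros(const std::function<void()>& fn, int repeats)
{
    fn();  // warm-up: fills caches and lets any OpenMP runtime create its threads
    long long best = -1;
    for (int i = 0; i < repeats; i++) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const long long us =
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        if (best < 0 || us < best)
            best = us;
    }
    return best;
}
```

A call such as `bestTimeMicros(parallel, 100)` would time the parallel() function from the test code. `std::chrono::steady_clock` is used because it is guaranteed to be monotonic, which `high_resolution_clock` is not.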

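Conclusion

Your minimal code example does not behave as claimed. Issues such as memory access patterns can come up with more complex data. Add enough details to accurately reproduce your problem (MCVE) to get better help.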

Answer 1 (1, accepted, 2015-06-17 11:45:16)
