
Severe performance loss alternating number of OpenMP parallel threads

The following code alternates the number of threads used by two consecutive OpenMP parallel for loops.

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <limits>
#include <omp.h>

// Shared buffer read and written by the parallel loops.
std::vector<float> v;

// Runs the same parallel for twice: first with threadsFirst threads, then with
// threadsSecond threads, resizing the OpenMP team between the two regions.
float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int> nthreads{threadsFirst, threadsSecond};
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread); // request a (possibly different) team size
#pragma omp parallel for
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                if (v[i] > 5) {
                    v[i] * 0.002;  // result discarded; the branch only adds work
                }
                v[i] *= 1.1F * (i + 1);
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}

int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    for (int threadAdd = 0; threadAdd <= 1; ++threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; j += 1) {
            float minT = std::numeric_limits<float>::max();
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (int i = 0; i <= iters; ++i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j + threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001;
                if (i > 20) { // skip the first iterations as warm-up
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT += ms;
                    samples++;
                }
            }
            std::cout << "Run parallel fors with " <<j << " and " << j + threadAdd << " threads -- Min: "
                << minT << "ms   Max: " << maxT << "ms   Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}

This is the output when compiled and run with Visual Studio 2019 in Release mode:

Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms   Max: 2.47ms   Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms   Max: 1.234ms   Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms   Max: 0.759ms   Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms   Max: 0.578ms   Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms   Max: 0.676ms   Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms   Max: 0.999ms   Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms   Max: 0.786ms   Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms   Max: 0.948ms   Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms   Max: 0.504ms   Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms   Max: 0.702ms   Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms   Max: 1.104ms   Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms   Max: 1.01ms   Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms   Max: 3.577ms   Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms   Max: 0.792ms   Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms   Max: 0.723ms   Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms   Max: 0.858ms   Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms   Max: 3.501ms   Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms   Max: 4.809ms   Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms   Max: 14.394ms   Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms   Max: 8.572ms   Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms   Max: 15.739ms   Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms   Max: 16.787ms   Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms   Max: 39.971ms   Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms   Max: 45.473ms   Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms   Max: 31.844ms   Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms   Max: 21.199ms   Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms   Max: 21.608ms   Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms   Max: 18.779ms   Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms   Max: 26.991ms   Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms   Max: 27.701ms   Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms   Max: 26.351ms   Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms   Max: 40.517ms   Avg: 22.0216ms

In the first batch, several runs are made with an increasing number of threads, alternating parallel fors that use the same number of threads. This batch shows the expected behavior: performance improves as the thread count grows.

Then a second batch runs the same code, but alternates parallel fors where one loop uses one more thread than the other. This second batch suffers a severe performance loss, increasing the computation time by roughly 50 to 100 times.

Compiling and running with gcc on Ubuntu gives the expected behavior, with both batches performing similarly.

So the question is: what causes this huge performance loss when using Visual Studio?

Given the experiments described in the question comments, and for lack of a better explanation, this seems to be a bug in the VS OpenMP runtime.
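
If the cost really comes from the runtime rebuilding its thread team every time the requested size changes, one way to test that hypothesis is to keep the team at a fixed size and let only the first nthread threads do work, partitioning the loop by hand instead of using the worksharing directive. The sketch below is a hypothetical workaround along those lines; the function name fooFixedTeam, the manual partitioning, and the assumption that a constant team size avoids the slowdown are additions here, not something taken from the question or verified against the VS runtime.

// Hypothetical workaround sketch (assumption: keeping the OpenMP team size
// constant avoids the cost of resizing it between regions). Only the first
// `nthread` threads do work; the loop is partitioned by hand instead of with
// `#pragma omp parallel for`. The dummy branch from the original benchmark is
// omitted; only the main update of v[i] is kept.
#include <vector>
#include <omp.h>

float fooFixedTeam(std::vector<float>& v, const int tasks, const int perTaskComputation,
                   int threadsFirst, int threadsSecond)
{
    float total = 0;
    const int teamSize = omp_get_max_threads(); // team size stays the same for both loops
    std::vector<int> nthreads{threadsFirst, threadsSecond};
    for (int nthread : nthreads) {
#pragma omp parallel num_threads(teamSize)
        {
            const int tid = omp_get_thread_num();
            if (tid < nthread) { // only nthread of the teamSize threads do work
                // static hand-rolled partition of [0, tasks) across nthread threads
                const int begin = tasks * tid / nthread;
                const int end = tasks * (tid + 1) / nthread;
                for (int i = begin; i < end; ++i) {
                    for (int n = 0; n < perTaskComputation; ++n) {
                        v[i] *= 1.1F * (i + 1);
                    }
                }
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}

Whether this actually restores the first batch's timings on the MSVC runtime would have to be measured with the same harness as above; on gcc it should behave much like the original code.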
