
Severe performance loss alternating number of OpenMP parallel threads

The following code runs two parallel fors in alternation and sets the number of threads separately for each of them.

#include <algorithm> // std::min, std::max
#include <iostream>
#include <chrono>
#include <vector>
#include <omp.h>

std::vector<float> v;

float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int>nthreads{threadsFirst,threadsSecond};
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread);
#pragma omp parallel for
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                if (v[i] > 5) {
                    v[i] * 0.002; // result is discarded, so this statement has no effect
                }
                v[i] *= 1.1F * (i + 1);
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}

int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    for (int threadAdd = 0; threadAdd <= 1; ++threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; j += 1) {
            float minT = 1e30F; // large sentinel that fits in a float
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (int i = 0; i <= iters; ++i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j + threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001;
                if (i > 20) {
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT += ms;
                    samples++;
                }
            }
            std::cout << "Run parallel fors with " <<j << " and " << j + threadAdd << " threads -- Min: "
                << minT << "ms   Max: " << maxT << "ms   Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}

When compiled and run with Visual Studio 2019 in Release mode, this is the output:

Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms   Max: 2.47ms   Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms   Max: 1.234ms   Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms   Max: 0.759ms   Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms   Max: 0.578ms   Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms   Max: 0.676ms   Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms   Max: 0.999ms   Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms   Max: 0.786ms   Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms   Max: 0.948ms   Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms   Max: 0.504ms   Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms   Max: 0.702ms   Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms   Max: 1.104ms   Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms   Max: 1.01ms   Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms   Max: 3.577ms   Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms   Max: 0.792ms   Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms   Max: 0.723ms   Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms   Max: 0.858ms   Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms   Max: 3.501ms   Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms   Max: 4.809ms   Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms   Max: 14.394ms   Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms   Max: 8.572ms   Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms   Max: 15.739ms   Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms   Max: 16.787ms   Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms   Max: 39.971ms   Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms   Max: 45.473ms   Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms   Max: 31.844ms   Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms   Max: 21.199ms   Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms   Max: 21.608ms   Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms   Max: 18.779ms   Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms   Max: 26.991ms   Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms   Max: 27.701ms   Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms   Max: 26.351ms   Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms   Max: 40.517ms   Avg: 22.0216ms

In the first batch, several runs are done with an increasing number of threads, and both alternating parallel fors use the same number of threads. This batch behaves as expected: performance improves as the number of threads increases.

Then a second batch is done, running the same code, but with one of the alternating parallel fors using one more thread than the other. This second batch shows a severe performance loss, with computation time increasing by a factor of up to 50~100x.

Compiling and running with gcc on Ubuntu gives the expected behavior, with both batches performing similarly.

So, the question is: what is causing this huge performance loss when using Visual Studio?

Based on the experiments described in the comments to the question, and for lack of a better explanation, this seems to be a bug in the VS runtime.
