
Severe performance loss alternating number of OpenMP parallel threads

The following code alternates the number of OpenMP threads used by consecutive parallel for loops.

#include <iostream>
#include <chrono>
#include <algorithm> // std::min, std::max
#include <limits>    // std::numeric_limits
#include <vector>
#include <omp.h>

std::vector<float> v;

float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int> nthreads{ threadsFirst, threadsSecond };
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread); // alternate the requested team size between the two regions
#pragma omp parallel for
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                if (v[i] > 5) {
                    v[i] * 0.002; // result intentionally discarded; the branch only adds artificial load
                }
                v[i] *= 1.1F * (i + 1);
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}

int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    for (int threadAdd = 0; threadAdd <= 1; ++threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; j += 1) {
            float minT = std::numeric_limits<float>::max(); // 1e100 does not fit in a float
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (int i = 0; i <= iters; ++i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j + threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001;
                if (i > 20) { // treat the first iterations as warm-up and exclude them from the statistics
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT += ms;
                    samples++;
                }
            }
            std::cout << "Run parallel fors with " <<j << " and " << j + threadAdd << " threads -- Min: "
                << minT << "ms   Max: " << maxT << "ms   Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}

When compiled and run with Visual Studio 2019 in Release mode, this is the output:

Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms   Max: 2.47ms   Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms   Max: 1.234ms   Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms   Max: 0.759ms   Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms   Max: 0.578ms   Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms   Max: 0.676ms   Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms   Max: 0.999ms   Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms   Max: 0.786ms   Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms   Max: 0.948ms   Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms   Max: 0.504ms   Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms   Max: 0.702ms   Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms   Max: 1.104ms   Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms   Max: 1.01ms   Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms   Max: 3.577ms   Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms   Max: 0.792ms   Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms   Max: 0.723ms   Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms   Max: 0.858ms   Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms   Max: 3.501ms   Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms   Max: 4.809ms   Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms   Max: 14.394ms   Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms   Max: 8.572ms   Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms   Max: 15.739ms   Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms   Max: 16.787ms   Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms   Max: 39.971ms   Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms   Max: 45.473ms   Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms   Max: 31.844ms   Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms   Max: 21.199ms   Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms   Max: 21.608ms   Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms   Max: 18.779ms   Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms   Max: 26.991ms   Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms   Max: 27.701ms   Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms   Max: 26.351ms   Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms   Max: 40.517ms   Avg: 22.0216ms

In the first batch, several runs are performed with an increasing number of threads, alternating parallel fors that use the same thread count. This batch shows the expected behavior: performance improves as the number of threads grows.

Then a second batch runs the same code, but alternating parallel fors where one uses one more thread than the other. This second batch incurs a severe performance penalty, increasing computation time by a factor of roughly 50 to 100.
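
One way to narrow down where the time goes (a diagnostic sketch of my own, not part of the original question) is to time an empty parallel region right after changing the requested team size. If the slowdown comes from the runtime tearing down and recreating its thread pool, it shows up here even with no workload at all:

#include <chrono>
#include <iostream>
#include <omp.h>

int main()
{
    const int sizes[] = { 4, 5 }; // any two different team sizes reproduce the alternation
    for (int rep = 0; rep < 10; ++rep) {
        for (int nthread : sizes) {
            omp_set_num_threads(nthread);
            auto start = std::chrono::steady_clock::now();
#pragma omp parallel
            {
                // intentionally empty: only region startup is measured
            }
            auto end = std::chrono::steady_clock::now();
            std::cout << nthread << " threads: "
                << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
                << " us\n";
        }
    }
}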

Compiling and running with gcc on Ubuntu yields the expected behavior, with similar performance in both batches.

So the question is: what is causing this huge performance loss when using Visual Studio?

Following the experiments discussed in the question's comments, and for lack of a better explanation, this appears to be a bug in the Visual Studio OpenMP runtime.
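
If the cause really is the runtime recreating its worker threads whenever the team size changes, one possible workaround (a sketch under that assumption, not something from the original answer) is to keep the OpenMP team size constant and partition the work manually among only the first nthread team members:

#include <algorithm> // std::min
#include <omp.h>
#include <vector>

// Hypothetical helper: same per-element work as in the question, but the
// team size is always `maxThreads`, so alternating between e.g. 4 and 5
// active workers never forces the runtime to resize its thread pool.
void fooFixedTeam(std::vector<float>& v, int tasks, int perTaskComputation,
                  int nthread, int maxThreads)
{
#pragma omp parallel num_threads(maxThreads)
    {
        int tid = omp_get_thread_num();
        if (tid < nthread) {
            // static block partition of [0, tasks) among nthread workers
            int chunk = (tasks + nthread - 1) / nthread;
            int begin = tid * chunk;
            int end = std::min(begin + chunk, tasks);
            for (int i = begin; i < end; ++i) {
                for (int n = 0; n < perTaskComputation; ++n) {
                    v[i] *= 1.1F * (i + 1);
                }
            }
        }
    }
}

This trades OpenMP's own loop scheduling for a manual split, so it only makes sense as a workaround; on gcc/libgomp the original code already behaves as expected.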
