Severe performance loss when alternating the number of OpenMP parallel threads
The following code changes the number of threads used between two consecutive parallel fors.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <limits>
#include <vector>
#include <omp.h>

std::vector<float> v;

// Runs the same parallel loop twice: first with threadsFirst threads, then with threadsSecond.
float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int> nthreads{threadsFirst, threadsSecond};
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread);
        #pragma omp parallel for
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                if (v[i] > 5) {
                    v[i] * 0.002;  // result is discarded, so this statement has no effect
                }
                v[i] *= 1.1F * (i + 1);
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}

int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    // threadAdd == 0: both loops use j threads; threadAdd == 1: the second loop uses j + 1.
    for (int threadAdd = 0; threadAdd <= 1; ++threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; j += 1) {
            float minT = std::numeric_limits<float>::max();
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (int i = 0; i <= iters; ++i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j + threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001F;
                // Skip the first 21 iterations as warm-up.
                if (i > 20) {
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT += ms;
                    samples++;
                }
            }
            std::cout << "Run parallel fors with " << j << " and " << j + threadAdd << " threads -- Min: "
                      << minT << "ms Max: " << maxT << "ms Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}
When compiled and run with Visual Studio 2019 in Release mode, this is the output:
Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms Max: 2.47ms Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms Max: 1.234ms Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms Max: 0.759ms Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms Max: 0.578ms Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms Max: 0.676ms Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms Max: 0.999ms Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms Max: 0.786ms Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms Max: 0.948ms Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms Max: 0.504ms Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms Max: 0.702ms Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms Max: 1.104ms Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms Max: 1.01ms Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms Max: 3.577ms Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms Max: 0.792ms Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms Max: 0.723ms Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms Max: 0.858ms Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms Max: 3.501ms Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms Max: 4.809ms Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms Max: 14.394ms Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms Max: 8.572ms Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms Max: 15.739ms Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms Max: 16.787ms Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms Max: 39.971ms Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms Max: 45.473ms Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms Max: 31.844ms Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms Max: 21.199ms Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms Max: 21.608ms Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms Max: 18.779ms Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms Max: 26.991ms Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms Max: 27.701ms Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms Max: 26.351ms Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms Max: 40.517ms Avg: 22.0216ms
In the first batch, several runs are done with an increasing number of threads, with both parallel fors using the same number of threads. This batch behaves as expected, with performance improving as the number of threads increases.
Then a second batch is done, running the same code but with one of the two parallel fors using one more thread than the other. This second batch shows a severe performance loss, increasing the computation time by a factor of up to 50~100x.
Compiling and running with gcc on Ubuntu gives the expected behavior, with both batches performing similarly.
So, the question is, what is causing this huge performance loss when using Visual Studio?
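One way to narrow this down might be to time parallel regions that do almost no work, so that whatever cost remains is mostly the runtime's fork/join overhead. The sketch below is only an illustration: `emptyRegionPairMs` is a hypothetical helper, not part of the code above, and it uses the standard `num_threads` clause instead of `omp_set_num_threads` so that it builds without `omp.h`. Whether both ways of setting the team size take the same path inside the Visual Studio runtime is an assumption.

```cpp
#include <chrono>

// Hypothetical helper: average time in ms of one pair of near-empty parallel
// regions, the first with threadsA threads and the second with threadsB.
// Without OpenMP enabled the pragmas are ignored and the code runs serially.
double emptyRegionPairMs(int threadsA, int threadsB, int reps)
{
    if (reps <= 0) {
        return 0.0;
    }
    int sink = 0;  // touched inside each region so the region is not trivially empty
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        #pragma omp parallel num_threads(threadsA)
        {
            #pragma omp atomic
            ++sink;
        }
        #pragma omp parallel num_threads(threadsB)
        {
            #pragma omp atomic
            ++sink;
        }
    }
    auto end = std::chrono::steady_clock::now();
    (void)sink;
    std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / reps;
}
```

Comparing `emptyRegionPairMs(8, 8, 1000)` against `emptyRegionPairMs(8, 9, 1000)` would show whether the gap appears even without the loop body, which would point at team resizing rather than at the computation itself.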
Based on the experiments described in the comments on the question, and for lack of a better explanation, this seems to be a bug in the VS runtime.
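If that is the case, a possible workaround is to avoid changing the team size between consecutive loops, since the first batch in the question (same thread count for both loops) does not show the slowdown. The following is a minimal sketch under that assumption and has not been verified against the Visual Studio runtime. `fooFixedTeam` is a hypothetical variant of `foo` that takes the data as a parameter, drops the statement that has no effect, and uses the larger of the two requested counts for both loops through the standard `num_threads` clause.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical variant of foo(): both loops run with one fixed team size,
// so the runtime is never asked to grow or shrink the team between loops.
float fooFixedTeam(std::vector<float>& data, int perTaskComputation,
                   int threadsFirst, int threadsSecond)
{
    // Assumption: using the larger of the two counts for both loops is acceptable.
    const int team = std::max(1, std::max(threadsFirst, threadsSecond));
    const int tasks = static_cast<int>(data.size());
    float total = 0;
    for (int pass = 0; pass < 2; ++pass) {
        // The pragma is ignored when OpenMP is not enabled.
        #pragma omp parallel for num_threads(team)
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                data[i] *= 1.1F * (i + 1);
            }
        }
        // Serial reduction after each pass, as in the original foo().
        for (float a : data) {
            total += a;
        }
    }
    return total;
}
```

In the benchmark from the question, this would be called as `fooFixedTeam(v, load, j, j + threadAdd)`. If the loops really need different degrees of parallelism, this trades some idle threads for a stable team size.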