STL parallel execution vs. OpenMP performance

I'm starting a new project and would like to parallelize some computations. I've used OpenMP in the past, but I'm aware that many STL algorithms can now be parallelized directly. Since the two approaches follow different paradigms (e.g. raw loops versus iterators and anonymous functions), I'd like to choose one up front.

Which is generally faster?

To test this, I benchmarked the following C++20 code:

#include <algorithm>
#include <iostream>
#include <vector>
#include <numeric>
#include <cmath>
#include <chrono>
#include <execution>

template <class ExecutionPolicy>
int test_stl(const std::vector<double>& X, ExecutionPolicy policy) {
    std::vector<double> Y(X.size());
    const auto start = std::chrono::high_resolution_clock::now();
    std::transform(policy, X.cbegin(), X.cend(), Y.begin(), [](double x){
        volatile double y = std::sin(x);
        return y;
    });
    const auto stop = std::chrono::high_resolution_clock::now();
    auto diff = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return diff.count();
}

int test_openmp(const std::vector<double>& X) {
    std::vector<double> Y(X.size());
    const auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for (size_t i = 0; i < X.size(); ++i) {
        volatile double y = std::sin(X[i]);
        Y[i] = y;
    }
    const auto stop = std::chrono::high_resolution_clock::now();
    auto diff = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return diff.count();
}

int main() {
    const size_t N = 10000000;
    std::vector<double> data(N);
    std::iota(data.begin(), data.end(), 1);
    std::cout << "OpenMP:        " << test_openmp(data) << std::endl;
    std::cout << "STL seq:       " << test_stl(data, std::execution::seq) << std::endl;
    std::cout << "STL par:       " << test_stl(data, std::execution::par) << std::endl;
    std::cout << "STL par_unseq: " << test_stl(data, std::execution::par_unseq) << std::endl;
    std::cout << "STL unseq:     " << test_stl(data, std::execution::unseq) << std::endl;
    return 0;
}

Compiled on my machine with GCC 10.3.0 (MSYS2), the OpenMP code consistently runs ~10 times faster (times in microseconds):

OpenMP:        54719
STL seq:       628451
STL par:       638454
STL par_unseq: 494143
STL unseq:     506647
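
Each variant above is timed exactly once, so a measurement may include one-time costs such as worker threads being started. A minimal sketch of a repeat-and-take-the-best timing helper follows; the name `best_of_microseconds` and its parameters are hypothetical and not part of the original benchmark.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of the original benchmark): calls `work`
// once untimed as a warm-up, then `repeats` more times, and returns the
// fastest timed run in microseconds. With repeats == 0 it returns the
// maximum int64_t value, since nothing was timed.
template <class Work>
std::int64_t best_of_microseconds(int repeats, Work work) {
    using clock = std::chrono::steady_clock;
    work();  // warm-up run, deliberately not timed
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const auto us =
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        best = std::min<std::int64_t>(best, us);
    }
    return best;
}
```

It could be called with a lambda that wraps the body of `test_openmp` or the `std::transform` call, so that every variant is measured the same way.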

Is OpenMP generally faster (as a rule of thumb) for functionally equivalent code? Given the current state of development, might this change in the future?

Edit:

I'm building this benchmark using the following CMakeLists.txt:

cmake_minimum_required(VERSION 3.19)

add_executable(TEST main.cpp)
target_compile_features(TEST PRIVATE cxx_std_20)
set_target_properties(TEST PROPERTIES CXX_EXTENSIONS OFF)

find_package(OpenMP)
target_link_libraries(TEST PUBLIC OpenMP::OpenMP_CXX)

I then build and run it with these Windows PowerShell commands:

cmake .. -G "MinGW Makefiles"
mingw32-make
./TEST.exe

I tested your code with MSVC on my Windows 11 machine (changing only size_t to int in the OpenMP implementation), because I found it very strange that almost all the STL variants showed the same performance. The seq execution policy does no parallelism at all, yet in your test it performed very close to the other execution policies.
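
The size_t-to-int change mentioned above is presumably needed because MSVC's default /openmp mode implements OpenMP 2.0, which requires a signed integral loop index. A minimal sketch of that variant follows; the function name `sin_all_signed_index` is hypothetical, the timing and the volatile temporary from the original are left out, and the cast assumes the vector size fits in an int.

```cpp
#include <cmath>
#include <vector>

// Same loop as test_openmp, but with a signed index so that compilers
// limited to OpenMP 2.0 (such as MSVC with /openmp) accept it.
// If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
std::vector<double> sin_all_signed_index(const std::vector<double>& X) {
    std::vector<double> Y(X.size());
    const int n = static_cast<int>(X.size());  // assumption: X.size() fits in int
#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        Y[i] = std::sin(X[i]);
    }
    return Y;
}
```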

So I compiled it with this:

cl.exe /Zi /EHsc /nologo /std:c++latest /O2 /openmp /Fe: .\openmp-vs-exec-policy.exe .\openmp-vs-exec-policy.cpp

My results were:

.\openmp-vs-exec-policy.exe
OpenMP:        14089
STL seq:       99299
STL par:       10659
STL par_unseq: 9811
STL unseq:     68051

In other tests of mine, the STL almost always performs better than OpenMP.

So my guess is that the STL you are using is not very well implemented, or that GCC for Windows does not do a good job of compiling it.

[EDIT]

I looked into g++'s implementation of STL parallelism and found that it only works if libtbb is installed alongside it.

Just as OpenMP only works if you compile with -fopenmp (without that flag, everything falls back to sequential execution), the STL execution policies fall back to sequential execution if libtbb is not installed, and it does not ship with g++ by default.
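
One way to check which of these fallbacks applies to a given build is to inspect preprocessor macros. The sketch below is only that: `_OPENMP` is the standard OpenMP macro, but `_PSTL_PAR_BACKEND_TBB` and `_PSTL_PAR_BACKEND_SERIAL` are assumed to be internal libstdc++ configuration macros (set when compiling as C++17 or later), not a documented interface, and the function names are hypothetical.

```cpp
#include <string>
#include <vector>  // any standard header brings in the library's configuration macros

// Reports which backend libstdc++ selected for the parallel algorithms.
// Assumption: _PSTL_PAR_BACKEND_TBB / _PSTL_PAR_BACKEND_SERIAL are libstdc++
// internals that are only set in C++17 mode or later. Other standard
// libraries define neither, in which case this returns "unknown".
std::string parallel_stl_backend() {
#if defined(_PSTL_PAR_BACKEND_TBB)
    return "tbb";
#elif defined(_PSTL_PAR_BACKEND_SERIAL)
    return "serial";
#else
    return "unknown";
#endif
}

// _OPENMP is defined by the compiler only when OpenMP support is switched on
// (for example with -fopenmp for GCC or /openmp for MSVC).
bool openmp_enabled() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}
```

If `parallel_stl_backend()` returns "serial" on GCC, that would be consistent with the near-identical STL timings in the question. When it returns "tbb", the program typically also has to be linked against the TBB library.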
