Why is std::for_each faster than __gnu_parallel::for_each

Question

I'm trying to understand why std::for_each, which runs on a single thread, is ~3 times faster than __gnu_parallel::for_each in the example below:

Time =0.478101 milliseconds

vs

Time =0.166421 milliseconds

Here is the code I'm using to benchmark:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
#include <parallel/algorithm>

// The struct I'm using for timing
struct   TimerAvrg
{
    std::vector<double> times;
    size_t curr=0,n;
    std::chrono::high_resolution_clock::time_point begin,end;
    TimerAvrg(int _n=30)
    {
        n=_n;
        times.reserve(n);
    }

    inline void start()
    {
        begin= std::chrono::high_resolution_clock::now();
    }

    inline void stop()
    {
        end= std::chrono::high_resolution_clock::now();
        double duration=double(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count())*1e-6;
        if ( times.size()<n)
            times.push_back(duration);
        else
        {
            times[curr]=duration;
            curr++;
            if (curr>=times.size()) curr=0;
        }
    }

    double getAvrg()
    {
        double sum=0;
        for(auto t:times)
            sum+=t;
        return sum/double(times.size());
    }
};



int main( int argc, char** argv )
{
    float sum=0;
    for(int alpha = 0; alpha <5000; alpha++)
    {
        TimerAvrg Fps;
        Fps.start();
        std::vector<float> v(1000000);
        std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
        Fps.stop();
        sum = sum + Fps.getAvrg()*1000;
    }

    std::cout << "\rTime =" << sum/5000<< " milliseconds" << std::endl;
    return 0;
}

This is my configuration:

gcc version 7.3.0 (Ubuntu 7.3.0-21ubuntu1~16.04) 

Intel® Core™ i7-7600U CPU @ 2.80GHz × 4

I use htop to check whether the program runs on a single thread or on multiple threads.

g++ -std=c++17 -fomit-frame-pointer -Ofast -march=native -ffast-math -mmmx -msse -msse2 -msse3 -DNDEBUG -Wall -fopenmp benchmark.cpp -o benchmark

The same code doesn't compile with gcc 8.1.0. I get this error message:

/usr/include/c++/8/tr1/cmath:1163:20: error: ‘__gnu_cxx::conf_hypergf’ has not been declared
   using __gnu_cxx::conf_hypergf;

I already checked a couple of posts, but they are either very old or about a different issue.

My questions are:

Why is it slower in parallel?

Am I using the wrong functions?

cppreference says that GCC does not support the Standardization of Parallelism TS (marked in red in the table), yet my code is running in parallel!?

Answer 1

Your function [](auto v){ v=0;} is extremely simple.

The function may be replaced with a single call to memset, or the compiler may use SIMD instructions for single-threaded parallelism. Knowing that it overwrites the vector with the same state it initially had, the optimiser could remove the entire loop. It may be easier for the optimiser to replace std::for_each than a parallel implementation.
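As a minimal sketch of that equivalence, assuming the lambda took its argument by reference (the names zero_with_for_each and zero_with_memset are hypothetical, not from the question), the two functions below leave the vector in the same state. Whether a given compiler actually turns the first into the second depends on the compiler and its flags.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Element-wise zeroing with a by-reference lambda
// (a variant of the question's lambda, which takes its argument by value).
inline void zero_with_for_each(std::vector<float>& data) {
    std::for_each(data.begin(), data.end(), [](float& x) { x = 0.0f; });
}

// The same effect as one bulk call. This assumes IEEE 754 floats,
// where an all-zero byte pattern represents 0.0f.
inline void zero_with_memset(std::vector<float>& data) {
    if (!data.empty())
        std::memset(data.data(), 0, data.size() * sizeof(float));
}
```

Because the loop body is this simple, a single-threaded build leaves the optimiser a lot of room, which a thread-based implementation may not.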

Furthermore, assuming the parallel loop uses threads, one must remember that thread creation and eventual synchronisation (in this case no synchronisation is needed during processing) have overhead, which may be significant relative to your trivial operation.

Threaded parallelism is often only worth it for computationally expensive tasks. v=0 is one of the least computationally expensive operations there is.
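To make the overhead concrete, here is a rough sketch of a hand-rolled threaded zero-fill. It is an illustration only: zero_with_threads is a hypothetical helper built on std::thread, whereas the libstdc++ parallel mode is built on OpenMP and is organised differently. Each call creates the worker threads and then waits for all of them, and both steps cost time regardless of how little work each chunk contains. Older toolchains may need -pthread to link it.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Zeroes `data` with up to `num_threads` workers, one contiguous chunk each.
// Hypothetical helper for illustration; not how __gnu_parallel is implemented.
inline void zero_with_threads(std::vector<float>& data, std::size_t num_threads) {
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t first = std::min(data.size(), t * chunk);
        const std::size_t last = std::min(data.size(), first + chunk);
        if (first == last) break;  // nothing left to hand out
        // Thread creation: a fixed cost paid once per worker.
        workers.emplace_back([&data, first, last] {
            std::fill(data.begin() + static_cast<std::ptrdiff_t>(first),
                      data.begin() + static_cast<std::ptrdiff_t>(last), 0.0f);
        });
    }
    // Final synchronisation: the caller blocks until every worker is done.
    for (auto& w : workers) w.join();
}
```

For a body as cheap as an assignment, the creation and join steps can easily cost more than the fill itself, which is the point the answer makes.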

Answer 2

Your benchmark is faulty; I'm even surprised it takes any time to run.

You wrote: std::for_each(v.begin(), v.end(),[](auto v){ v=0;});

As v is a local argument of the operator() and is never read, I would expect the compiler to remove the assignment. As you now have a loop with an empty body, that loop can be removed as well, since it has no observable effect. Similarly, the vector can be removed too, as nothing ever reads it.
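A small sketch shows the effect of the by-value parameter (zero_by_value and zero_by_reference are hypothetical names introduced here): the first function reuses the lambda from the question and leaves the vector untouched, while the second takes each element by reference and really writes to it.

```cpp
#include <algorithm>
#include <vector>

// The question's lambda: each element is copied into `x`, the copy is
// zeroed, and the vector itself is never modified.
inline void zero_by_value(std::vector<float>& data) {
    std::for_each(data.begin(), data.end(), [](auto x) { x = 0; });
}

// Taking the element by reference makes the assignment write into the vector.
inline void zero_by_reference(std::vector<float>& data) {
    std::for_each(data.begin(), data.end(), [](auto& x) { x = 0; });
}
```

Calling zero_by_value on a vector holding {1, 2, 3} leaves it as {1, 2, 3}; zero_by_reference turns it into {0, 0, 0}.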

So, without any side effects, this could all be removed. If you use a parallel algorithm, chances are there is some kind of synchronization, which makes optimizing this much harder, as there might be side effects in another thread. Proving that there are none is more complex, not to mention the side effects the thread management itself could have.

To solve this, a lot of benchmarks use tricks, often wrapped in macros, to force the compiler to assume side effects. Use them in the lambda so the compiler doesn't remove it.
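One commonly used form of such a trick on GCC and Clang is an empty inline-assembly statement that the compiler must treat as touching memory. The sketch below is an assumption about how this could be applied here, not the answer's own code: escape, clobber and time_zero_fill_ms are hypothetical names, and the lambda takes its argument by reference so that there is a real store to keep.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// GCC/Clang only. The compiler must assume the asm statement may read or
// write the memory behind `p`, so stores to that memory cannot be dropped.
inline void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }

// Forces all pending writes to be treated as observable at this point.
inline void clobber() { asm volatile("" : : : "memory"); }

// Times one zero-fill pass over `data` and returns milliseconds.
// The vector is allocated by the caller, so allocation is not measured.
inline double time_zero_fill_ms(std::vector<float>& data) {
    escape(data.data());
    const auto begin = std::chrono::steady_clock::now();
    std::for_each(data.begin(), data.end(), [](float& x) { x = 0.0f; });
    clobber();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

The barriers sit around the loop here, not inside the lambda, so the compiler can still vectorise the loop body; placing one inside the lambda, as the answer suggests, also works but is likely to slow down every iteration.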
