
C++17 parallel algorithm vs tbb parallel vs openmp performance

Since the C++17 standard library supports parallel algorithms, I thought they would be the go-to option for us. But after comparing them with tbb and openmp, I changed my mind: I found the std library to be much slower.

With this post, I want to ask for professional advice about whether I should abandon the std library's parallel algorithms and use tbb or openmp instead. Thanks!

Env:

  • Mac OSX, Catalina 10.15.7
  • GNU g++-10

Benchmark code:

#include <algorithm>
#include <chrono>
#include <cmath>
#include <execution>
#include <iostream>
#include <numeric>   // std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 1000000;

double std_for() {
  auto values = std::vector<double>(N);

  size_t n_par = 5lu;
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
        size_t begin = index * stride;
        // clamp: with stride = N / n_par + 1, the last chunk would overrun N
        size_t end = std::min((index + 1) * stride, values.size());
        for (size_t i = begin; i < end; ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;

  for (double value : values)
  {
    total += value;
  }
  return total;
}

double tbb_for() {
  auto values = std::vector<double>(N);

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;
  for (double value : values) {
    total += value;
  }
  return total;
}

double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

double seq_for()
{
  auto values = std::vector<double>(N);

  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

void time_it(double(*fn_ptr)(), const std::string& fn_name) {
  auto t1 = std::chrono::high_resolution_clock::now();
  auto rez = fn_ptr();
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
  std::cout << fn_name << ", rez = " << rez << ", dur = " << duration << std::endl;
}

int main(int argc, char** argv) {
  std::string op(argv[1]);
  if (op == "std_for") {
    time_it(&std_for, op);
  } else if (op == "omp_for") {
    time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

Compile options:

g++ --std=c++17 -O3 b.cpp -ltbb -I /usr/local/include -L /usr/local/lib -fopenmp

Results:

std_for, rez = 500106, dur = 11119
tbb_for, rez = 500106, dur = 7372
omp_for, rez = 500106, dur = 4781
seq_for, rez = 500106, dur = 27910

We can see that std_for is faster than seq_for (the sequential for-loop), but it is still much slower than tbb and openmp.

UPDATE

As people suggested in the comments, I ran each version separately to be fair. The code above has been updated, with results as follows:

>>> ./a.out seq_for
seq_for, rez = 500106, dur = 29885

>>> ./a.out tbb_for
tbb_for, rez = 500106, dur = 10619

>>> ./a.out omp_for
omp_for, rez = 500106, dur = 10052

>>> ./a.out std_for
std_for, rez = 500106, dur = 12423

And as people said, running the four versions back-to-back in one process was not fair, as the comparison with the previous results shows.

You already found that it matters what exactly is to be measured and how this is done. Your final task will certainly be quite different from this simple exercise and will not be entirely reflected by the results found here.

Besides caching and warm-up effects, which are affected by the order in which the tasks run (you studied this explicitly in your updated question), there is another issue in your example you should consider.

The actual parallel code is what matters. If it does not dominate your performance/runtime, then parallelization is not the right solution. But in your example you also measure resource allocation, initialization, and the final reduction. If those drive the real costs in your final application, then, again, parallelization is not the silver bullet. Thus, for a fair comparison, you should really measure only the execution of the parallel code itself. I suggest modifying your code along these lines (sorry, I don't have openmp installed) and continuing your studies:

#include <algorithm>
#include <chrono>
#include <cmath>
#include <execution>
#include <iostream>
#include <numeric>   // std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 10000000; // #1

void std_for(std::vector<double>& values, 
             std::vector<size_t> const& indices, 
             size_t const stride) {

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
        size_t begin = index * stride;
        // clamp: with stride = N / n_par + 1, the last chunk would overrun N
        size_t end = std::min((index + 1) * stride, values.size());
        for (size_t i = begin; i < end; ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });
}

void tbb_for(std::vector<double>& values) {

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

}

/*
double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}
*/

void seq_for(std::vector<double>& values)
{
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }
}

void time_it(void(*fn_ptr)(std::vector<double>&), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}

void time_it_std(void(*fn_ptr)(std::vector<double>&, std::vector<size_t> const&, size_t const), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  size_t n_par = 5lu;  // #2
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;
  
  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values, indices, stride);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}



int main(int argc, char** argv) {
  std::string op(argv[1]);
  if (op == "std_for") {
    time_it_std(&std_for, op);
    //  } else if (op == "omp_for") {
    //time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

On my (slow) system this results in:

  • std_for, res = 5.00046e+06, dur = 66393
  • tbb_for, res = 5.00046e+06, dur = 51746
  • seq_for, res = 5.00046e+06, dur = 196156

I note here that the gap between seq_for and tbb_for has widened further: it is now ~4x, while in your example it looked more like ~3x. And std_for is still about 20-30% slower than tbb_for.

However, there are further parameters. After increasing N (see #1) by a factor of 10 (ok, this is not very important) and n_par (see #2) from 5 to 100 (this IS important), the results are:

  • tbb_for, res = 5.00005e+07, dur = 486179
  • std_for, res = 5.00005e+07, dur = 479306

Here std_for is on par with tbb_for!

Thus, to answer your question: I clearly would NOT discard C++17 std parallelization right away.

Perhaps you already know this, but something I don't see mentioned here is the fact that (at least for gcc and clang) the PSTL is actually backed by TBB, by OpenMP (currently on clang only, I believe), or by a sequential fallback.

I'm guessing you're using libc++ since you are on a Mac. As far as I know, for Linux at least, the LLVM distributions do not come with the PSTL enabled, and if you build PSTL and libcxx/libcxxabi from source, it defaults to a sequential backend.

https://github.com/llvm/llvm-project/blob/main/pstl/CMakeLists.txt

https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/pstl/pstl_config.h

  1. OpenMP is good for straightforward parallel coding.
  2. On the other hand, TBB uses a work-stealing mechanism, which can give you better performance for loops that are imbalanced and nested.
  3. I prefer TBB over OpenMP for complex and nested parallelism (OpenMP has a huge overhead for nested parallelism).
