C++17 parallel algorithms vs TBB parallel vs OpenMP performance

Question

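Since the C++17 standard library supports parallel algorithms, I thought they would be our go-to option. After comparing them with TBB and OpenMP I changed my mind: I found the standard library to be much slower.

With this post I would like to ask for professional advice on whether I should abandon the standard library's parallel algorithms and use TBB or OpenMP instead. Thanks!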

Environment:

macOS Catalina 10.15.7
GNU g++-10

Benchmark code:

#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>  // std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 1000000;

double std_for() {
  auto values = std::vector<double>(N);

  size_t n_par = 5lu;
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
        size_t begin = index * stride;
        // Clamp the last chunk: n_par * stride is larger than N.
        size_t end = std::min((index + 1) * stride, N);
        for (size_t i = begin; i < end; ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;

  for (double value : values)
  {
    total += value;
  }
  return total;
}

double tbb_for() {
  auto values = std::vector<double>(N);

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;
  for (double value : values) {
    total += value;
  }
  return total;
}

double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

double seq_for()
{
  auto values = std::vector<double>(N);

  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

void time_it(double(*fn_ptr)(), const std::string& fn_name) {
  auto t1 = std::chrono::high_resolution_clock::now();
  auto rez = fn_ptr();
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
  std::cout << fn_name << ", rez = " << rez << ", dur = " << duration << std::endl;
}

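int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " std_for|omp_for|tbb_for|seq_for" << std::endl;
    return 1;
  }
  std::string op(argv[1]);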
  if (op == "std_for") {
    time_it(&std_for, op);
  } else if (op == "omp_for") {
    time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

Compile options:

g++ --std=c++17 -O3 b.cpp -ltbb -I /usr/local/include -L /usr/local/lib -fopenmp

Results (dur in microseconds):

std_for, rez = 500106, dur = 11119
tbb_for, rez = 500106, dur = 7372
omp_for, rez = 500106, dur = 4781
seq_for, rez = 500106, dur = 27910

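We can see that std_for is faster than seq_for (the sequential for loop), but it is still much slower than TBB and OpenMP.

Update

As people suggested in the comments, I ran each version separately to be fair. I updated the code above, and the results are as follows: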

>>> ./a.out seq_for
seq_for, rez = 500106, dur = 29885

>>> ./a.out tbb_for
tbb_for, rez = 500106, dur = 10619

>>> ./a.out omp_for
omp_for, rez = 500106, dur = 10052

>>> ./a.out std_for
std_for, rez = 500106, dur = 12423

As people said, running all 4 versions back to back is unfair, as the comparison with the previous results shows.

Answer 1

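You have already found that it matters what exactly is measured and how it is measured. Your final task will certainly be quite different from this simple exercise, so the results found here will not carry over to it entirely.

Besides caching and warm-up, which are affected by the order in which the tasks run (you studied this explicitly in your updated question), there is another issue in your example that you should consider.

The actual parallel code is what matters. If it does not determine your performance/runtime, then parallelization is not the right solution. But in your example you also measure resource allocation, initialization and the final computation. If those drive the real costs in your final application, then again parallelization is not the silver bullet. So, for a fair comparison that really measures the performance of the actual parallel code, I suggest modifying your code along these lines (sorry, I don't have OpenMP installed) and continuing your studies: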

#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>  // std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 10000000; // #1

void std_for(std::vector<double>& values, 
             std::vector<size_t> const& indices, 
             size_t const stride) {

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
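        size_t begin = index * stride;
        // Clamp the last chunk: n_par * stride is larger than values.size().
        size_t end = std::min((index + 1) * stride, values.size());
        for (size_t i = begin; i < end; ++i) {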
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });
}

void tbb_for(std::vector<double>& values) {

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

}

/*
double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}
*/

void seq_for(std::vector<double>& values)
{
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }
}

void time_it(void(*fn_ptr)(std::vector<double>&), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}

void time_it_std(void(*fn_ptr)(std::vector<double>&, std::vector<size_t> const&, size_t const), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  size_t n_par = 5lu;  // #2
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;
  
  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values, indices, stride);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}



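int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " std_for|tbb_for|seq_for" << std::endl;
    return 1;
  }
  std::string op(argv[1]);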
  if (op == "std_for") {
    time_it_std(&std_for, op);
    //  } else if (op == "omp_for") {
    //time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

On my (slow) system this results in:

std_for, res = 5.00046e+06, dur = 66393
tbb_for, res = 5.00046e+06, dur = 51746
seq_for, res = 5.00046e+06, dur = 196156

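Here I note that the difference between seq_for and tbb_for has increased further. It is now ~4x, whereas in your example it looked more like ~3x. And std_for is still about 20..30% slower than tbb_for.

However, there are more parameters. After increasing N (see #1) by a factor of 10 (OK, this is not very important) and n_par (see #2) from 5 to 100 (this is important), the results are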

tbb_for, res = 5.00005e+07, dur = 486179
std_for, res = 5.00005e+07, dur = 479306

Here std_for is on par with tbb_for!
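A plausible reading of this result is that with n_par = 5 the std::for_each call only has five work items to distribute, so it cannot keep more than five threads busy and cannot rebalance when one chunk finishes late. The sketch below only shows how the chunk boundaries follow from N and n_par, using the same stride formula as the benchmark and clamping the last chunk so it cannot run past the end of the vector. The helper name chunk_bounds is hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical helper: half-open range [begin, end) of chunk `index`
// when `n` elements are split into `n_par` chunks with
// stride = n / n_par + 1, as in the benchmark. Both ends are clamped
// to n so that the last chunk stops at the end of the data.
inline std::pair<std::size_t, std::size_t>
chunk_bounds(std::size_t n, std::size_t n_par, std::size_t index) {
  const std::size_t stride = n / n_par + 1;
  const std::size_t begin = std::min(index * stride, n);
  const std::size_t end = std::min((index + 1) * stride, n);
  return {begin, end};
}
```

For N = 1000000 and n_par = 5, chunk 4 covers [800004, 1000000). Without the clamp its end would be 1000005, five elements past the vector.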

So, to answer your question: I clearly would not discard the C++17 standard parallelization right away.
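The caching and warm-up effects mentioned at the start of this answer can also be reduced inside a single process, instead of relying on separate program runs. A minimal sketch, assuming it is acceptable to run the function under test several times: do one untimed warm-up call, then report the fastest of several timed repetitions. The helper name best_of_us is hypothetical, and std::chrono::steady_clock is used here in place of high_resolution_clock because it is guaranteed to be monotonic.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <limits>

// Hypothetical helper: calls fn once untimed (warm-up), then `reps`
// more times with timing, and returns the fastest run in microseconds.
inline std::int64_t best_of_us(const std::function<void()>& fn, int reps) {
  fn();  // warm-up: thread pool start-up, page faults, cache fill
  std::int64_t best = std::numeric_limits<std::int64_t>::max();
  for (int r = 0; r < reps; ++r) {
    const auto t1 = std::chrono::steady_clock::now();
    fn();
    const auto t2 = std::chrono::steady_clock::now();
    const std::int64_t us =
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    best = std::min(best, us);
  }
  return best;
}
```

With the functions above it could be called as best_of_us([&] { tbb_for(values); }, 10), with values allocated once outside the timed region.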

Answer 2

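Perhaps you already know this, but one fact I don't see mentioned here is that (at least for gcc and clang) the PSTL is actually implemented using/backed by TBB, OpenMP (currently only on clang, I believe), or a sequential version of it.

I'm guessing you are using libc++, since you are on a Mac. As far as I know, at least on Linux, the LLVM distributions do not ship with the PSTL enabled, and if you build PSTL and libcxx/libcxxabi from source, it defaults to a sequential backend.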

https://github.com/llvm/llvm-project/blob/main/pstl/CMakeLists.txt

https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/pstl/pstl_config.h
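To check which backend a given build actually uses, one option is to test the backend selection macros at compile time. The sketch below assumes the internal macro names _PSTL_PAR_BACKEND_TBB, _PSTL_PAR_BACKEND_OPENMP and _PSTL_PAR_BACKEND_SERIAL used by the PSTL sources. These are implementation details, not a documented interface, so they may be absent or renamed in other library versions, in which case the function reports "unknown". The function name pstl_backend is hypothetical.

```cpp
#include <cstddef>  // any standard header pulls in the library configuration
#include <string>

// Hypothetical helper: reports the PSTL backend selected at compile time.
// Assumption: the library defines one of the internal macros tested below.
inline std::string pstl_backend() {
#if defined(_PSTL_PAR_BACKEND_TBB)
  return "tbb";
#elif defined(_PSTL_PAR_BACKEND_OPENMP)
  return "openmp";
#elif defined(_PSTL_PAR_BACKEND_SERIAL)
  return "serial";
#else
  return "unknown";
#endif
}
```

If this reports "serial", std::execution::par would run sequentially in that build, and a benchmark like the one in the question would not be measuring parallel execution for std_for at all.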

Answer 3

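OpenMP is good for straightforward parallel coding.
TBB, on the other hand, uses a work-stealing mechanism, which can give better performance for loops that are imbalanced and nested.
I prefer TBB over OpenMP for complex and nested parallelism. (OpenMP has a huge overhead for nested parallelism.)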
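To illustrate why load balancing matters for imbalanced loops, here is a toy model. It is not TBB's scheduler and it runs no threads. It only compares the finishing time of the slowest worker when iterations are split into fixed contiguous blocks against the case where each iteration goes to whichever worker is free first, which is roughly the effect work stealing aims for. All names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy model: cost[i] is the time needed by iteration i.

// Static split: each worker gets one contiguous block of equal length.
inline long static_makespan(const std::vector<long>& cost, std::size_t workers) {
  const std::size_t block = (cost.size() + workers - 1) / workers;
  long worst = 0;
  for (std::size_t w = 0; w < workers; ++w) {
    const std::size_t b = std::min(w * block, cost.size());
    const std::size_t e = std::min(b + block, cost.size());
    worst = std::max(worst,
                     std::accumulate(cost.begin() + b, cost.begin() + e, 0L));
  }
  return worst;
}

// Dynamic balancing: the next iteration goes to the least-loaded worker.
inline long dynamic_makespan(const std::vector<long>& cost, std::size_t workers) {
  std::vector<long> load(workers, 0);
  for (long c : cost) {
    *std::min_element(load.begin(), load.end()) += c;
  }
  return *std::max_element(load.begin(), load.end());
}
```

With costs 1, 2, ..., 8 and two workers, the static split finishes after 26 time units (the second block holds all the expensive iterations), while the dynamic assignment finishes after 20, against an ideal of 18.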
