Raw loop on an array of bool is 5 times faster than transform or for_each

Question

Based on my previous experience benchmarking transform and for_each, they usually perform slightly faster than raw loops, and of course they are safer, so I tried to replace all my raw loops with transform, generate and for_each. Today I compared how fast I can flip booleans using for_each, transform and a raw loop, and I got very surprising results: the raw loop performs 5 times faster than the other two. I was not able to find a good reason for this massive difference.

#include <array>
#include <algorithm>
#include <benchmark/benchmark.h>  // Google Benchmark: benchmark::State, BENCHMARK


static void ForEach(benchmark::State& state) {
  std::array<bool, sizeof(short) * 8> a;
  std::fill(a.begin(), a.end(), true);

  for (auto _ : state) {
    std::for_each(a.begin(), a.end(), [](auto & arg) { arg = !arg; });
    benchmark::DoNotOptimize(a);
  }
}
BENCHMARK(ForEach);

static void Transform(benchmark::State& state) {
  std::array<bool, sizeof(short) * 8> a;
  std::fill(a.begin(), a.end(), true);

  for (auto _ : state) {
    std::transform(a.begin(), a.end(), a.begin(), [](auto arg) { return !arg; });
    benchmark::DoNotOptimize(a);
  }
}
BENCHMARK(Transform);


static void RawLoop(benchmark::State& state) {
  std::array<bool, sizeof(short) * 8> a;
  std::fill(a.begin(), a.end(), true);

  for (auto _ : state) {
    for (int i = 0; i < a.size(); i++) {
      a[i] = !a[i];
    }
    benchmark::DoNotOptimize(a);
  }
}
BENCHMARK(RawLoop);

clang++ (7.0) -O3 -libc++ (LLVM)

Answer 1

In this example, clang vectorizes indexing but (mistakenly) fails to vectorize iterating.

To summarize the results, there is no difference between using a raw loop and using std::transform or std::for_each. There is, however, a difference between indexing and iterating, and for this particular problem clang is better at optimizing indexing than iterating, because the indexed loop gets vectorized. std::transform and std::for_each use iterating, so they end up being slower (when compiled with clang).

What's the difference between indexing and iterating?

- Indexing is when you use an integer to index into an array.
- Iterating is when you increment a pointer from begin() to end().

Let's write the raw loop both ways, using indexing and using iterating, and compare the performance of iterating (with a raw loop) to indexing.

// Indexing
for(int i = 0; i < a.size(); i++) {
    a[i] = !a[i];
}

// Iterating, used by std::for_each and std::transform
bool* begin = a.data();
bool* end   = begin + a.size(); 
for(; begin != end; ++begin) {
    *begin = !*begin; 
}

The example using indexing is better optimized and runs 4-5 times faster when compiled with clang.

To demonstrate this, let's add two additional tests, both using a raw loop. One will use an iterator, and the other will use raw pointers.


static void RawLoopIt(benchmark::State& state) {
  std::array<bool, 16> a;
  std::fill(a.begin(), a.end(), true); 

  for(auto _ : state) {
    auto scan = a.begin(); 
    auto end = a.end(); 
    for (; scan != end; ++scan) {
      *scan = !*scan; 
    }
    benchmark::DoNotOptimize(a); 
  }
}

BENCHMARK(RawLoopIt); 

static void RawLoopPtr(benchmark::State& state) {
  std::array<bool, 16> a;
  std::fill(a.begin(), a.end(), true); 

  for(auto _ : state) {
    bool* scan = a.data(); 
    bool* end = scan + a.size(); 
    for (; scan != end; ++scan) {
      *scan = !*scan; 
    }
    benchmark::DoNotOptimize(a); 
  } 
}

BENCHMARK(RawLoopPtr);

When using a pointer or an iterator from begin to end, these functions are identical in performance to std::for_each or std::transform.
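One way to see why the numbers match: a standard library's for_each generally reduces to the same begin-to-end loop. The sketch below is a simplified stand-in, not libc++'s actual source; the names for_each_sketch and flip_all are hypothetical and introduced only for illustration.

```cpp
#include <array>
#include <cstddef>

// Simplified stand-in for std::for_each (hypothetical; real implementations
// differ in details, but the loop has this begin-to-end shape).
template <class It, class F>
F for_each_sketch(It first, It last, F f) {
  for (; first != last; ++first) {
    f(*first);
  }
  return f;
}

// Flips every element through the sketch above. After inlining, this is the
// same pointer-style loop as the RawLoopPtr benchmark body.
template <std::size_t N>
void flip_all(std::array<bool, N>& a) {
  for_each_sketch(a.begin(), a.end(), [](bool& b) { b = !b; });
}
```

Since the algorithm call and the hand-written iterator loop have the same shape, it is plausible that the optimizer treats them the same way, which is consistent with the timings.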

Clang Quick-bench results:

This is confirmed by running the clang benchmark locally:

me@K2SO:~/projects/scratch$ clang++ -O3 bench.cpp -lbenchmark -pthread -o clang-bench
me@K2SO:~/projects/scratch$ ./clang-bench
2019-07-05 16:13:27
Running ./clang-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.44, 0.55, 0.59
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ForEach          8.32 ns         8.32 ns     83327615
Transform        8.29 ns         8.28 ns     82536410
RawLoop          1.92 ns         1.92 ns    361745495
RawLoopIt        8.31 ns         8.31 ns     81848945
RawLoopPtr       8.28 ns         8.28 ns     82504276

GCC does not have this problem.

For the purposes of this example, there is no fundamental difference between indexing and iterating. Both of them apply an identical transformation to the array, and the compiler should be able to compile them identically.

Indeed, GCC is able to do this, with all methods running faster than the corresponding version compiled under clang.

GCC Quick-bench results:

GCC local results:
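As a small sanity check of that claim, the two loop styles can be written as functions and compared on the same input. This is only a sketch: the names Flags, flip_by_index and flip_by_pointer are hypothetical, and the index loop uses std::size_t rather than the int counter from the question, which is a choice made here and not part of the benchmark.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kSize = 16;  // same length as the benchmarked array
using Flags = std::array<bool, kSize>;

// Indexing: walk the array with an integer counter.
Flags flip_by_index(Flags a) {
  for (std::size_t i = 0; i < a.size(); ++i) {
    a[i] = !a[i];
  }
  return a;
}

// Iterating: walk the array with a pointer from begin to end.
Flags flip_by_pointer(Flags a) {
  for (bool* p = a.data(), *end = a.data() + a.size(); p != end; ++p) {
    *p = !*p;
  }
  return a;
}
```

Both functions return the same array for any input, so any difference in speed comes from how the compiler optimizes each loop, not from the work being done.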

2019-07-05 16:13:35
Running ./gcc-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.52, 0.57, 0.60
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ForEach          1.43 ns         1.43 ns    484760981
Transform        1.44 ns         1.44 ns    485788409
RawLoop          1.43 ns         1.43 ns    484973417
RawLoopIt        1.44 ns         1.44 ns    482685685
RawLoopPtr       1.44 ns         1.44 ns    483736235

Is indexing actually faster than iterating?

No. Indexing is faster here only because clang vectorizes it.

Under the hood, neither iterating nor indexing occurs. Instead, gcc and clang vectorize the operation by treating the array as two 64-bit integers and applying a bitwise XOR to them. We can see this reflected in the assembly used to flip the bits:

       movabs $0x101010101010101,%rax
       nopw   %cs:0x0(%rax,%rax,1)
       xor    %rax,(%rsp)
       xor    %rax,0x8(%rsp)
       sub    $0x1,%rbx
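The same trick can be written out by hand in C++. The sketch below is an illustration of what the two xor instructions do, not code either compiler emits from source; flip16_xor is a hypothetical name, and it assumes sizeof(bool) == 1 and that every element is stored as exactly 0 or 1.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Flips 16 bools by XOR-ing two 64-bit words with 0x0101010101010101.
// Assumes one byte per bool and that each byte holds exactly 0 or 1.
void flip16_xor(std::array<bool, 16>& a) {
  static_assert(sizeof(bool) == 1, "sketch assumes one byte per bool");
  constexpr std::uint64_t kMask = 0x0101010101010101ULL;

  std::uint64_t words[2];
  std::memcpy(words, a.data(), sizeof(words));  // read the array as two 64-bit words
  words[0] ^= kMask;                            // toggles the low bit of 8 bytes at once
  words[1] ^= kMask;
  std::memcpy(a.data(), words, sizeof(words));  // write the flipped bytes back
}
```

XOR with 1 turns 0 into 1 and 1 into 0 in each byte, so one 64-bit XOR negates eight bools at a time.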

Iterating is slower when compiled by clang because, for some reason, clang fails to vectorize the operation when iterating is used. This is a defect in clang, and one that applies specifically to this problem. As clang improves, this discrepancy should disappear, and it's not something I would worry about for now.

Don't micro-optimize.

Let the compiler handle that, and if necessary, test whether gcc or clang produces faster code for your particular use case. Neither is better for all cases. For example, clang is better at vectorizing some math operations.

Answer 1: accepted, 2019-07-05 22:46:50
