Why does tree vectorization make this sorting algorithm 2x slower?

The sorting algorithm of this question becomes twice as fast(!) if -fprofile-arcs is enabled in gcc (4.7.2). Here is the heavily simplified C code of that question (it turned out that I can initialize the array with all zeros; the weird performance behavior remains, but it makes the reasoning much, much simpler):

#include <time.h>
#include <stdio.h>

#define ELEMENTS 100000

int main() {
  int a[ELEMENTS] = { 0 };
  clock_t start = clock();
  for (int i = 0; i < ELEMENTS; ++i) {
    int lowerElementIndex = i;
    for (int j = i+1; j < ELEMENTS; ++j) {
      if (a[j] < a[lowerElementIndex]) {
        lowerElementIndex = j;
      }
    }
    int tmp = a[i];
    a[i] = a[lowerElementIndex];
    a[lowerElementIndex] = tmp;
  } 
  clock_t end = clock();
  float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
  printf("Time: %2.3f\n", timeExec);
  printf("ignore this line %d\n", a[ELEMENTS-1]);
}

After playing with the optimization flags for a long while, it turned out that -ftree-vectorize also yields this weird behavior, so we can take -fprofile-arcs out of the question. After profiling with perf, I found that the only relevant difference is:

Fast case: gcc -std=c99 -O2 simp.c (runs in 3.1 s)

    cmpl    %esi, %ecx
    jge .L3
    movl    %ecx, %esi
    movslq  %edx, %rdi
.L3:

Slow case: gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1 s)

    cmpl    %ecx, %esi
    cmovl   %edx, %edi
    cmovl   %esi, %ecx

As for the first snippet: Given that the array only contains zeros, we always jump to .L3. It can greatly benefit from branch prediction.

I guess the cmovl instructions cannot benefit from branch prediction.


Questions:

  1. Are all my above guesses correct? Does this make the algorithm slow?

  2. If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorize workaround, of course) but still do as much optimization as possible?

  3. What is this -ftree-vectorize? The documentation is quite vague; I would need a little more explanation to understand what's happening.


Update: Since it came up in comments: The weird performance behavior with respect to the -ftree-vectorize flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.

Since it also came up: I have a Core i5 CPU.


Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I only took out the inner loop. Its only goal is to examine the effect of branch prediction: We skip the if branch in the for loop with probability p.

#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8; 
constexpr double p = 0.50;

int main() {
  printf("p = %.2f\n", p);
  int* a = new int[ELEMENTS];
  mt19937 mt(1759);
  bernoulli_distribution rnd(p);
  for (int i = 0 ; i < ELEMENTS; ++i){
    a[i] = rnd(mt)? i : -i;
  }
  auto start = high_resolution_clock::now();
  int lowerElementIndex = 0;
  for (int i=0; i<ELEMENTS; ++i) {
    if (a[i] < a[lowerElementIndex]) {
      lowerElementIndex = i;
    }
  } 
  auto finish = high_resolution_clock::now();
  printf("%ld  ms\n", duration_cast<milliseconds>(finish-start).count());
  printf("Ignore this line   %d\n", a[lowerElementIndex]);
  delete[] a;
}

The loops of interest:

This will be referred to as cmov:

g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L30:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    movslq  %eax, %rdx
    cmovl   %rdx, %rbp
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L30

This will be referred to as no cmov; the -fno-if-conversion flag was pointed out by Turix in his answer:

g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L29:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    jge .L28
    movslq  %eax, %rbp
.L28:
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L29

The difference side by side:

cmpl    %edx, (%rbx,%rax,4) |     cmpl  %edx, (%rbx,%rax,4)
movslq  %eax, %rdx          |     jge   .L28
cmovl   %rdx, %rbp          |     movslq    %eax, %rbp
                            | .L28:

The execution time as a function of the Bernoulli parameter p:

[Graph: execution time vs. p, showing the effect of branch prediction]

The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p<0.26 or 0.81<p, and it is at most 4.38x faster (at p=1). Of course, the worst situation for the branch predictor is around p=0.5, where the code is 1.58x slower than the code with the cmov instruction.

Note: Answered before the graph update was added to the question; some assembly code references here may be obsolete.

(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)

First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)

Re your 3rd question: Even though in your case there is actually no vectorization being done, from step 11 ("Conditional Execution") here, it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:

if (a[j] < a[lowerElementIndex])
    lowerElementIndex = j;

Apparently, this happens even if there is no vectorization.

This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known, and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).

Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations of the loop could effectively be accomplished in parallel.

However, in your case, the attempt at optimization actually backfires because in the flattened loop, the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW data hazard that causes the second move (cmovl %esi, %ecx) to have to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)

On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, but [even] in a random array should [still] be roughly more than half the time [edited per @Yakk's comment]), it does not have to wait for the memory access to finish to go ahead and queue up the next few instructions in the loop. So in correct predictions, you get a boost, whereas in incorrect predictions, the result is no worse than in the "optimized" case and, furthermore, better because of the ability to sometimes avoid having the 2 "wasted" cmovl instructions in the pipeline.

[The following was removed due to my mistaken assumption about your processor per your comment.]
Back to your questions, I would suggest looking at that link above for more on the flags relevant to vectorization, but in the end, I'm pretty sure that it's fine to ignore that optimization given that your Celeron isn't capable of using it (in this context) anyway.

[Added after the above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."), you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful given @Yakk's point about your sort algorithm).
