Cache miss penalty on branching

Question

I wonder whether it is faster to replace a branch with two multiplications (because of the cache miss penalty).
Here is my case:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

if(dot<0){
    dir.x = -dir.x;
    dir.y = -dir.y;
}

And I'm trying to replace it with:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

int sgn = 1 - 2*(dot < 0.0f); //returns -1 if dot < 0, otherwise 1 (no branching here)
dir.x *= sgn;
dir.y *= sgn;

Answer 1

Branching does not imply a cache miss: it only disturbs instruction prefetching and pipelining. It is possible, though, that the branch prevents the compiler from applying some SSE optimization.

On the other hand, if only plain x86 instructions are used, speculative execution lets the processor start executing the most frequently taken branch ahead of time.

However, if you enter the if about 50% of the time, you are in the worst case. In that situation I would look into SSE and try to get the execution optimized with it, probably getting some hints from this post, in line with your second block of code.
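One way to drop the branch without any multiplication is to toggle the sign bit directly, which is the scalar analogue of XORing with a sign mask in SSE code. The sketch below assumes IEEE-754 32-bit floats; `vec2` and `flip_if_negative` are hypothetical names, not taken from the question. Whether the compiler keeps it branch-free still has to be checked in the generated assembly.

```c
#include <stdint.h>
#include <string.h>

typedef struct { float x, y; } vec2;  /* hypothetical stand-in for the asker's vector type */

/* Negate v when dot < 0, with no data-dependent branch in the source.
   Assumes float is an IEEE-754 32-bit value whose sign is bit 31. */
static vec2 flip_if_negative(vec2 v, float dot)
{
    /* 0x80000000 when dot < 0, otherwise 0. */
    uint32_t mask = (uint32_t)(dot < 0.0f) << 31;
    uint32_t bx, by;

    /* memcpy is the portable way to reinterpret the float bits. */
    memcpy(&bx, &v.x, sizeof bx);
    memcpy(&by, &v.y, sizeof by);
    bx ^= mask;  /* toggling bit 31 flips the sign */
    by ^= mask;
    memcpy(&v.x, &bx, sizeof bx);
    memcpy(&v.y, &by, sizeof by);
    return v;
}
```

With `dot` equal to zero or NaN the comparison is false and the vector is returned unchanged, which matches what the original if does.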

In any case, benchmark your code and check the generated assembly to find the best solution for this optimization and to get proper insight. And eventually keep us updated :)
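A minimal sketch of such a benchmark could look like the following. Everything here is an assumption made for illustration: the `vec2` type, the function names, the input distribution (small integer-valued coordinates, so the arithmetic is exact) and the use of `clock()` for timing. The multiplication variant uses a sign factor of -1 when `dot < 0` and +1 otherwise, so that both variants return the same result for every input.

```c
#include <stdlib.h>
#include <time.h>

typedef struct { float x, y; } vec2;  /* hypothetical stand-in for the asker's type */

/* Variant 1: the branch. */
static vec2 flip_branch(vec2 rib1, vec2 dir)
{
    float dot = rib1.x * -dir.y + rib1.y * dir.x;
    if (dot < 0.0f) {
        dir.x = -dir.x;
        dir.y = -dir.y;
    }
    return dir;
}

/* Variant 2: multiply by a sign factor that is -1 when dot < 0, else +1. */
static vec2 flip_mul(vec2 rib1, vec2 dir)
{
    float dot = rib1.x * -dir.y + rib1.y * dir.x;
    float sgn = 1.0f - 2.0f * (float)(dot < 0.0f);
    dir.x *= sgn;
    dir.y *= sgn;
    return dir;
}

/* Small integer-valued float in [-8, 8]; keeps products and sums exact. */
static float rand_coord(void)
{
    return (float)(rand() % 17 - 8);
}

/* Times fn over n pseudo-random inputs. Returns elapsed CPU seconds, or -1.0
   if allocation fails. The checksum is handed back through *checksum so the
   compiler cannot discard the timed loop. */
static double time_variant(vec2 (*fn)(vec2, vec2), size_t n, double *checksum)
{
    vec2 *in = malloc(2 * n * sizeof *in);
    if (in == NULL)
        return -1.0;

    srand(12345u);  /* fixed seed: every variant sees the same inputs */
    for (size_t i = 0; i < 2 * n; ++i) {
        in[i].x = rand_coord();
        in[i].y = rand_coord();
    }

    double sum = 0.0;
    clock_t start = clock();
    for (size_t i = 0; i < n; ++i) {
        vec2 r = fn(in[2 * i], in[2 * i + 1]);
        sum += r.x + r.y;
    }
    clock_t stop = clock();

    free(in);
    *checksum = sum;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Call `time_variant(flip_branch, n, &c1)` and `time_variant(flip_mul, n, &c2)` with the same `n`, built with the optimization flags of the real project, and compare the two checksums before trusting the timings. Random inputs make the branch hard to predict; a compiler may still turn it into a conditional move, which is why looking at the assembly matters.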

Answer 2

The cost of the multiplication depends on several factors, such as whether you use 32-bit or 64-bit floats and whether you enable SSE or not. The cost of two float multiplications is 10 cycles according to this source: http://www.agner.org/optimize/instruction_tables.pdf

The cost of the branch also depends on several factors. As a rule of thumb, do not worry about branches in your code. The exact behaviour of the CPU's branch predictor determines the performance, but in this case you should probably expect the branch to be hard to predict, which is likely to lead to a lot of branch mispredictions. The cost of a branch misprediction is 10-30 cycles according to this source: http://valgrind.org/docs/manual/cg-manual.html

The best advice anyone can give here is to profile and test. I would guess that on a modern Core i7 the two multiplications should be faster than the branch, if the input varies enough to cause enough branch mispredictions to outweigh the cost of the additional multiplications.

Assuming a 50% misprediction rate, the cost of the branch averages 15 cycles (30 * 0.5, taking the upper end of the 10-30 cycle range), while the cost of the float multiplications is 10 cycles.
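The estimate above can be written as a small back-of-the-envelope model. It is only a sketch under simplifying assumptions: correctly predicted branches are treated as free, and the function names are hypothetical. With the 30-cycle penalty the branch averages 15 cycles at a 50% misprediction rate, but with the 10-cycle low end it averages only 5, which is below the 10 cycles quoted for the multiplications.

```c
/* Expected cycles lost per branch: misprediction rate times the penalty.
   Assumes a correctly predicted branch costs nothing. */
static double expected_branch_cost(double miss_rate, double penalty_cycles)
{
    return miss_rate * penalty_cycles;
}

/* Misprediction rate above which the multiplications become the cheaper
   option, given their cost and the misprediction penalty in cycles. */
static double break_even_miss_rate(double mul_cycles, double penalty_cycles)
{
    return mul_cycles / penalty_cycles;
}
```

For the figures quoted here, `break_even_miss_rate(10.0, 30.0)` is about one third, so under this model the multiplications only pay off when the branch is mispredicted more often than roughly one time in three.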

EDIT: Added links, updated estimated instruction cost.

2 answers

Answer 1
2 2014-03-22 23:45:32

Answer 2
1 accepted 2014-03-22 23:45:57
