Branchless conditionals on integers: fast, but can they be made faster?

Question

I've been experimenting with the following and have noticed that the branchless “if” defined here (now with &-!! replacing *!!) can speed up certain bottleneck code by almost 2x on 64-bit Intel targets with clang:

// Produces x if f is true, else 0 if f is false.
#define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))

// Produces x if f is true, else y if f is false.
#define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                     ((y) & -((typeof(y)) !(f))))

Note that f should be a reasonably simple expression with no side effects, so that the compiler is able to do its best optimizations.

Performance is highly dependent on CPU and compiler. The branchless 'if' performance is excellent with clang; I haven't found any cases yet where the branchless 'if/else' is faster, though.

My question is: are these safe and portable as written (meaning guaranteed to give correct results on all targets), and can they be made faster?

Example usage of branchless if/else

These compute the 64-bit minimum and maximum.

inline uint64_t uint64_min(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a <= b), a, b);
}

inline uint64_t uint64_max(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a >= b), a, b);
}

Example usage of branchless if

This is 64-bit modular addition: it computes (a + b) % n. The branching version (not shown) suffers terribly from branch prediction failures, but the branchless version is very fast (at least with clang).

inline uint64_t uint64_add_mod(uint64_t a, uint64_t b, uint64_t n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  uint64_t c = a + b - BRANCHLESS_IF((a >= n - b), n);

  assert(c < n);
  return c;
}

Update: Full concrete working example of branchless if

Below is a full working C11 program that demonstrates the speed difference between the branching and branchless versions of a simple if conditional, in case you would like to try it on your system. The program computes modular exponentiation, that is (a ** b) % n, for extremely large values.

To compile, use the following on the command line:

-O3 (or whatever high optimization level you prefer)
-DNDEBUG (to disable assertions, for speed)
Either -DBRANCHLESS=0 or -DBRANCHLESS=1 to specify branching or branchless behavior, respectively

On my system, here's what happens:

$ cc -DBRANCHLESS=0 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 0
CPU time:  21.83 seconds
foo = 10585369126512366091

$ cc -DBRANCHLESS=1 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 1
CPU time:  11.76 seconds
foo = 10585369126512366091

$ cc --version
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz Intel Core i7).

// SPEED TEST OF MODULAR MULTIPLICATION WITH BRANCHLESS CONDITIONALS

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>
#include <assert.h>

typedef  uint64_t  uint64;

//------------------------------------------------------------------------------
#if BRANCHLESS
  // Actually branchless.
  #define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))
  #define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                       ((y) & -((typeof(y)) !(f))))
#else
  // Not actually branchless, but used for comparison.
  #define  BRANCHLESS_IF(f,x)          ((f)? (x) : 0)
  #define  BRANCHLESS_IF_ELSE(f,x,y)   ((f)? (x) : (y))
#endif

//------------------------------------------------------------------------------
// 64-bit modular multiplication.  Computes (a * b) % n without division.

static uint64 uint64_mul_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  if (a < b) { uint64 t = a; a = b; b = t; }  // Ensure that b <= a.

  uint64 c = 0;
  for (; b != 0; b /= 2)
  {
    // This computes c = (c + a) % n if (b & 1).
    c += BRANCHLESS_IF((b & 1), a - BRANCHLESS_IF((c >= n - a), n));
    assert(c < n);

    // This computes a = (a + a) % n.
    a += a - BRANCHLESS_IF((a >= n - a), n);
    assert(a < n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
// 64-bit modular exponentiation.  Computes (a ** b) % n using modular
// multiplication.

static
uint64 uint64_pow_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n);

  uint64 c = 1;

  for (; b > 0; b /= 2)
  {
    if (b & 1)
      c = uint64_mul_mod(c, a, n);

    a = uint64_mul_mod(a, a, n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
int main(const int argc, const char *const argv[const])
{
  printf("BRANCHLESS = %d\n", BRANCHLESS);

  clock_t clock_start = clock();

  #define SHOW_RESULTS 0

  uint64 foo = 0;  // Used in forcing compiler not to throw away results.

  uint64 n = 3, a = 1, b = 1;
  const uint64 iterations = 1000000;
  for (uint64 iteration = 0; iteration < iterations; iteration++)
  {
    uint64 c = uint64_pow_mod(a%n, b, n);

    if (SHOW_RESULTS)
    {
      printf("(%"PRIu64" ** %"PRIu64") %% %"PRIu64" = %"PRIu64"\n",
             a%n, b, n, c);
    }
    else
    {
      foo ^= c;
    }

    n = n * 3 + 1;
    a = a * 5 + 3;
    b = b * 7 + 5;
  }

  clock_t clock_end = clock();
  double elapsed = (double)(clock_end - clock_start) / CLOCKS_PER_SEC;
  printf("CPU time:  %.2f seconds\n", elapsed);

  printf("foo = %"PRIu64"\n", foo);

  return 0;
}

Second update: Intel vs. ARM performance

Testing on 32-bit ARM targets (iPhone 3GS/4S, iPad 1/2/3/4, as compiled by Xcode 6.1 with clang) reveals that the branchless “if” here is actually about 2–3 times slower than ternary ?: for the modular exponentiation code in those cases. So it seems that these branchless macros are not a good idea if maximum speed is needed, although they might be useful in rare cases where constant speed is needed.
On 64-bit ARM targets (iPhone 6+, iPad 5), the branchless “if” runs at the same speed as ternary ?:, again as compiled by Xcode 6.1 with clang.
For both Intel and ARM (as compiled by clang), the branchless “if/else” was about twice as slow as ternary ?: for computing min/max.

Answer 1

Sure, this is portable: the ! operator is guaranteed to give either 0 or 1 as a result. This is then converted to whatever type is needed by the other operand.

As others observed, your if-else version has the disadvantage of evaluating its condition twice, but you already know that, and if there is no side effect you are fine.

What surprises me is that you say that this is faster. I would have thought that modern compilers perform that sort of optimization themselves.

Edit: So I tested this with two compilers (gcc and clang) and both values of the BRANCHLESS setting.

In fact, if you don't forget to set -DNDEBUG=1, the 0 version with ?: is much better for gcc and does what I would have expected it to do. It basically uses conditional moves to make the loop branchless. In that case clang doesn't find this sort of optimization and emits some conditional jumps.

For the version with arithmetic, the performance for gcc worsens. In fact, seeing what it does, this is not surprising. It really uses imul instructions, and these are slow. clang fares better here. In the "arithmetic" version it has actually optimized the multiplications out and replaced them with conditional moves.

So to summarize: yes, this is portable, but whether it brings a performance improvement or a regression will depend on your compiler, its version, the compile flags that you are applying, the capabilities of your processor ...

Answer 1
6 2015-08-08 21:17:55
