Optimize 32-bit value construction

So, I have the following code:

uint32_t val;
if (swap) {
   val = ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16);
} else {
   val = ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);
}

Is there a way to optimize it, and have the swap check somehow embedded in the statement?

If the objective is to avoid a branch, then you can write this:

val = (((!!swap) * (uint32_t)a + (!swap) * (uint32_t)b) & 0x0000ffff)
        | (((!!swap) * (uint32_t)b + (!swap) * (uint32_t)a) << 16);

This uses the fact that !x evaluates to 0 whenever x is truthy and to 1 whenever x is falsy, so !!x evaluates to 1 when x is truthy, even though x may not itself be 1. Multiplying by the result selects either a or b as appropriate.
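As a quick sanity check of the multiplication trick, here is a minimal sketch that puts the branching and branch-free forms side by side. The function names pack_branch and pack_mul are made up for illustration, and the parameters are assumed to be uint32_t because the question does not state the types of a and b.

```c
#include <stdint.h>

/* Reference version with the branch, as in the question. */
uint32_t pack_branch(uint32_t a, uint32_t b, int swap)
{
    if (swap)
        return (a & 0x0000ffffu) | (b << 16);
    return (b & 0x0000ffffu) | (a << 16);
}

/* Branch-free version: !!swap is 1 and !swap is 0 when swap is nonzero. */
uint32_t pack_mul(uint32_t a, uint32_t b, int swap)
{
    uint32_t s = !!swap;   /* 1 if swap is nonzero, else 0 */
    uint32_t n = !swap;    /* the complement of s */
    return ((s * a + n * b) & 0x0000ffffu) | ((s * b + n * a) << 16);
}
```

Any nonzero swap, not just 1, selects the first form, which is why !!swap is needed rather than swap itself.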

Note, however, that instead of one compare and branch you now have multiple logical and arithmetic operations. It is not at all clear that that would provide a performance improvement in practice.

Courtesy of @ChristianGibbons:

[Provided that a and b are guaranteed non-negative and less than 2^16,] you can simplify this approach substantially by removing the bitwise AND component and applying the multiplication to the shifts instead of to the arguments:

val = ((uint32_t) a << (16 * !swap)) | ((uint32_t)b << (16 * !!swap));

That stands a better chance of outperforming the original code (but is still by no means certain to do so). In that case, though, a fairer comparison would be with a version of the original that relies on the same properties of the inputs:

uint32_t val;
if (swap) {
   val = (uint32_t)a | ((uint32_t)b << 16);
} else {
   val = (uint32_t)b | ((uint32_t)a << 16);
}
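To make the precondition on the inputs explicit, the shift-based form can be wrapped in a function whose parameters are 16 bits wide. This is only a sketch: the name pack_shift and the choice of uint16_t parameters are assumptions for illustration, not part of the original code.

```c
#include <stdint.h>

/* Assumes a and b both fit in 16 bits; otherwise the two halves would overlap. */
uint32_t pack_shift(uint16_t a, uint16_t b, int swap)
{
    /* Exactly one of the two shift counts is 16, the other is 0. */
    return ((uint32_t)a << (16 * !swap)) | ((uint32_t)b << (16 * !!swap));
}
```

With swap nonzero, a stays in the low half and b moves to the high half, matching the first branch of the if version.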

There is not much to optimize.

Here you have two versions:

typedef union
{
    uint16_t u16[2];
    uint32_t u32;
} D32_t;

uint32_t foo(uint32_t a, uint32_t b, int swap)
{
    D32_t da = {.u32 = a}, db = {.u32 = b}, val;

    if (swap)
    {
        val.u16[0] = da.u16[1];
        val.u16[1] = db.u16[0];
    }
    else
    {
        val.u16[0] = db.u16[1];
        val.u16[1] = da.u16[0];
    }

    return val.u32;
}

uint32_t foo2(uint32_t a, uint32_t b, int swap)
{
    uint32_t val;
    if (swap)
        val = ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16);
    else
        val = ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);

    return val;
}

The generated code is almost the same.

clang:

foo:                                    # @foo
        mov     eax, edi
        test    edx, edx
        mov     ecx, esi
        cmove   ecx, edi
        cmove   eax, esi
        shrd    eax, ecx, 16
        ret
foo2:                                   # @foo2
        movzx   ecx, si
        movzx   eax, di
        shl     edi, 16
        or      edi, ecx
        shl     esi, 16
        or      eax, esi
        test    edx, edx
        cmove   eax, edi
        ret

gcc:

foo:
        test    edx, edx
        je      .L2
        shr     edi, 16
        mov     eax, esi
        mov     edx, edi
        sal     eax, 16
        mov     ax, dx
        ret
.L2:
        shr     esi, 16
        mov     eax, edi
        mov     edx, esi
        sal     eax, 16
        mov     ax, dx
        ret
foo2:
        test    edx, edx
        je      .L6
        movzx   eax, di
        sal     esi, 16
        or      eax, esi
        ret
.L6:
        movzx   eax, si
        sal     edi, 16
        or      eax, edi
        ret

https://godbolt.org/z/F4zOnf

As you can see, clang likes unions; gcc prefers shifts.

In a similar vein to John Bollinger's answer that avoids any branching, I came up with the following to try to reduce the number of operations performed, especially multiplication.

uint8_t shift_mask = (uint8_t) !swap * 16;
val = ((uint32_t) a << (shift_mask)) | ((uint32_t)b << ( 16 ^ shift_mask  ));

Neither compiler actually even uses a multiplication instruction, since the only multiplication here is by a power of two, so it just uses a simple left shift to construct the value that will be used to shift either a or b.
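The XOR step works because 16 ^ 16 is 0 and 16 ^ 0 is 16, so the second shift count is always the opposite of the first. A minimal sketch of that idea as a function follows; the name pack_xor, the uint16_t parameter types, and the local names shift_a and shift_b are assumptions for illustration.

```c
#include <stdint.h>

/* Assumes a and b fit in 16 bits, as in the unsigned case discussed here. */
uint32_t pack_xor(uint16_t a, uint16_t b, int swap)
{
    uint8_t shift_a = (uint8_t)(!swap * 16);   /* 16 when swap == 0, else 0 */
    uint8_t shift_b = (uint8_t)(16 ^ shift_a); /* flips 16 <-> 0 */
    return ((uint32_t)a << shift_a) | ((uint32_t)b << shift_b);
}
```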

Disassembly of original with Clang -O2

0000000000000000 <cat>:
   0:   85 d2                   test   %edx,%edx
   2:   89 f0                   mov    %esi,%eax
   4:   66 0f 45 c7             cmovne %di,%ax
   8:   66 0f 45 fe             cmovne %si,%di
   c:   0f b7 c0                movzwl %ax,%eax
   f:   c1 e7 10                shl    $0x10,%edi
  12:   09 f8                   or     %edi,%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00 

Disassembly of new version with Clang -O2

0000000000000000 <cat>:
   0:   80 f2 01                xor    $0x1,%dl
   3:   0f b6 ca                movzbl %dl,%ecx
   6:   c1 e1 04                shl    $0x4,%ecx
   9:   d3 e7                   shl    %cl,%edi
   b:   83 f1 10                xor    $0x10,%ecx
   e:   d3 e6                   shl    %cl,%esi
  10:   09 fe                   or     %edi,%esi
  12:   89 f0                   mov    %esi,%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00 

Disassembly of original version with gcc -O2

0000000000000000 <cat>:
   0:   84 d2                   test   %dl,%dl
   2:   75 0c                   jne    10 <cat+0x10>
   4:   89 f8                   mov    %edi,%eax
   6:   0f b7 f6                movzwl %si,%esi
   9:   c1 e0 10                shl    $0x10,%eax
   c:   09 f0                   or     %esi,%eax
   e:   c3                      retq   
   f:   90                      nop
  10:   89 f0                   mov    %esi,%eax
  12:   0f b7 ff                movzwl %di,%edi
  15:   c1 e0 10                shl    $0x10,%eax
  18:   09 f8                   or     %edi,%eax
  1a:   c3                      retq   

Disassembly of new version with gcc -O2

0000000000000000 <cat>:
   0:   83 f2 01                xor    $0x1,%edx
   3:   0f b7 c6                movzwl %si,%eax
   6:   0f b7 ff                movzwl %di,%edi
   9:   c1 e2 04                shl    $0x4,%edx
   c:   89 d1                   mov    %edx,%ecx
   e:   83 f1 10                xor    $0x10,%ecx
  11:   d3 e0                   shl    %cl,%eax
  13:   89 d1                   mov    %edx,%ecx
  15:   d3 e7                   shl    %cl,%edi
  17:   09 f8                   or     %edi,%eax
  19:   c3                      retq   

EDIT: As John Bollinger pointed out, this solution was written under the assumption that a and b were unsigned values, rendering the bit-masking redundant. If this approach is to be used with signed values narrower than 32 bits, then it would need modification:

uint8_t shift_mask = (uint8_t) !swap * 16;
val = ((uint32_t) (a & 0xFFFF) << (shift_mask)) | ((uint32_t) (b & 0xFFFF) << ( 16 ^ shift_mask  ));
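To see why the mask matters for signed inputs, consider a negative 16-bit value: converting it to uint32_t sign-extends it, so its upper 16 bits are all ones and would overwrite the other half. The sketch below contrasts the two forms; the function names pack_unmasked and pack_masked and the int16_t parameter types are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Unmasked: only correct when a and b are already in the range [0, 0xFFFF]. */
uint32_t pack_unmasked(int16_t a, int16_t b, int swap)
{
    uint8_t shift_mask = (uint8_t)(!swap * 16);
    return ((uint32_t)a << shift_mask) | ((uint32_t)b << (16 ^ shift_mask));
}

/* Masked: discards the sign-extension bits before shifting. */
uint32_t pack_masked(int16_t a, int16_t b, int swap)
{
    uint8_t shift_mask = (uint8_t)(!swap * 16);
    return ((uint32_t)(a & 0xFFFF) << shift_mask)
         | ((uint32_t)(b & 0xFFFF) << (16 ^ shift_mask));
}
```

With a = -1, b = 0x1234 and swap nonzero, the unmasked form yields 0xFFFFFFFF, while the masked form yields 0x1234FFFF.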

I won't go too far into the disassembly of this version, but here's the clang output at -O2:

0000000000000000 <cat>:
   0:   80 f2 01                xor    $0x1,%dl
   3:   0f b6 ca                movzbl %dl,%ecx
   6:   c1 e1 04                shl    $0x4,%ecx
   9:   0f b7 d7                movzwl %di,%edx
   c:   d3 e2                   shl    %cl,%edx
   e:   0f b7 c6                movzwl %si,%eax
  11:   83 f1 10                xor    $0x10,%ecx
  14:   d3 e0                   shl    %cl,%eax
  16:   09 d0                   or     %edx,%eax
  18:   c3                      retq   
  19:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

In response to P__J__ regarding performance versus his union solution, here is what clang spits out at -O3 for the version of this code that is safe for dealing with signed types:

0000000000000000 <cat>:
   0:   85 d2                   test   %edx,%edx
   2:   89 f0                   mov    %esi,%eax
   4:   66 0f 45 c7             cmovne %di,%ax
   8:   66 0f 45 fe             cmovne %si,%di
   c:   0f b7 c0                movzwl %ax,%eax
   f:   c1 e7 10                shl    $0x10,%edi
  12:   09 f8                   or     %edi,%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00 

It is a bit closer to the union solution in total instructions, but does not use SHRD, which, according to this answer, takes 4 clocks to perform on an Intel Skylake processor and uses up several operation units. I'd be mildly curious how they would each actually perform.

val = swap ? ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16) : ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);

This will achieve the "embedding" you ask for. However, I don't recommend this, as it makes readability worse and brings no runtime optimization.

Compile with -O3. GCC and Clang have slightly different strategies for 64-bit processors. GCC generates code with a branch, whereas Clang will compute both branches and then use a conditional move. Both GCC and Clang will generate a "zero-extend short to int" instruction instead of an and instruction.
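The zero-extension remark can be read as follows: masking with 0x0000ffff and truncating to uint16_t produce the same value, so a compiler is free to pick a zero-extending move for either spelling. A small sketch, with the hypothetical helper names low16_mask and low16_cast:

```c
#include <stdint.h>

/* Masking form, as written in the question. */
uint32_t low16_mask(uint32_t x)
{
    return x & 0x0000ffffu;
}

/* Truncating cast: same value; compilers typically emit a zero-extending move for it. */
uint32_t low16_cast(uint32_t x)
{
    return (uint16_t)x;
}
```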

Using ?: didn't change the generated code in either compiler.

The Clang version does seem more efficient.

All in all, both would generate the same code if you didn't need the swap.

