Optimizing the construction of a 32-bit value

Question

So, I have the following code:

uint32_t val;
if (swap) {
   val = ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16);
} else {
   val = ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);
}

Is there a way to optimize it, and have the swap check somehow embedded in the statement?

Answer 1

If the objective is to avoid a branch, then you can write this:

val = (((!!swap) * (uint32_t)a + (!swap) * (uint32_t)b) & 0x0000ffff)
        | (((!!swap) * (uint32_t)b + (!swap) * (uint32_t)a) << 16);

This uses the fact that !x evaluates to 0 whenever x is truthy and to 1 whenever x is falsy, so !!x evaluates to 1 when x is truthy, even though x may not itself be 1. With x being swap, multiplying by the result selects either a or b as appropriate.

Note, however, that instead of one compare and branch you now have multiple logical and arithmetic operations. It is not at all clear that this would provide a performance improvement in practice.
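As a minimal sketch, the multiplication-based selection can be wrapped in a function to check it against the branching version. The name pack_mul and the uint32_t parameter types are assumptions made for illustration; the question does not state the types of a and b.

```c
#include <stdint.h>

/* Branch-free select: exactly one of !!swap and !swap is 1, the other is 0. */
uint32_t pack_mul(uint32_t a, uint32_t b, int swap)
{
    uint32_t lo = (uint32_t)!!swap * a + (uint32_t)!swap * b; /* a if swap, else b */
    uint32_t hi = (uint32_t)!!swap * b + (uint32_t)!swap * a; /* b if swap, else a */
    return (lo & 0x0000ffffu) | (hi << 16);
}
```

Any nonzero swap behaves the same way here, because !! normalizes it to 1 before the multiplication.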

Courtesy of @ChristianGibbons:

[Provided that a and b are guaranteed non-negative and less than 2¹⁶,] you can simplify this approach substantially by removing the bitwise AND component and applying the multiplication to the shifts instead of to the arguments:

val = ((uint32_t) a << (16 * !swap)) | ((uint32_t)b << (16 * !!swap));

That stands a better chance of outperforming the original code (but is still by no means certain to do so), but in that case a fairer comparison would be with a version of the original that relies on the same properties of the inputs:

uint32_t val;
if (swap) {
   val = (uint32_t)a | ((uint32_t)b << 16);
} else {
   val = (uint32_t)b | ((uint32_t)a << 16);
}
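To make the comparison concrete, here is a hedged sketch that puts the shift-based expression and the mask-free branching version side by side. The function names and the uint16_t parameter types are assumptions; uint16_t is simply one way to guarantee that the inputs are non-negative and below 2¹⁶.

```c
#include <stdint.h>

/* Mask-free branching version; valid because the inputs fit in 16 bits. */
uint32_t pack_branch16(uint16_t a, uint16_t b, int swap)
{
    uint32_t val;
    if (swap) {
        val = (uint32_t)a | ((uint32_t)b << 16);
    } else {
        val = (uint32_t)b | ((uint32_t)a << 16);
    }
    return val;
}

/* Branch-free version: the shift count is 0 or 16 depending on swap. */
uint32_t pack_shift16(uint16_t a, uint16_t b, int swap)
{
    return ((uint32_t)a << (16 * !swap)) | ((uint32_t)b << (16 * !!swap));
}
```

Both functions should return the same value for every input; which one is faster still has to be measured on the target compiler and CPU.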

Answer 2

There is not too much to optimize.

Here you have two versions:

typedef union
{
    uint16_t u16[2];
    uint32_t u32;
}D32_t;


uint32_t foo(uint32_t a, uint32_t b, int swap)
{
    D32_t da = {.u32 = a}, db = {.u32 = b}, val;

    if(swap)
    {
        val.u16[0] = da.u16[1];
        val.u16[1] = db.u16[0];
    }
    else
    {
        val.u16[0] = db.u16[1];
        val.u16[1] = da.u16[0];
    }

    return val.u32;
}


uint32_t foo2(uint32_t a, uint32_t b, int swap)
{
    uint32_t val;
    if (swap) 
    {
        val = ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16);
    } 
    else 
    {
        val = ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);
    }

    return val;
}

The generated code is almost the same.

clang:

foo:                                    # @foo
        mov     eax, edi
        test    edx, edx
        mov     ecx, esi
        cmove   ecx, edi
        cmove   eax, esi
        shrd    eax, ecx, 16
        ret
foo2:                                   # @foo2
        movzx   ecx, si
        movzx   eax, di
        shl     edi, 16
        or      edi, ecx
        shl     esi, 16
        or      eax, esi
        test    edx, edx
        cmove   eax, edi
        ret

gcc:

foo:
        test    edx, edx
        je      .L2
        shr     edi, 16
        mov     eax, esi
        mov     edx, edi
        sal     eax, 16
        mov     ax, dx
        ret
.L2:
        shr     esi, 16
        mov     eax, edi
        mov     edx, esi
        sal     eax, 16
        mov     ax, dx
        ret
foo2:
        test    edx, edx
        je      .L6
        movzx   eax, di
        sal     esi, 16
        or      eax, esi
        ret
.L6:
        movzx   eax, si
        sal     edi, 16
        or      eax, edi
        ret

https://godbolt.org/z/F4zOnf

As you can see, clang likes unions, while gcc likes shifts.
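One caveat with the union approach is that u16[0] and u16[1] name halves by their position in memory, so which one holds the low 16 bits depends on the target's byte order. The sketch below is an illustration only: low_half_index and pack_union are hypothetical names, not part of the answer. It probes the byte order at run time and then packs the low halves of a and b the way foo2 does.

```c
#include <stdint.h>

typedef union {
    uint16_t u16[2];
    uint32_t u32;
} D32_t;

/* Index of the u16 element that aliases the low 16 bits of u32:
   0 on little-endian targets, 1 on big-endian ones. */
int low_half_index(void)
{
    D32_t probe = {.u32 = 1u};
    return probe.u16[0] == 1u ? 0 : 1;
}

/* Union-based packing of the low halves, matching foo2's result. */
uint32_t pack_union(uint32_t a, uint32_t b, int swap)
{
    int lo = low_half_index();
    int hi = 1 - lo;
    D32_t da = {.u32 = a}, db = {.u32 = b}, val;

    if (swap) {
        val.u16[lo] = da.u16[lo];   /* low half of a goes to the low half */
        val.u16[hi] = db.u16[lo];   /* low half of b goes to the high half */
    } else {
        val.u16[lo] = db.u16[lo];
        val.u16[hi] = da.u16[lo];
    }
    return val.u32;
}
```

With fixed indices, as in foo, the selected halves follow the memory layout instead, so its result can differ from foo2's for the same inputs.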

Answer 3

In a similar vein to John Bollinger's answer that avoids any branching, I came up with the following to try to reduce the number of operations performed, especially multiplications.

uint8_t shift_mask = (uint8_t) !swap * 16;
val = ((uint32_t) a << (shift_mask)) | ((uint32_t)b << ( 16 ^ shift_mask  ));

Neither compiler actually uses a multiplication instruction, since the only multiplication here is by a power of two; each just uses a simple left shift to construct the value that will be used to shift either a or b.

Disassembly of the original with Clang -O2
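For reference, a self-contained sketch of the two-line snippet as a function. The name cat is taken from the disassembly listings; the uint16_t and bool parameter types are assumptions, since the answer does not show the signature.

```c
#include <stdbool.h>
#include <stdint.h>

uint32_t cat(uint16_t a, uint16_t b, bool swap)
{
    /* 0 when swap is true, 16 when it is false */
    uint8_t shift_mask = (uint8_t)!swap * 16;

    /* 16 ^ shift_mask flips between 16 and 0, so exactly one operand is shifted */
    return ((uint32_t)a << shift_mask) | ((uint32_t)b << (16 ^ shift_mask));
}
```

The XOR replaces the second multiplication of the earlier variant: because shift_mask is either 0 or 16, 16 ^ shift_mask is always the other of the two values.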

0000000000000000 <cat>:
   0:   85 d2                   test   %edx,%edx
   2:   89 f0                   mov    %esi,%eax
   4:   66 0f 45 c7             cmovne %di,%ax
   8:   66 0f 45 fe             cmovne %si,%di
   c:   0f b7 c0                movzwl %ax,%eax
   f:   c1 e7 10                shl    $0x10,%edi
  12:   09 f8                   or     %edi,%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00

Disassembly of the new version with Clang -O2

0000000000000000 <cat>:
   0:   80 f2 01                xor    $0x1,%dl
   3:   0f b6 ca                movzbl %dl,%ecx
   6:   c1 e1 04                shl    $0x4,%ecx
   9:   d3 e7                   shl    %cl,%edi
   b:   83 f1 10                xor    $0x10,%ecx
   e:   d3 e6                   shl    %cl,%esi
  10:   09 fe                   or     %edi,%esi
  12:   89 f0                   mov    %esi,%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00

Disassembly of the original version with gcc -O2

0000000000000000 <cat>:
   0:   84 d2                   test   %dl,%dl
   2:   75 0c                   jne    10 <cat+0x10>
   4:   89 f8                   mov    %edi,%eax
   6:   0f b7 f6                movzwl %si,%esi
   9:   c1 e0 10                shl    $0x10,%eax
   c:   09 f0                   or     %esi,%eax
   e:   c3                      retq   
   f:   90                      nop
  10:   89 f0                   mov    %esi,%eax
  12:   0f b7 ff                movzwl %di,%edi
  15:   c1 e0 10                shl    $0x10,%eax
  18:   09 f8                   or     %edi,%eax
  1a:   c3                      retq

Disassembly of the new version with gcc -O2

0000000000000000 <cat>:
   0:   83 f2 01                xor    $0x1,%edx
   3:   0f b7 c6                movzwl %si,%eax
   6:   0f b7 ff                movzwl %di,%edi
   9:   c1 e2 04                shl    $0x4,%edx
   c:   89 d1                   mov    %edx,%ecx
   e:   83 f1 10                xor    $0x10,%ecx
  11:   d3 e0                   shl    %cl,%eax
  13:   89 d1                   mov    %edx,%ecx
  15:   d3 e7                   shl    %cl,%edi
  17:   09 f8                   or     %edi,%eax
  19:   c3                      retq

EDIT: As John Bollinger pointed out, this solution was written under the assumption that a and b were unsigned values, rendering the bit-masking redundant. If this approach is to be used with signed values narrower than 32 bits, then it would need modification:

uint8_t shift_mask = (uint8_t) !swap * 16;
val = ((uint32_t) (a & 0xFFFF) << (shift_mask)) | ((uint32_t) (b & 0xFFFF) << ( 16 ^ shift_mask  ));

I won't go too far into the disassembly of this version, but here's the clang output at -O2:

0000000000000000 <cat>:
   0:   80 f2 01                xor    $0x1,%dl
   3:   0f b6 ca                movzbl %dl,%ecx
   6:   c1 e1 04                shl    $0x4,%ecx
   9:   0f b7 d7                movzwl %di,%edx
   c:   d3 e2                   shl    %cl,%edx
   e:   0f b7 c6                movzwl %si,%eax
  11:   83 f1 10                xor    $0x10,%ecx
  14:   d3 e0                   shl    %cl,%eax
  16:   09 d0                   or     %edx,%eax
  18:   c3                      retq   
  19:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
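To see why the masking matters for signed inputs, here is a small sketch comparing the masked and unmasked expressions. The function names and the int16_t parameter types are assumptions chosen for illustration. Converting a negative int16_t to uint32_t sets the upper 16 bits, and those bits leak into the result unless they are masked off first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Without masking: sign extension of a negative input corrupts the other half. */
uint32_t cat_unmasked(int16_t a, int16_t b, bool swap)
{
    uint8_t shift_mask = (uint8_t)!swap * 16;
    return ((uint32_t)a << shift_mask) | ((uint32_t)b << (16 ^ shift_mask));
}

/* With masking: only the low 16 bits of each input take part. */
uint32_t cat_masked(int16_t a, int16_t b, bool swap)
{
    uint8_t shift_mask = (uint8_t)!swap * 16;
    return ((uint32_t)(a & 0xFFFF) << shift_mask)
         | ((uint32_t)(b & 0xFFFF) << (16 ^ shift_mask));
}
```

For example, with a = -1, b = 0x1234 and swap true, the masked version yields 0x1234FFFF, while the unmasked one yields 0xFFFFFFFF.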

In response to P__J__ regarding performance versus his union solution, here is what clang spits out at -O3 for the version of this code that is safe for dealing with signed types:

0000000000000000 <cat>:
   0:   85 d2                   test   %edx,%edx
   2:   89 f0                   mov    %esi,%eax
   4:   66 0f 45 c7             cmovne %di,%ax
   8:   66 0f 45 fe             cmovne %si,%di
   c:   0f b7 c0                movzwl %ax,%eax
   f:   c1 e7 10                shl    $0x10,%edi
  12:   09 f8                   or     %edi,%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00

It is a bit closer to the union solution in total instructions, but does not use SHRD, which, according to this answer, takes 4 clocks to perform on an Intel Skylake processor and uses up several operation units. I'd be mildly curious how they would each actually perform.

Answer 4

val = swap ? ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16) : ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);

This will achieve the "embedding" you ask for. However, I don't recommend this, as it makes readability worse and provides no runtime optimization.

Answer 5

Compile with -O3. GCC and Clang have slightly different strategies for 64-bit processors. GCC generates code with a branch, whereas Clang will run both branches and then use a conditional move. Both GCC and Clang will generate a "zero-extend short to int" instruction instead of and.

Using ?: didn't change the generated code in either.

The Clang version does seem more efficient.

All in all, both would generate the same code if you didn't need the swap.
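The zero-extend observation follows from the fact that masking with 0x0000ffff and converting through uint16_t compute the same value. A minimal sketch, with hypothetical function names:

```c
#include <stdint.h>

/* Keep the low 16 bits by masking. */
uint32_t low16_mask(uint32_t x)
{
    return x & 0x0000ffffu;
}

/* Keep the low 16 bits by truncating to 16 bits and zero-extending back. */
uint32_t low16_cast(uint32_t x)
{
    return (uint32_t)(uint16_t)x;
}
```

Since the two are equivalent, a compiler is free to pick whichever instruction it prefers for either spelling, such as the movzx / movzwl seen in the listings in the other answers.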

5 answers

Answer 1: 2, 2019-04-22 19:27:45
Answer 2: 1, 2019-04-22 19:21:12
Answer 3: 1, 2019-04-22 20:16:20
Answer 4: 0, 2019-04-22 19:20:24
Answer 5: 0, 2019-04-22 19:20:41
