Optimize 32-bit value construction
So, I have the following code:
uint32_t val;
if (swap) {
    val = ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16);
} else {
    val = ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);
}
Is there a way to optimize it, and have the swap check somehow embedded in the statement?
If the objective is to avoid a branch, then you can write this:
val = (((!!swap) * (uint32_t)a + (!swap) * (uint32_t)b) & 0x0000ffff)
    | (((!!swap) * (uint32_t)b + (!swap) * (uint32_t)a) << 16);
This uses the fact that !swap evaluates to 0 whenever swap is truthy and to 1 whenever swap is falsey, and so also !!swap evaluates to 1 when swap is truthy, even though swap may not itself be 1. Multiplying by the result selects either a or b as appropriate.
Note, however, that instead of one compare and branch you now have multiple logical and arithmetic operations. It is not at all clear that this would provide a performance improvement in practice.
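To check that the branch-free expression really produces the same value as the original, here is a minimal sketch. The function names pack_branchy and pack_branchless are made up for illustration, and the int16_t parameter types are an assumption, since the question does not say what types a and b have.

```c
#include <stdint.h>

/* Original version: one compare and branch. */
uint32_t pack_branchy(int16_t a, int16_t b, int swap)
{
    uint32_t val;
    if (swap) {
        val = ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16);
    } else {
        val = ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);
    }
    return val;
}

/* Branch-free version: !!swap is 1 and !swap is 0 when swap is nonzero,
   and the other way round when swap is zero, so each sum keeps exactly
   one of its two terms. */
uint32_t pack_branchless(int16_t a, int16_t b, int swap)
{
    uint32_t lo = (uint32_t)!!swap * (uint32_t)a + (uint32_t)!swap * (uint32_t)b;
    uint32_t hi = (uint32_t)!!swap * (uint32_t)b + (uint32_t)!swap * (uint32_t)a;
    return (lo & 0x0000ffff) | (hi << 16);
}
```

Both functions should agree for any swap value, including values other than 0 and 1, because the double negation normalizes swap before it is used as a multiplier.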
Courtesy of @ChristianGibbons:
[Provided that a and b are guaranteed non-negative and less than 2^16,] you can simplify this approach substantially by removing the bitwise AND component and applying the multiplication to the shifts instead of to the arguments:
val = ((uint32_t) a << (16 * !swap)) | ((uint32_t)b << (16 * !!swap));
That stands a better chance of outperforming the original code (but is still by no means certain to do so). In that case, though, a fairer comparison would be with a version of the original that relies on the same properties of the inputs:
uint32_t val;
if (swap) {
    val = (uint32_t)a | ((uint32_t)b << 16);
} else {
    val = (uint32_t)b | ((uint32_t)a << 16);
}
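As a sketch of the simplified form, the shift-based expression can be wrapped in a function whose parameter types enforce the precondition. The name pack_shift and the choice of uint16_t parameters are assumptions for illustration; they are not part of the original answer.

```c
#include <stdint.h>

/* uint16_t parameters make the precondition (0 <= a, b < 2^16) hold by
   construction, so no masking is needed. */
uint32_t pack_shift(uint16_t a, uint16_t b, int swap)
{
    /* swap nonzero: a shifts by 0 (low half), b shifts by 16 (high half).
       swap zero:    a shifts by 16 (high half), b shifts by 0 (low half). */
    return ((uint32_t)a << (16 * !swap)) | ((uint32_t)b << (16 * !!swap));
}
```

With a = 0x1234 and b = 0x5678, a nonzero swap puts a in the low half and b in the high half, and a zero swap does the reverse.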
There is not too much to optimize.
Here you have two versions:
typedef union
{
    uint16_t u16[2];
    uint32_t u32;
}D32_t;
uint32_t foo(uint32_t a, uint32_t b, int swap)
{
    D32_t da = {.u32 = a}, db = {.u32 = b}, val;
    if(swap)
    {
        val.u16[0] = da.u16[1];
        val.u16[1] = db.u16[0];
    }
    else
    {
        val.u16[0] = db.u16[1];
        val.u16[1] = da.u16[0];
    }
    return val.u32;
}
uint32_t foo2(uint32_t a, uint32_t b, int swap)
{
    uint32_t val;
    if (swap)
    {
        val = ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16);
    }
    else
    {
        val = ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);
    }
    return val;
}
The generated code is almost the same.
clang:
foo: # @foo
mov eax, edi
test edx, edx
mov ecx, esi
cmove ecx, edi
cmove eax, esi
shrd eax, ecx, 16
ret
foo2: # @foo2
movzx ecx, si
movzx eax, di
shl edi, 16
or edi, ecx
shl esi, 16
or eax, esi
test edx, edx
cmove eax, edi
ret
gcc:
foo:
test edx, edx
je .L2
shr edi, 16
mov eax, esi
mov edx, edi
sal eax, 16
mov ax, dx
ret
.L2:
shr esi, 16
mov eax, edi
mov edx, esi
sal eax, 16
mov ax, dx
ret
foo2:
test edx, edx
je .L6
movzx eax, di
sal esi, 16
or eax, esi
ret
.L6:
movzx eax, si
sal edi, 16
or eax, edi
ret
https://godbolt.org/z/F4zOnf
As you can see, clang likes unions and gcc likes shifts.
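One caveat worth checking: foo reads u16[1] of one input and u16[0] of the other, and which half of the word u16[0] refers to depends on byte order. On a little-endian target such as x86-64, foo with swap set returns (a >> 16) | (b << 16), which matches the shrd in the clang listing but is not the value foo2 computes. Below is a sketch of a union variant that tracks foo2 instead. The names pack_union and low_half_is_index0 are hypothetical, and this is not the code the listings above were generated from.

```c
#include <stdint.h>

typedef union {
    uint16_t u16[2];
    uint32_t u32;
} D32_t;

/* Returns 1 when u16[0] aliases the low 16 bits of u32 (little-endian). */
int low_half_is_index0(void)
{
    D32_t probe = {.u32 = 0x00000001u};
    return probe.u16[0] == 1;
}

/* Union-based packing that mirrors the shift-and-mask version: the low
   half of a (or b) ends up in the low half of the result. The index of
   the low half is picked at run time instead of being hard-coded. */
uint32_t pack_union(uint32_t a, uint32_t b, int swap)
{
    const int lo = low_half_is_index0() ? 0 : 1;
    const int hi = 1 - lo;
    D32_t da = {.u32 = a}, db = {.u32 = b}, val;
    if (swap) {
        val.u16[lo] = da.u16[lo];
        val.u16[hi] = db.u16[lo];
    } else {
        val.u16[lo] = db.u16[lo];
        val.u16[hi] = da.u16[lo];
    }
    return val.u32;
}
```

Reading a union member other than the one last written is permitted in C, but not in C++, so this sketch is C only.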
In a similar vein to John Bollinger's answer that avoids any branching, I came up with the following to try to reduce the number of operations performed, especially multiplication.
uint8_t shift_mask = (uint8_t) !swap * 16;
val = ((uint32_t) a << (shift_mask)) | ((uint32_t)b << ( 16 ^ shift_mask ));
Neither compiler actually uses a multiplication instruction, since the only multiplication here is by a power of two. Each just uses a simple left shift to construct the value that will then be used to shift either a or b.
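A small sketch of how the two shift amounts relate, wrapped in a function so the values can be checked. The name pack_xor_shift and the uint16_t parameter types are assumptions that reflect the unsigned-input premise of this answer.

```c
#include <stdint.h>

uint32_t pack_xor_shift(uint16_t a, uint16_t b, int swap)
{
    /* !swap is 0 or 1, so shift_mask is either 0 or 16. */
    uint8_t shift_mask = (uint8_t)(!swap * 16);
    /* 16 ^ 0 == 16 and 16 ^ 16 == 0, so b always gets the other shift. */
    return ((uint32_t)a << shift_mask) | ((uint32_t)b << (16 ^ shift_mask));
}
```

The XOR replaces the second negation: one of the two values is always shifted by 16 and the other by 0, so only one of the shift amounts has to be derived from swap.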
Disassembly of the original with Clang -O2
0000000000000000 <cat>:
0: 85 d2 test %edx,%edx
2: 89 f0 mov %esi,%eax
4: 66 0f 45 c7 cmovne %di,%ax
8: 66 0f 45 fe cmovne %si,%di
c: 0f b7 c0 movzwl %ax,%eax
f: c1 e7 10 shl $0x10,%edi
12: 09 f8 or %edi,%eax
14: c3 retq
15: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
1c: 00 00 00 00
Disassembly of the new version with Clang -O2
0000000000000000 <cat>:
0: 80 f2 01 xor $0x1,%dl
3: 0f b6 ca movzbl %dl,%ecx
6: c1 e1 04 shl $0x4,%ecx
9: d3 e7 shl %cl,%edi
b: 83 f1 10 xor $0x10,%ecx
e: d3 e6 shl %cl,%esi
10: 09 fe or %edi,%esi
12: 89 f0 mov %esi,%eax
14: c3 retq
15: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
1c: 00 00 00 00
Disassembly of the original version with gcc -O2
0000000000000000 <cat>:
0: 84 d2 test %dl,%dl
2: 75 0c jne 10 <cat+0x10>
4: 89 f8 mov %edi,%eax
6: 0f b7 f6 movzwl %si,%esi
9: c1 e0 10 shl $0x10,%eax
c: 09 f0 or %esi,%eax
e: c3 retq
f: 90 nop
10: 89 f0 mov %esi,%eax
12: 0f b7 ff movzwl %di,%edi
15: c1 e0 10 shl $0x10,%eax
18: 09 f8 or %edi,%eax
1a: c3 retq
Disassembly of the new version with gcc -O2
0000000000000000 <cat>:
0: 83 f2 01 xor $0x1,%edx
3: 0f b7 c6 movzwl %si,%eax
6: 0f b7 ff movzwl %di,%edi
9: c1 e2 04 shl $0x4,%edx
c: 89 d1 mov %edx,%ecx
e: 83 f1 10 xor $0x10,%ecx
11: d3 e0 shl %cl,%eax
13: 89 d1 mov %edx,%ecx
15: d3 e7 shl %cl,%edi
17: 09 f8 or %edi,%eax
19: c3 retq
EDIT: As John Bollinger pointed out, this solution was written under the assumption that a and b were unsigned values, which renders the bit-masking redundant. If this approach is to be used with signed values narrower than 32 bits, then it needs a modification:
uint8_t shift_mask = (uint8_t) !swap * 16;
val = ((uint32_t) (a & 0xFFFF) << (shift_mask)) | ((uint32_t) (b & 0xFFFF) << ( 16 ^ shift_mask ));
I won't go too far into the disassembly of this version, but here's the clang output at -O2:
0000000000000000 <cat>:
0: 80 f2 01 xor $0x1,%dl
3: 0f b6 ca movzbl %dl,%ecx
6: c1 e1 04 shl $0x4,%ecx
9: 0f b7 d7 movzwl %di,%edx
c: d3 e2 shl %cl,%edx
e: 0f b7 c6 movzwl %si,%eax
11: 83 f1 10 xor $0x10,%ecx
14: d3 e0 shl %cl,%eax
16: 09 d0 or %edx,%eax
18: c3 retq
19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
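To see why the mask matters for signed inputs, here is a sketch comparing the unmasked and masked forms on a negative value. The function names pack_unmasked and pack_masked and the int16_t parameter types are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Unmasked form: only correct when a and b fit in 16 unsigned bits.
   A negative int16_t sign-extends when converted to uint32_t, so its
   upper 16 bits are all ones and overwrite the other half. */
uint32_t pack_unmasked(int16_t a, int16_t b, int swap)
{
    uint8_t shift_mask = (uint8_t)(!swap * 16);
    return ((uint32_t)a << shift_mask) | ((uint32_t)b << (16 ^ shift_mask));
}

/* Masked form: the sign-extended upper bits are cleared before the shift. */
uint32_t pack_masked(int16_t a, int16_t b, int swap)
{
    uint8_t shift_mask = (uint8_t)(!swap * 16);
    return ((uint32_t)(a & 0xFFFF) << shift_mask)
         | ((uint32_t)(b & 0xFFFF) << (16 ^ shift_mask));
}
```

With a = -1, b = 2 and swap set, the unmasked form yields 0xFFFFFFFF because the sign extension of a covers the half that should hold b, while the masked form yields 0x0002FFFF.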
In response to P__J__ with regard to performance versus his union solution, here is what clang spits out at -O3 for the version of this code that is safe for dealing with signed types:
0000000000000000 <cat>:
0: 85 d2 test %edx,%edx
2: 89 f0 mov %esi,%eax
4: 66 0f 45 c7 cmovne %di,%ax
8: 66 0f 45 fe cmovne %si,%di
c: 0f b7 c0 movzwl %ax,%eax
f: c1 e7 10 shl $0x10,%edi
12: 09 f8 or %edi,%eax
14: c3 retq
15: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
1c: 00 00 00 00
It is a bit closer to the union solution in total instructions, but does not use SHRD which, according to this answer, takes 4 clocks to perform on an Intel Skylake processor and uses up several operation units. I'd be mildly curious how they would each actually perform.
val = swap ? ((uint32_t)a & 0x0000ffff) | ((uint32_t)b << 16) : ((uint32_t)b & 0x0000ffff) | ((uint32_t)a << 16);
This will achieve the "embedding" you ask for. However, I don't recommend this, as it makes readability worse and provides no runtime optimization.
Compile with -O3. GCC and Clang have slightly different strategies for 64-bit processors. GCC generates code with a branch, whereas Clang will run both branches and then use a conditional move. Both GCC and Clang will generate a "zero-extend short to int" instruction instead of and.
Using ?: didn't change the generated code in either compiler.
The Clang version does seem more efficient.
All in all, both would generate the same code if you didn't need the swap.