Signed saturated add of 64-bit ints?
I'm looking for some C code for signed saturated 64-bit addition that compiles to efficient x86-64 code with the gcc optimizer. Portable code would be ideal, although an asm solution could be used if necessary.
static const int64 kint64max = 0x7fffffffffffffffll;
static const int64 kint64min = 0x8000000000000000ll;
int64 signed_saturated_add(int64 x, int64 y) {
  bool x_is_negative = (x & kint64min) != 0;
  bool y_is_negative = (y & kint64min) != 0;
  int64 sum = x + y;
  bool sum_is_negative = (sum & kint64min) != 0;
  if (x_is_negative != y_is_negative) return sum;  // can't overflow
  if (x_is_negative && !sum_is_negative) return kint64min;
  if (!x_is_negative && sum_is_negative) return kint64max;
  return sum;
}
The function as written produces a fairly lengthy assembly output with several branches. Any tips on optimization? Seems like it ought to be implementable with just an ADD with a few CMOV instructions, but I'm a little bit rusty with this stuff.
This may be optimized further, but here is a portable solution. It does not invoke undefined behavior, and it checks for integer overflow before it could occur.
#include <stdint.h>
int64_t sadd64(int64_t a, int64_t b)
{
    if (a > 0) {
        if (b > INT64_MAX - a) {
            return INT64_MAX;
        }
    } else if (b < INT64_MIN - a) {
        return INT64_MIN;
    }
    return a + b;
}
This is a solution that continues in the vein that had been given in one of the comments, and has been used in ouah's solution, too. Here the generated code should be without conditional jumps:
int64_t signed_saturated_add(int64_t x, int64_t y) {
    // determine the lower or upper bound of the result
    int64_t ret = (x < 0) ? INT64_MIN : INT64_MAX;
    // this is always well defined:
    // if x < 0 this adds a positive value to INT64_MIN
    // if x > 0 this subtracts a positive value from INT64_MAX
    int64_t comp = ret - x;
    // the condition is equivalent to
    // ((x < 0) && (y > comp)) || ((x >= 0) && (y <= comp))
    if ((x < 0) == (y > comp)) ret = x + y;
    return ret;
}
The first ternary looks as if there would be a conditional move to do, but because of the special values my compiler gets off with an addition: in 2's complement, INT64_MIN is INT64_MAX + 1. There is then only one conditional move, for the assignment of the sum in case everything is fine. All of this has no UB, because in the abstract state machine the sum is only done if there is no overflow.
Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?
Compilers are terrible at all of the pure C options proposed so far. They don't see that they can use the signed-overflow flag result from an add instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an add.
Inline asm is not a bad idea here, but we can avoid that with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)
This version uses the fact that INT64_MIN = INT64_MAX + 1ULL (for 2's complement) to select INT64_MIN/MAX based on the sign of b. Signed-overflow UB is avoided by using uint64_t for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (the bit-pattern is used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like (b<0) ? INT64_MIN : INT64_MAX. (See below for the alternate version using that.) I haven't checked the asm on 32-bit architectures.
GCC only supports 2's complement integer types, so a function using __builtin_add_overflow doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for long or int instead of int64_t.
#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif
// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
        // overflow is only possible in one direction depending on the sign bit
        return ((uint64_t)b >> 63) + INT64_MAX;
        // INT64_MIN = INT64_MAX + 1  wraparound, done with unsigned
    }
    return res;
}
Another option is (b>>63) ^ INT64_MAX, which might be useful if manually vectorizing where SIMD XOR can run on more ports than SIMD ADD, like on Intel before Skylake. (But x86 doesn't have SIMD 64-bit arithmetic right shift, only logical, so this would only help for an int32_t version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like blendvpd.) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4).
On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):
# gcc9.1 -O3 (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rax, 9223372036854775807 # INT64_MAX
shr rsi, 63 # b is still available after the ADD
add rax, rsi
ret
When inlining into a loop, the mov imm64 can be hoisted. If b is needed afterwards then we might need an extra mov, otherwise shr / add can destroy b, leaving the INT64_MAX constant in a register undamaged. Or if the compiler wants to use cmov (like clang does), it has to mov / shr because it has to get the saturation constant ready before the add, preserving both operands.
Notice that the critical path for the non-overflowing case only includes an add and a not-taken jo. These can't macro-fuse into a single uop even on Sandybridge-family, but the jo only costs throughput, not latency, thanks to branch prediction + speculative execution. (When inlining, the mov will go away.)
If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a cmovno instead of jo, like clang does. This puts the MIN/MAX selection on the critical path, as well as the CMOV itself. The MIN/MAX selection can run in parallel with the add.
You could use a<0 instead. I used b because I think most people would write x = sadd(x, 123) instead of x = sadd(123, x), and having a compile-time constant allows the b<0 to optimize away. For maximal optimization opportunity, you could use if (__builtin_constant_p(a)) to ask the compiler whether a was a compile-time constant. That works for gcc, but clang evaluates the const-ness too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through __builtin_saddll_overflow: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)
This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only add + cmovo. (Or add + jo to a mov.)
cmov is a 2-uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1-cycle latency. It's a simple ALU select operation; the only complication is the 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.
Unfortunately gcc for AArch64 fails to use adds to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.
Clang does a great job, and AArch64's immediate encodings can represent INT64_MAX as an operand to eor or add.
// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
orr x9, xzr, #0x7fffffffffffffff // mov constant = OR with zero reg
adds x8, x0, x1 // add and set flags
add x9, x9, x1, lsr #63 // sat = (b shr 63) + MAX
csel x0, x9, x8, vs // conditional-select, condition = VS = oVerflow flag Set
ret
MIN vs. MAX selection

As noted above, return (b<0) ? INT64_MIN : INT64_MAX; doesn't compile optimally with most versions of gcc/clang; they generate both constants in registers and a cmov to select, or something similar on other ISAs.
We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional int64_t type is guaranteed to be 2's complement if it exists. (Signed overflow of int64_t is still UB; this allows it to be a simple typedef of long or long long.)
(On a sign/magnitude C implementation that supported some equivalent of __builtin_add_overflow, a version of this function for long long or int couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could return (b&0x800...) | 0x7FFF... to OR the sign bit of b into a max-magnitude number.)
For two's complement, the bit-patterns for MIN and MAX are 0x8000... (just the high bit set) and 0x7FFF... (all other bits set). They have a couple of interesting properties: MIN = MAX + 1 (if computed with unsigned on the bit-pattern), and MIN = ~MAX: their bit-patterns are bitwise inverses, aka one's complement of each other.
The MIN = ~MAX property follows from ~x = -x - 1 (a re-arrangement of the standard -x = ~x + 1 2's complement identity) and the fact that MIN = -MAX - 1. The +1 property is unrelated, and follows from simple rollover from most-positive to most-negative, and applies to the one's complement encoding of signed integers as well. (But not sign/magnitude; you'd get -0 where the unsigned magnitude rolls over.)
The above function uses the MIN = MAX + 1 trick. The MIN = ~MAX trick is also usable, by broadcasting the sign bit to all positions with an arithmetic right shift (creating 0 or 0xFF...) and XORing with that. GNU C guarantees that signed right shifts are arithmetic (sign-extension), so (b>>63) ^ INT64_MAX is equivalent to (b<0) ? INT64_MIN : INT64_MAX in GNU C.
ISO C leaves signed right shifts implementation-defined, but we could use a ternary of b<0 ? ~0ULL : 0ULL. Compilers will optimize the following to sar / xor, or equivalent instruction(s), but it has no implementation-defined behaviour. AArch64 can use a shifted input operand for eor just as well as it can for add.
// an earlier version of this answer used this
int64_t mask = (b<0) ? ~0ULL : 0; // compiles to sar with good compilers, but is not implementation-defined.
return mask ^ INT64_MAX;
Fun fact: AArch64 has a csinv instruction: conditional-select inverse. And it can put INT64_MIN into a register with a single 32-bit mov instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using csinv for the MIN = ~MAX trick for the original return (b<0) ? INT64_MIN : INT64_MAX; version.
clang 6.0 and earlier on Godbolt were using shr / add for the plain (b<0) ? INT64_MIN : INT64_MAX; version. That looks more efficient than what clang7/8 do, so I think that's a regression / missed-optimization bug. (And it's the whole point of this section and why I wrote a 2nd version.)
I chose the MIN = MAX + 1 version because it could possibly auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16 and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well, maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.
For scalar it's truly a wash; neither way compiles any better, and sar/shr perform identically, and so do add/xor, on all CPUs that Agner Fog has tested (https://agner.org/optimize/).
But + can sometimes optimize into other things, so you could imagine gcc folding a later + or - of a constant into the overflow branch. Or possibly using LEA for that add instead of ADD, to copy-and-add. The difference in power between a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise from the cost of all the power it takes to do out-of-order execution and other stuff; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.
Also, @chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable int mask = ternary thing.
It doesn't depend on any special property of MIN/MAX, so it's maybe useful for saturating to other boundaries with other overflow-detection conditions. Or in case a compiler does something better with this version.
int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
        // overflow is only possible in one direction for a given `b`
        return (b<0) ? INT64_MIN : INT64_MAX;
    }
    return res;
}
which compiles as follows:
# gcc9.1 -O3 (clang chooses branchless)
signed_sat_add64_gnuc:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rdx, 9223372036854775807
test rsi, rsi # one of the addends is still available after
movabs rax, -9223372036854775808 # missed optimization: lea rdx, [rax+1]
cmovns rax, rdx # branchless selection of which saturation limit
ret
This is basically what @drwowe's inline asm does, but with a test replacing one cmov. (And of course different conditions on the cmov.)
Another downside to this vs. the _v2 with shr/add is that this needs 2 constants. In a loop, this would tie up an extra register. (Again, unless b is a compile-time constant.)
clang uses cmov instead of a branch, and does spot the lea rax, [rcx + 1] trick to avoid a 2nd 10-byte mov r64, imm64 instruction. (Or clang6.0 and earlier used the shr 63 / add trick instead of that cmov.)
The first version of this answer put int64_t sat = (b<0) ? MIN : MAX outside the if(), but gcc missed the optimization of moving that inside the branch so it's not run at all for the non-overflow case. That's even better than running it off the critical path. (And it doesn't matter if the compiler decides to go branchless.)
But when I put it outside the if and after the __builtin_saddll_overflow, gcc was really dumb and saved the bool result in an integer, then did the test/cmov, then used test on the saddll_overflow result again to put it back in FLAGS. Reordering the source fixed that.
I'm still looking for a decent portable solution, but this is as good as I've come up with so far. Suggestions for improvements?
int64 saturated_add(int64 x, int64 y) {
#if __GNUC__ && __X86_64__
  asm("add %1, %0\n\t"
      "jno 1f\n\t"
      "cmovge %3, %0\n\t"
      "cmovl %2, %0\n"
      "1:" : "+r"(x) : "r"(y), "r"(kint64min), "r"(kint64max));
  return x;
#else
  return portable_saturated_add(x, y);
#endif
}