
Signed saturated add of 64-bit ints?

I'm looking for some C code for signed saturated 64-bit addition that compiles to efficient x86-64 code with the gcc optimizer. Portable code would be ideal, although an asm solution could be used if necessary.

static const int64 kint64max = 0x7fffffffffffffffll;
static const int64 kint64min = 0x8000000000000000ll;

int64 signed_saturated_add(int64 x, int64 y) {
  bool x_is_negative = (x & kint64min) != 0;
  bool y_is_negative = (y & kint64min) != 0;
  int64 sum = x+y;
  bool sum_is_negative = (sum & kint64min) != 0;
  if (x_is_negative != y_is_negative) return sum;  // can't overflow
  if (x_is_negative && !sum_is_negative) return kint64min;
  if (!x_is_negative && sum_is_negative) return kint64max;
  return sum;
}

The function as written produces a fairly lengthy assembly output with several branches. Any tips on optimization? Seems like it ought to be implementable with just an ADD and a few CMOV instructions, but I'm a little bit rusty with this stuff.

This may be optimized further, but here is a portable solution. It does not invoke undefined behavior, and it checks for integer overflow before it could occur.

#include <stdint.h>

int64_t sadd64(int64_t a, int64_t b)
{
    if (a > 0) {
        if (b > INT64_MAX - a) {
            return INT64_MAX;
        }
    } else if (b < INT64_MIN - a) {
            return INT64_MIN;
    }

    return a + b;
}
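A quick sanity check of both saturation directions and an in-range add (the function is repeated here so the snippet is self-contained):

```c
#include <stdint.h>

/* same function as above, repeated so this snippet compiles standalone */
int64_t sadd64(int64_t a, int64_t b)
{
    if (a > 0) {
        if (b > INT64_MAX - a) {
            return INT64_MAX;
        }
    } else if (b < INT64_MIN - a) {
        return INT64_MIN;
    }

    return a + b;
}
```

The comparisons only ever subtract `a` from a bound it cannot overflow past, so no intermediate expression overflows.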

This is a solution that continues in the vein suggested in one of the comments, and that has been used in ouah's solution, too. Here the generated code should be free of conditional jumps:

int64_t signed_saturated_add(int64_t x, int64_t y) {
  // determine the lower or upper bound of the result
  int64_t ret =  (x < 0) ? INT64_MIN : INT64_MAX;
  // this is always well defined:
  // if x < 0 this adds a positive value to INT64_MIN
  // if x > 0 this subtracts a positive value from INT64_MAX
  int64_t comp = ret - x;
  // the condition is equivalent to
  // ((x < 0) && (y > comp)) || ((x >=0) && (y <= comp))
  if ((x < 0) == (y > comp)) ret = x + y;
  return ret;
}

The first ternary looks as if it would need a conditional move, but because of the special values my compiler gets away with an addition: in 2's complement, INT64_MIN is INT64_MAX + 1. There is then only one conditional move, for the assignment of the sum in case everything is fine.

All of this has no UB, because in the abstract state machine the sum is only computed if there is no overflow.
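As a quick check that the branchless logic covers all four cases (the function is repeated so the snippet stands alone):

```c
#include <stdint.h>

/* repeated from above so the snippet compiles standalone */
int64_t signed_saturated_add(int64_t x, int64_t y) {
  // determine the lower or upper bound of the result
  int64_t ret = (x < 0) ? INT64_MIN : INT64_MAX;
  // always well defined: adds to INT64_MIN or subtracts from INT64_MAX
  int64_t comp = ret - x;
  // equivalent to ((x < 0) && (y > comp)) || ((x >= 0) && (y <= comp))
  if ((x < 0) == (y > comp)) ret = x + y;
  return ret;
}
```

For example, with `x = INT64_MAX, y = 1`: `comp` is 0, `y > comp` is true while `x < 0` is false, so the precomputed bound `INT64_MAX` is returned.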

Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?


Compilers are terrible at all of the pure C options proposed so far.

They don't see that they can use the signed-overflow flag result from an add instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an add.

Inline asm is not a bad idea here, but we can avoid it with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html

(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)


This version uses the fact that INT64_MIN = INT64_MAX + 1ULL (for 2's complement) to select INT64_MIN/MAX based on the sign of b. Signed-overflow UB is avoided by using uint64_t for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (the bit-pattern is used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like (b<0) ? INT64_MIN : INT64_MAX (see below for the alternate version using that). I haven't checked the asm on 32-bit architectures.

GCC only supports 2's complement integer types, so a function using __builtin_add_overflow doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for long or int instead of int64_t.

#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif

// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
            // overflow is only possible in one direction depending on the sign bit
            return ((uint64_t)b >> 63) + INT64_MAX;
            // INT64_MIN = INT64_MAX + 1  wraparound, done with unsigned
    }

    return res;
}
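If you'd rather not assume that int64_t is long long, the type-generic __builtin_add_overflow mentioned above (same GCC 5+ requirement, also available in recent clang) can be used the same way. A sketch, with a hypothetical function name:

```c
#include <stdint.h>
#include <stdbool.h>

int64_t signed_sat_add64_generic(int64_t a, int64_t b) {
    int64_t res;
    // type-generic builtin: wrapping add, returns true on signed overflow
    if (__builtin_add_overflow(a, b, &res)) {
        // same MIN = MAX + 1 trick: sign bit of b picks the saturation bound
        return (int64_t)(((uint64_t)b >> 63) + INT64_MAX);
    }
    return res;
}
```

The unsigned-to-signed conversion in the overflow path relies on the GNU C guarantee described above (bit-pattern unchanged).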

Another option is (b>>63) ^ INT64_MAX, which might be useful if manually vectorizing, where SIMD XOR can run on more ports than SIMD ADD (like on Intel before Skylake). (But x86 doesn't have a SIMD 64-bit arithmetic right shift, only logical, so this would only help for an int32_t version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like blendvpd.) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4).

On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):

# gcc9.1 -O3   (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:
        add     rdi, rsi                   # the actual add
        jo      .L3                        # jump on signed overflow
        mov     rax, rdi                   # retval = the non-overflowing add result
        ret
.L3:
        movabs  rax, 9223372036854775807   # INT64_MAX
        shr     rsi, 63                    # b is still available after the ADD
        add     rax, rsi
        ret

When inlining into a loop, the mov imm64 can be hoisted. If b is needed afterwards then we might need an extra mov, otherwise shr / add can destroy b, leaving the INT64_MAX constant in a register undamaged. Or if the compiler wants to use cmov (like clang does), it has to mov / shr because it has to get the saturation constant ready before the add, preserving both operands.

Notice that the critical path for the non-overflowing case only includes an add and a not-taken jo. These can't macro-fuse into a single uop even on Sandybridge-family, but the jo only costs throughput, not latency, thanks to branch prediction + speculative execution. (When inlining, the mov will go away.)

If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a cmovno instead of jo, like clang does. This puts the MIN/MAX selection on the critical path, as well as the CMOV itself. The MIN/MAX selection can run in parallel with the add.

You could use a<0 instead. I used b because I think most people would write x = sadd(x, 123) instead of x = sadd(123, x), and having a compile-time constant allows the b<0 to optimize away. For maximal optimization opportunity, you could use if (__builtin_constant_p(a)) to ask the compiler whether a is a compile-time constant. That works for gcc, but clang evaluates the const-ness too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through __builtin_saddll_overflow: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)

This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only add + cmovo. (Or add + jo to a mov.)

cmov is a 2-uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1-cycle latency. It's a simple ALU select operation; the only complication is the 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.


Unfortunately gcc for AArch64 fails to use adds to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.

Clang does a great job, and AArch64's immediate encodings can represent INT64_MAX as an operand to eor or add.

// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
    orr     x9, xzr, #0x7fffffffffffffff      // mov constant = OR with zero reg
    adds    x8, x0, x1                        // add and set flags
    add     x9, x9, x1, lsr #63               // sat = (b shr 63) + MAX
    csel    x0, x9, x8, vs                    // conditional-select, condition = VS = oVerflow flag Set
    ret

Optimizing MIN vs. MAX selection

As noted above, return (b<0) ? INT64_MIN : INT64_MAX; doesn't compile optimally with most versions of gcc/clang; they generate both constants in registers plus a cmov to select, or something similar on other ISAs.

We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional int64_t type is guaranteed to be 2's complement if it exists. (Signed overflow of int64_t is still UB; this allows it to be a simple typedef of long or long long.)

(On a sign/magnitude C implementation that supported some equivalent of __builtin_add_overflow, a version of this function for long long or int couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could return (b&0x800...) | 0x7FFF... to OR the sign bit of b into a max-magnitude number.)

For two's complement, the bit-patterns for MIN and MAX are 0x8000... (just the high bit set) and 0x7FFF... (all other bits set). They have a couple of interesting properties: MIN = MAX + 1 (if computed with unsigned on the bit-pattern), and MIN = ~MAX: their bit-patterns are bitwise inverses, aka one's complement of each other.

The MIN = ~MAX property follows from ~x = -x - 1 (a re-arrangement of the standard -x = ~x + 1 2's complement identity) and the fact that MIN = -MAX - 1. The +1 property is unrelated; it follows from the simple rollover from most-positive to most-negative, and applies to the one's complement encoding of signed integers as well. (But not to sign/magnitude, where the rollover would give -0.)

The above function uses the MIN = MAX + 1 trick. The MIN = ~MAX trick is also usable, by broadcasting the sign bit to all positions with an arithmetic right shift (creating 0 or 0xFF...) and XORing with that. GNU C guarantees that signed right shifts are arithmetic (sign-extending), so (b>>63) ^ INT64_MAX is equivalent to (b<0) ? INT64_MIN : INT64_MAX in GNU C.
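Both selection tricks can be written as small helpers and checked against each other (hypothetical names; GNU C assumptions: arithmetic right shift on signed types, bit-pattern-preserving unsigned-to-signed conversion):

```c
#include <stdint.h>

// MIN = MAX + 1 trick: logical shift isolates the sign bit, then unsigned add
int64_t bound_add(int64_t b) {
    return (int64_t)(((uint64_t)b >> 63) + INT64_MAX);
}

// MIN = ~MAX trick: arithmetic shift broadcasts the sign bit, then XOR
int64_t bound_xor(int64_t b) {
    return (b >> 63) ^ INT64_MAX;
}
```

For negative b, bound_add computes INT64_MAX + 1 in unsigned arithmetic (the bit pattern of INT64_MIN), while bound_xor computes -1 ^ INT64_MAX = ~INT64_MAX, which is the same bit pattern.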

ISO C leaves signed right shifts implementation-defined, but we could use a ternary of b<0 ? ~0ULL : 0ULL. Compilers will optimize the following to sar / xor, or equivalent instruction(s), but it has no implementation-defined behaviour. AArch64 can use a shifted input operand for eor just as well as it can for add.

        // an earlier version of this answer used this
        int64_t mask = (b<0) ? ~0ULL : 0;  // compiles to sar with good compilers, but is not implementation-defined.
        return mask ^ INT64_MAX;

Fun fact: AArch64 has a csinv instruction: conditional-select inverse. And it can put INT64_MIN into a register with a single 32-bit mov instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using csinv for the MIN = ~MAX trick for the original return (b<0) ? INT64_MIN : INT64_MAX; version.

clang 6.0 and earlier on Godbolt were using shr / add for the plain (b<0) ? INT64_MIN : INT64_MAX; version. That looks more efficient than what clang7/8 do, so I think that's a regression / missed-optimization bug. (And it's the whole point of this section and why I wrote a 2nd version.)

I chose the MIN = MAX + 1 version because it could possibly auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16- and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well, maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.

For scalar it's truly a wash; neither way compiles any better, and sar/shr perform identically, and so do add/xor, on all CPUs Agner Fog has tested (https://agner.org/optimize/).

But + can sometimes optimize into other things, so you could imagine gcc folding a later + or - of a constant into the overflow branch, or possibly using LEA for that add instead of ADD, to copy-and-add. The power difference between a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise of the cost of out-of-order execution and everything else; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.

Also @chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable int mask = ternary thing.

The earlier "simpler" version of this function

Doesn't depend on any special property of MIN/MAX, so it may be useful for saturating to other boundaries with other overflow-detection conditions, or in case a compiler does something better with this version.

int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
            // overflow is only possible in one direction for a given `b`
            return (b<0) ? INT64_MIN : INT64_MAX;
    }
    return res;
}

which compiles as follows:

# gcc9.1 -O3   (clang chooses branchless)
signed_sat_add64_gnuc:
        add     rdi, rsi                   # the actual add
        jo      .L3                        # jump on signed overflow
        mov     rax, rdi                   # retval = the non-overflowing add result
        ret
.L3:
        movabs  rdx, 9223372036854775807
        test    rsi, rsi                      # one of the addends is still available after
        movabs  rax, -9223372036854775808     # missed optimization: lea rdx, [rax+1]
        cmovns  rax, rdx                      # branchless selection of which saturation limit
        ret

This is basically what @drwowe's inline asm does, but with a test replacing one cmov. (And of course different conditions on the cmov.)

Another downside of this vs. the _v2 with shr/add is that it needs 2 constants. In a loop, this would tie up an extra register. (Again, unless b is a compile-time constant.)

clang uses cmov instead of a branch, and does spot the lea rax, [rcx + 1] trick to avoid a 2nd 10-byte mov r64, imm64 instruction. (Or clang6.0 and earlier use the shr 63 / add trick instead of that cmov.)


The first version of this answer put int64_t sat = (b<0) ? MIN : MAX outside the if(), but gcc missed the optimization of moving that inside the branch so it's not run at all in the non-overflow case. That's even better than running it off the critical path. (And it doesn't matter if the compiler decides to go branchless.)

But when I put it outside the if and after the __builtin_saddll_overflow, gcc was really dumb and saved the bool result in an integer, then did the test/cmov, then used test on the saddll_overflow result again to put it back in FLAGS. Reordering the source fixed that.

I'm still looking for a decent portable solution, but this is as good as I've come up with so far:

Suggestions for improvements?

int64 saturated_add(int64 x, int64 y) {
#if __GNUC__ && __x86_64__   /* gcc predefines __x86_64__, lowercase */
  asm("add %1, %0\n\t"
      "jno 1f\n\t"
      "cmovge %3, %0\n\t"
      "cmovl %2, %0\n"
      "1:" : "+r"(x) : "r"(y), "r"(kint64min), "r"(kint64max));
  return x;
#else
  return portable_saturated_add(x, y);
#endif
}
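For reference, here is a self-contained variant of that snippet: the int64 typedef and constants are filled in with int64_t, an explicit "cc" clobber is spelled out for clarity (x86 GCC implies it anyway), and the fallback uses the pre-check approach from ouah's answer:

```c
#include <stdint.h>

typedef int64_t int64;
static const int64 kint64max = INT64_MAX;
static const int64 kint64min = INT64_MIN;

int64 saturated_add(int64 x, int64 y) {
#if __GNUC__ && __x86_64__
  __asm__("add %1, %0\n\t"
          "jno 1f\n\t"
          "cmovge %3, %0\n\t"  /* positive overflow: SF==OF after the add */
          "cmovl %2, %0\n"     /* negative overflow: SF!=OF */
          "1:"
          : "+r"(x)
          : "r"(y), "r"(kint64min), "r"(kint64max)
          : "cc");
  return x;
#else
  /* portable fallback: check before adding, as in ouah's answer */
  if (x > 0) {
    if (y > INT64_MAX - x) return INT64_MAX;
  } else if (y < INT64_MIN - x) {
    return INT64_MIN;
  }
  return x + y;
#endif
}
```

After a signed-overflowing add, SF==OF only for positive overflow, so the two cmovs pick the correct bound; cmov does not modify FLAGS, so the second condition still sees the add's result.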
