

Which integer operations have higher performance alternate methods in Rust?

When writing integer functions in Rust which will run millions of times (think pixel processing), it's useful to use operations with the highest performance - similar to C/C++.

While the reference manual explains changes in behavior, it's not always clear which methods are higher performance than the standard (see note 1) integer arithmetic operations. I'd assume wrapping_add compiles down to something equivalent to C's addition.

Of the standard operations (add / subtract / multiply / modulo / divide / shift / bit manipulation...), which operations have higher performance alternatives which aren't used by default?


Notes:

  1. By standard I mean integer arithmetic written with the operators a + b, i / k, c % e, etc.
    What you would use when writing math expressions - unless you have a special need for using one of the methods that wraps or returns the overflow.
  2. I realize answering this question may require some research. So I'm happy to do some checks by looking at the resulting assembly to see which operations are using unchecked/primitive operations.
  3. It may be that the speed difference between checked/unchecked operations isn't significant; if that's the case, I'd still like to be able to write a 'fast' version of a function to compare against the 'safe' version, so I can come to my own conclusion as to whether it's a reasonable choice for a given function.
  4. Having mentioned pixel processing, SIMD has come up as a possible solution. Even though this is a good suggestion, that still leaves the cases which can't be optimized using SIMD, so the general case of fast integer arithmetic is still something to consider.

Of the standard operations (add / subtract / multiply / modulo / divide / shift / bit manipulation...), which operations have higher performance alternatives which aren't used by default?

Note that Rust was designed for performance; as a result, while integer operations are checked in Debug, they are defined to wrap in Release unless you specifically instruct the compiler otherwise.

As a result, in release mode with default options, there is strictly no performance difference between the following pairs (a short sketch follows the list below):

  • + and wrapping_add
  • - and wrapping_sub
  • * and wrapping_mul
  • / and wrapping_div
  • % and wrapping_rem
  • << and wrapping_shl
  • >> and wrapping_shr
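
As a rough illustration (the function names are mine, purely for the example), both of the following compile to identical machine code in a default release build; this is easy to confirm with cargo rustc --release -- --emit asm or on the Compiler Explorer:

    // In a default release build both functions lower to a single `add`
    // instruction; only with overflow checks enabled does add_plain gain an
    // extra branch to the panic path, while add_wrapping never does.
    pub fn add_plain(a: u32, b: u32) -> u32 {
        a + b
    }

    pub fn add_wrapping(a: u32, b: u32) -> u32 {
        a.wrapping_add(b)
    }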

For unsigned integers, the performance is thus strictly like that of C or C++; for signed integers, however, the optimizer might yield different results since underflow/overflow on signed integers is undefined behavior in C and C++ (gcc and Clang accept a -fwrapv flag to mandate wrapping even for signed integers, but it's not the default).

I expect that using the checked_*, overflowing_* and saturating_* methods will, however, be slower in general.
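
For completeness, here is a small sketch of what those method families return (the values are arbitrary, chosen only to trigger overflow on a u8):

    fn main() {
        let x: u8 = 250;

        assert_eq!(x.checked_add(10), None);          // overflow -> None
        assert_eq!(x.checked_add(5), Some(255));      // fits -> Some(result)
        assert_eq!(x.overflowing_add(10), (4, true)); // wrapped value + flag
        assert_eq!(x.saturating_add(10), 255);        // clamped to u8::MAX
        assert_eq!(x.wrapping_add(10), 4);            // silent wraparound
    }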


An interesting tangent, then, is to understand what happens when you flip the switch and explicitly require checked arithmetic.

Currently, the Rust implementation¹ is a precise implementation of underflow/overflow checking. Each addition, subtraction, multiplication, ... is checked independently, and the optimizer is not good at fusing those branches.

Specifically, a precise implementation precludes temporary overflows: 5 + x - 5 cannot be optimized to x, because 5 + x could overflow. It also precludes auto-vectorization in general.
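
To make that concrete, here is a sketch of that exact case; with overflow checks enabled (for example -C overflow-checks=on, or overflow-checks = true in the release profile of Cargo.toml), the compiler may no longer fold the body down to x, because 5 + x must be able to panic first:

    // With checks off the whole body is just `x` and the optimizer folds it;
    // with checks on, 5 + x must first be tested for overflow, so the
    // algebraic identity cannot be applied and a panic branch remains.
    pub fn add_then_sub(x: i32) -> i32 {
        5 + x - 5
    }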

Only when the optimizer can prove the absence of overflow (which it generally cannot) may you hope to regain a branch-free path that is more amenable to optimization.

One should note that in general software the impact is barely noticeable, as arithmetic instructions represent only a small portion of the overall cost. When this proportion rises, however, it can be very noticeable, and indeed it shows up in parts of the SPEC2006 benchmark with Clang.

This overhead was deemed significant enough that the checks are not activated by default.

¹ This is due to technical limitations on LLVM's side; the Rust implementation just delegates to LLVM.


In the future, there is hope that a fuzzy implementation of the checks will become available. The idea behind a fuzzy implementation is that instead of checking each and every operation, the operations are simply executed, and a flag is set or the values are poisoned in case of underflow/overflow. Then, before the result is used, a single check (branch) is executed.
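
No such mode exists in the compiler, but the idea can be approximated by hand today with the overflowing_* methods: do the work with wrapping semantics, accumulate a flag, and branch only once at the end. A sketch of that pattern (not what Midori or LLVM actually implement):

    // Deferred ("fuzzy") overflow detection by hand: one flag, one branch.
    fn sum_with_deferred_check(values: &[u32]) -> Option<u32> {
        let mut acc: u32 = 0;
        let mut overflowed = false;
        for &v in values {
            let (next, o) = acc.overflowing_add(v);
            acc = next;
            overflowed |= o; // no per-element branch, just a flag update
        }
        if overflowed { None } else { Some(acc) } // single check at the end
    }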

According to Joe Duffy, they had such an implementation in Midori and the performance impact was barely noticeable, so it seems to be feasible. I am not aware of any effort to have anything similar in LLVM yet, though.

Rust gives no guarantees as to the speed of its operations. If you want guarantees, you need to call into assembler.

That said, Rust currently forwards to LLVM, so you can just call the intrinsics, which map 1:1 to LLVM intrinsics, and rely on those guarantees. Still, for anything that isn't asm, be aware that the optimizer might have a different opinion of what you consider optimal, and thus de-optimize your manual calls to LLVM intrinsics.
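
As a concrete (and unsafe) example of opting out of both the check and the wrap, the unchecked_* methods on the integer types map to LLVM's add nsw / add nuw, telling the optimizer that overflow cannot happen. This sketch assumes a recent toolchain where these methods are stable (at the time of the original answer they were nightly-only intrinsics):

    // unchecked_add promises LLVM the addition never overflows, which permits
    // optimizations that neither `+` nor wrapping_add allow. Overflowing here
    // is undefined behavior, so the caller must uphold the promise.
    pub fn add_assume_no_overflow(a: i32, b: i32) -> i32 {
        // SAFETY: the caller must ensure a + b fits in an i32.
        unsafe { a.unchecked_add(b) }
    }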

That said, Rust strives to be as fast as possible, so you can assume (or simply look at the standard library's implementation) that every operation which has an equivalent LLVM intrinsic will map to that intrinsic, and thus be as fast as LLVM can make it.

There is no general rule as to which method is fastest for a given basic arithmetic operation, since it depends entirely on your use case.

think pixel processing

Then you shouldn't be thinking in terms of single-value operations at all; you want to use SIMD instructions instead. At the time of writing these were not available in stable Rust, but some were accessible through feature-gated functions and all were available through assembly.
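
For what it's worth, the x86/x86_64 vendor intrinsics have since been stabilized in std::arch, so explicit SIMD no longer strictly requires nightly. A minimal sketch, assuming an x86_64 target (where SSE2 is part of the baseline) and purely illustrative function names:

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    // Add two pixel buffers element-wise, four u32 lanes per instruction.
    #[cfg(target_arch = "x86_64")]
    pub fn add_pixels(a: &[u32], b: &[u32], out: &mut [u32]) {
        assert!(a.len() == b.len() && a.len() == out.len());
        let chunks = a.len() / 4;
        unsafe {
            for i in 0..chunks {
                let pa = _mm_loadu_si128(a.as_ptr().add(i * 4) as *const __m128i);
                let pb = _mm_loadu_si128(b.as_ptr().add(i * 4) as *const __m128i);
                let sum = _mm_add_epi32(pa, pb); // wrapping add on 4 lanes
                _mm_storeu_si128(out.as_mut_ptr().add(i * 4) as *mut __m128i, sum);
            }
        }
        // Scalar (wrapping) adds for the tail that doesn't fill a full vector.
        for i in chunks * 4..a.len() {
            out[i] = a[i].wrapping_add(b[i]);
        }
    }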

Is it possible LLVM optimizes code into SIMD, like it does for clang?

As aochagavia already replied, yes, LLVM will autovectorize certain types of code. However, when you demand the highest performance, you don't usually want to leave yourself at the whims of the optimizer. I tend to hope for autovectorization in my normal run-of-the-mill code, then write straight-line code for my heavy-math kernels, then write SIMD code, test it for correctness, and benchmark it for speed.
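
A sketch of the kind of straight-line kernel that LLVM will usually autovectorize on its own in release builds (iterator style, no early exits, wrapping arithmetic in the body):

    // Simple branch-free element-wise loops like this are routinely
    // autovectorized; per-element overflow checks or data-dependent branches
    // in the body tend to defeat that.
    pub fn scale_pixels(src: &[u8], dst: &mut [u8], factor: u8) {
        for (d, &s) in dst.iter_mut().zip(src.iter()) {
            *d = s.wrapping_mul(factor);
        }
    }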
