What optimization techniques are applied to Rust code that sums up a simple arithmetic sequence?
The code is naive:
use std::time;

fn main() {
    const NUM_LOOP: u64 = std::u64::MAX;
    let mut sum = 0u64;
    let now = time::Instant::now();
    for i in 0..NUM_LOOP {
        sum += i;
    }
    let d = now.elapsed();
    println!("{}", sum);
    println!("loop: {}.{:09}s", d.as_secs(), d.subsec_nanos());
}
The output is:
$ ./test.rs.out
9223372036854775809
loop: 0.000000060s
$ ./test.rs.out
9223372036854775809
loop: 0.000000052s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s
$ ./test.rs.out
9223372036854775809
loop: 0.000000041s
$ ./test.rs.out
9223372036854775809
loop: 0.000000046s
$ ./test.rs.out
9223372036854775809
loop: 0.000000047s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s
The program ends almost immediately. I also wrote equivalent code in C using a for loop, but it ran for a long time. I'm wondering what makes the Rust code so fast.
The C code:
#include <stdint.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

double time_elapse(struct timespec start) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec - start.tv_sec +
           (now.tv_nsec - start.tv_nsec) / 1000000000.;
}

int main() {
    const uint64_t NUM_LOOP = 18446744073709551615u;
    uint64_t sum = 0;
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    for (int i = 0; i < NUM_LOOP; ++i) {
        sum += i;
    }
    double t = time_elapse(now);
    printf("value of sum is: %llu\n", sum);
    printf("time elapse is: %lf sec\n", t);
    return 0;
}
The Rust code is compiled using -O and the C code is compiled using -O3. The C code runs so slowly that it hasn't stopped yet.
After fixing the bug found by visibleman and Sandeep, both programs printed the same number in almost no time. I tried reducing NUM_LOOP by one, and the results seemed reasonable considering an overflow. Moreover, with NUM_LOOP = 1000000000, neither program overflows and both produce correct answers in no time. What optimizations are used here? I know we can use simple equations like (0 + NUM_LOOP - 1) * NUM_LOOP / 2 to compute the result, but I don't think such computations are done by the compilers with an overflow case...
Since an int can never be as big as your NUM_LOOP, the program will loop eternally.
const uint64_t NUM_LOOP = 18446744073709551615u;
for (int i = 0; i < NUM_LOOP; ++i) { // Change this to an uint64_t
If you fix the int bug, the compiler will optimize away these loops in both cases.
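The range argument above can be checked with a small Rust sketch. It assumes a 32-bit C int (modeled here as i32), and the helper name c_style_less_than is hypothetical. The `as u64` cast mirrors the implicit conversion C applies to `i` before comparing it with a uint64_t.

```rust
// Hypothetical helper for this sketch: models the C condition `i < NUM_LOOP`
// where `i` is a 32-bit `int` (assumption) and NUM_LOOP is a `uint64_t`.
// `as u64` sign-extends, matching C's int -> uint64_t conversion.
fn c_style_less_than(i: i32, num_loop: u64) -> bool {
    (i as u64) < num_loop
}

fn main() {
    const NUM_LOOP: u64 = u64::MAX;

    // The first and the largest value the counter can legitimately hold
    // both satisfy the loop condition, so the loop has no way to exit.
    println!("i = 0:        {}", c_style_less_than(0, NUM_LOOP));
    println!("i = i32::MAX: {}", c_style_less_than(i32::MAX, NUM_LOOP));
}
```

Every value from 0 through i32::MAX passes the check, and incrementing past INT_MAX is signed overflow, which C leaves undefined, so the original loop has no well-defined exit. With a 64-bit unsigned counter the condition can become false.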
Your Rust code (without the prints and timing) compiles down to (on Godbolt):
movabs rax, -9223372036854775807
ret
LLVM just const-folds the whole function and calculates the final value for you.
Let's make the upper limit dynamic (non-constant) to avoid this aggressive constant folding:
pub fn foo(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum += i;
    }
    sum
}
This results in (Godbolt):
test rdi, rdi ; if num == 0
je .LBB0_1 ; jump to .LBB0_1
lea rax, [rdi - 1] ; sum = num - 1
lea rcx, [rdi - 2] ; rcx = num - 2
mul rcx ; sum = sum * rcx
shld rdx, rax, 63 ; rdx = sum / 2
lea rax, [rdx + rdi] ; sum = rdx + num
add rax, -1 ; sum -= 1
ret
.LBB0_1:
xor eax, eax ; sum = 0
ret
As you can see, the optimizer understood that you summed all numbers from 0 to num (exclusive) and replaced your loop with a closed-form, constant-time formula: ((num - 1) * (num - 2)) / 2 + num - 1. As for the example above: the optimizer probably first rewrote the code into this formula and then constant-folded it.
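On the overflow concern from the question: in the assembly, mul leaves the full 128-bit product in rdx:rax, and shld takes the halved value from that wide product, so the division by two happens before anything is truncated to 64 bits. The following is a minimal sketch of that idea, not the compiler's actual transformation. The names sum_loop and sum_closed_form are hypothetical, and the loop uses wrapping_add so it behaves the same in debug and release builds.

```rust
// The loop from the answer, with explicit wrapping arithmetic.
fn sum_loop(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum = sum.wrapping_add(i);
    }
    sum
}

// Closed form mirroring the assembly:
// ((num - 1) * (num - 2)) / 2 + (num - 1), truncated to 64 bits.
fn sum_closed_form(num: u64) -> u64 {
    if num == 0 {
        return 0; // the `test rdi, rdi` / `je` branch
    }
    let a = (num - 1) as u128;
    // num - 2 wraps for num == 1, just as `lea rcx, [rdi - 2]` does;
    // the product is still 0 in that case because a == 0.
    let b = num.wrapping_sub(2) as u128;
    // Full 128-bit product, halved, then truncated to the low 64 bits.
    let half = ((a * b) >> 1) as u64;
    half.wrapping_add(num - 1)
}

fn main() {
    // Small inputs: both versions can be run and compared directly.
    for n in [0u64, 1, 2, 3, 10, 1_000] {
        println!("{} {} {}", n, sum_loop(n), sum_closed_form(n));
    }
    // The bound from the question: only the closed form is practical here.
    println!("{}", sum_closed_form(u64::MAX));
}
```

For num = u64::MAX the closed form yields 9223372036854775809, the same value the question's program prints, which is what the wrapping sum works out to modulo 2^64.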
clang generates exactly the same assembly (unsurprisingly). However, GCC doesn't seem to know about this optimization and generates pretty much the assembly you would expect (a loop).

The loop can also be written as (0..num).sum(). Despite this using more layers of abstraction (namely, iterators), the compiler generates exactly the same code as above.

To print a Duration in Rust, you can use the {:?} format specifier. println!("{:.2?}", d); prints the duration in the most fitting unit with a precision of 2. That's a fine way to print the time for almost all kinds of benchmarks.

Your code is stuck in an infinite loop. The comparison i < NUM_LOOP will always return true since int i will wrap around before reaching NUM_LOOP.
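The two Rust notes in the earlier answer, the iterator form (0..num).sum() and the {:.2?} specifier for Duration, can be tried together in a short sketch. The names sum_iter and format_elapsed are hypothetical, and the bound of 1_000_000 is an arbitrary choice that keeps the sum well inside u64 even in a debug build, where overflow would panic.

```rust
use std::time::{Duration, Instant};

// Iterator form of the summing loop.
fn sum_iter(num: u64) -> u64 {
    (0..num).sum()
}

// Formats a Duration in its most fitting unit with two decimal places.
fn format_elapsed(d: Duration) -> String {
    format!("{:.2?}", d)
}

fn main() {
    let now = Instant::now();
    let sum = sum_iter(1_000_000);
    let d = now.elapsed();
    println!("{}", sum);
    println!("loop: {}", format_elapsed(d));
}
```

The printed unit depends on how long the call took, so the exact timing line will vary between runs and optimization levels.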