What optimization techniques are applied to Rust code that sums a simple arithmetic sequence?

Question

The code is naive:

use std::time;

fn main() {
    const NUM_LOOP: u64 = std::u64::MAX;
    let mut sum = 0u64;
    let now = time::Instant::now();
    for i in 0..NUM_LOOP {
        sum += i;
    }
    let d = now.elapsed();
    println!("{}", sum);
    println!("loop: {}.{:09}s", d.as_secs(), d.subsec_nanos());
}

The output is:

$ ./test.rs.out
9223372036854775809
loop: 0.000000060s
$ ./test.rs.out
9223372036854775809
loop: 0.000000052s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s
$ ./test.rs.out
9223372036854775809
loop: 0.000000041s
$ ./test.rs.out
9223372036854775809
loop: 0.000000046s
$ ./test.rs.out
9223372036854775809
loop: 0.000000047s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s

The program ends almost immediately. I also wrote equivalent code in C using a for loop, but it ran for a long time. I'm wondering what makes the Rust code so fast.

The C code:

#include <stdint.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

double time_elapse(struct timespec start) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec - start.tv_sec +
           (now.tv_nsec - start.tv_nsec) / 1000000000.;
}

int main() {
    const uint64_t NUM_LOOP = 18446744073709551615u;
    uint64_t sum = 0;
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    for (int i = 0; i < NUM_LOOP; ++i) {
        sum += i;
    }

    double t = time_elapse(now);
    printf("value of sum is: %llu\n", sum);
    printf("time elapse is: %lf sec\n", t);

    return 0;
}

The Rust code is compiled with -O and the C code is compiled with -O3. The C code runs so slowly that it hasn't stopped yet.

After fixing the bug found by visibleman and Sandeep, both programs print the same number in almost no time. I tried reducing NUM_LOOP by one, and the results seemed reasonable given the overflow. Moreover, with NUM_LOOP = 1000000000, neither program overflows and both produce the correct answer in no time. What optimizations are used here? I know we can use a simple equation like (0 + NUM_LOOP - 1) * NUM_LOOP / 2 to compute the result, but I didn't think compilers would do such a computation when overflow is involved...

Answer 1

Since an int can never be as big as your NUM_LOOP, the program will loop forever.

const uint64_t NUM_LOOP = 18446744073709551615u;

for (int i = 0; i < NUM_LOOP; ++i) { // Change this to an uint64_t

If you fix the int bug, the compiler will optimize away these loops in both cases.

Answer 2

Your Rust code (without the prints and timing) compiles down to (on Godbolt):

movabs rax, -9223372036854775807
ret

LLVM just const-folds the whole function and calculates the final value for you.

Let's make the upper limit dynamic (non-constant) to avoid this aggressive constant folding:

pub fn foo(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum += i;
    }

    sum
}

This results in (Godbolt):

  test rdi, rdi            ; if num == 0
  je .LBB0_1               ; jump to .LBB0_1
  lea rax, [rdi - 1]       ; sum = num - 1
  lea rcx, [rdi - 2]       ; rcx = num - 2
  mul rcx                  ; sum = sum * rcx
  shld rdx, rax, 63        ; rdx = sum / 2
  lea rax, [rdx + rdi]     ; sum = rdx + num
  add rax, -1              ; sum -= 1
  ret
.LBB0_1:
  xor eax, eax             ; sum = 0
  ret

As you can see, the optimizer understood that you summed all numbers from 0 up to (but not including) num and replaced your loop with a constant-time formula: ((num - 1) * (num - 2)) / 2 + num - 1. As for the example above: the optimizer probably first optimized the code into this formula and then did constant folding.

Additional notes
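To make the overflow behaviour of that formula concrete, here is a minimal sketch that re-creates the assembly step by step in Rust and compares it with the plain loop. The names `loop_sum` and `closed_form` are made up for this illustration, and the `u128` intermediate is an assumption that mirrors the 128-bit result of `mul` (held in rdx:rax) before `shld` halves it. Note that ((num - 1) * (num - 2)) / 2 + num - 1 is algebraically the same as num * (num - 1) / 2.

```rust
/// Sums 0..num with a plain loop, wrapping on overflow as the release build does.
fn loop_sum(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum = sum.wrapping_add(i);
    }
    sum
}

/// Hypothetical re-creation of the generated assembly, one step per line.
fn closed_form(num: u64) -> u64 {
    if num == 0 {
        return 0; // test rdi, rdi; je .LBB0_1
    }
    let a = (num - 1) as u128; // lea rax, [rdi - 1]
    let b = num.wrapping_sub(2) as u128; // lea rcx, [rdi - 2] (wraps when num == 1)
    // mul rcx yields a 128-bit product; shld keeps the low 64 bits of product / 2.
    let half = ((a * b) >> 1) as u64;
    // lea rax, [rdx + rdi]; add rax, -1
    half.wrapping_add(num).wrapping_sub(1)
}

fn main() {
    // The two versions agree for every small input we try.
    for num in 0..1_000u64 {
        assert_eq!(loop_sum(num), closed_form(num));
    }
    // Same value as the question's output for NUM_LOOP = u64::MAX.
    println!("{}", closed_form(u64::MAX));
}
```

Because the division by two happens on the full 128-bit product, no bits are lost before the result is truncated to 64 bits, which is why the formula still matches the wrapping loop when the sum overflows.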

The two other answers already point out your bug in the C program. When fixed, clang generates exactly the same assembly (unsurprisingly). However, GCC doesn't seem to know about this optimization and generates pretty much the assembly you would expect (a loop).
In Rust, an easier and more idiomatic way to write your code would be (0..num).sum(). Despite using more layers of abstraction (namely, iterators), the compiler generates exactly the same code as above.
To print a Duration in Rust, you can use the {:?} format specifier. println!("{:.2?}", d); prints the duration in the most fitting unit with a precision of 2. That's a fine way to print the time for almost all kinds of benchmarks.
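Putting the last two notes together, a minimal sketch of the question's program might look like the following. The helper name `iter_sum` and the bound of 1,000,000 are chosen for this illustration only; the bound is kept small because, in builds with debug assertions enabled, `sum()` panics on integer overflow instead of wrapping.

```rust
use std::time::{Duration, Instant};

/// Iterator version of the loop; `iter_sum` is a name made up for this sketch.
fn iter_sum(num: u64) -> u64 {
    (0..num).sum()
}

fn main() {
    let now = Instant::now();
    let sum = iter_sum(1_000_000);
    let d: Duration = now.elapsed();
    println!("{}", sum);
    // The Debug format picks a fitting unit (s, ms, µs or ns); `.2` sets the precision.
    println!("loop: {:.2?}", d);
}
```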

Answer 3

Your code is stuck in an infinite loop.

The comparison i < NUM_LOOP will always return true, since int i will wrap around before reaching NUM_LOOP.
