
Why is there a large performance impact when looping over an array with 240 or more elements?

When running a sum loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.

Is there a special compilation optimization Rust performs for "short" arrays?

Compiled with rustc -C opt-level=3.

use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }
    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }
    println!("sum:{} time:{:?}", sum, now.elapsed());
}

Summary: below 240, LLVM fully unrolls the inner loop, and that lets it notice that it can optimize away the repeat loop, breaking your benchmark.



You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, so the length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization, performed only for length 239, is responsible for the huge speed difference. But let's start slowly:

(All code in this answer is compiled with -C opt-level=3)

pub fn foo() -> usize {
    let arr = [0; 240];
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    s
}

This simple code will produce roughly the assembly one would expect: a loop adding up the elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on the Godbolt Compiler Explorer. Here is a small part of the assembly:

movdqa  xmm1, xmmword ptr [rsp + 32]
movdqa  xmm0, xmmword ptr [rsp + 48]
paddq   xmm1, xmmword ptr [rsp]
paddq   xmm0, xmmword ptr [rsp + 16]
paddq   xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq   xmm0, xmmword ptr [rsp + 1840]
paddq   xmm1, xmmword ptr [rsp + 1856]
paddq   xmm0, xmmword ptr [rsp + 1872]
paddq   xmm0, xmm1
pshufd  xmm1, xmm0, 78
paddq   xmm1, xmm0

This is what's called loop unrolling: LLVM pastes the loop body several times to avoid having to execute all those "loop management" instructions, i.e. incrementing the loop variable, checking whether the loop has ended, and jumping back to the start of the loop.

In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that the CPU's instruction-level parallelism can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.
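To make the multiple-accumulator idea concrete, here is a hand-written Rust sketch of the same pattern in source form. This is not the compiler's output, and the function name is mine; real compilers do this with SIMD registers rather than scalar variables, but the dependency structure is the same:

```rust
// Sketch: summing with two independent accumulators, mirroring
// what the SIMD code does with xmm0 and xmm1.
fn sum_two_accumulators(arr: &[usize]) -> usize {
    let (mut s0, mut s1) = (0, 0);
    let mut chunks = arr.chunks_exact(2);
    for pair in &mut chunks {
        s0 += pair[0]; // accumulator 0 (think xmm0)
        s1 += pair[1]; // accumulator 1 (think xmm1)
    }
    // combine both accumulators plus a possible leftover element
    s0 + s1 + chunks.remainder().iter().sum::<usize>()
}

fn main() {
    let arr: Vec<usize> = (0..239).collect();
    assert_eq!(sum_two_accumulators(&arr), arr.iter().sum::<usize>());
    println!("{}", sum_two_accumulators(&arr));
}
```

Because s0 and s1 have no data dependency on each other within an iteration, the two additions can retire in the same cycle, which is exactly the point of using two registers above.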

Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.

LLVM does unroll small loops somewhat when it's not fully unrolling, and it still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.


But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:

const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    sum
}

(On the Godbolt Compiler Explorer)

The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initialization, which we will ignore.) For 239, however, it looks very different! We see that the initialization loop and the inner loop got unrolled: so far, as expected.

The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!

First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I added comments to explain the assembly; the comments with * are especially important):

        ; at the start of the function, `rbx` was set to 0

        movq    rax, xmm1     ; result of SIMD summing up stored in `rax`
        add     rax, 711      ; add up missing terms from loop unrolling
        mov     ecx, 500000   ; * init loop variable outer loop
.LBB0_1:
        add     rbx, rax      ; * rbx += rax
        add     rcx, -1       ; * decrement loop variable
        jne     .LBB0_1       ; * if loop variable != 0 jump to LBB0_1
        mov     rax, rbx      ; move rbx (the sum) back to rax
        ; two unimportant instructions omitted
        ret                   ; the return value is stored in `rax`

As you can see here, the result of the inner loop is taken, added up as many times as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.

This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
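In Rust source form, the transformation LLVM performed is roughly equivalent to the following hand-written sketch. (To be precise: the actual assembly keeps a repeated-add loop, as shown above, rather than emitting a multiply, but the effect on runtime is the same.)

```rust
const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0usize; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }
    // The inner loop is executed only once ...
    let s: usize = arr.iter().sum();
    // ... and the outer loop collapses to a single repeated addition,
    // matching the `rbx += rax` loop in the assembly.
    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        sum += s; // this is all that remains of the outer loop body
    }
    assert_eq!(sum, s * IN_LOOPS);
    println!("{}", sum);
}
```

Summing CAPACITY elements once plus IN_LOOPS trivial additions is where the CAPACITY + IN_LOOPS runtime comes from.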


An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds, as without them LLVM optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even a factor of 2 in these cases. So no need to worry about real use cases.

As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to any notable differences in the emitted assembly. You should use the short and idiomatic version unless you have measured that it hurts performance.
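For illustration, here is the iterator version of the inner sum as a small stand-alone sketch (the array construction via std::array::from_fn is my choice here, just to keep the example self-contained):

```rust
fn main() {
    // same contents as the benchmark array: arr[i] = i
    let arr: [usize; 240] = std::array::from_fn(|i| i);
    // idiomatic replacement for the index-based inner loop
    let s: usize = arr.iter().sum();
    assert_eq!(s, 240 * 239 / 2); // Gauss: 0 + 1 + ... + 239
    println!("{}", s);
}
```

The iterator version also avoids the bounds check that arr[i] implies in source form (the optimizer removes it in both variants here, but iterators make that guarantee structural).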

In addition to Lukas' answer, if you want to use an iterator, try this:

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize {
    (0..CAPACITY).sum::<usize>() * IN_LOOPS
}

Thanks @Chris Morgan for the suggestion about the range pattern.

The optimized assembly is quite good:

example::bar:
        movabs  rax, 14340000000
        ret
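That constant can be sanity-checked by hand: LLVM folded the whole function down to the closed form. A quick check with the Gauss formula spelled out (assuming a 64-bit usize, since the result exceeds 2^32):

```rust
fn main() {
    const CAPACITY: usize = 240;
    const IN_LOOPS: usize = 500000;
    // Gauss: 0 + 1 + ... + (CAPACITY - 1) = CAPACITY * (CAPACITY - 1) / 2
    let inner_sum = CAPACITY * (CAPACITY - 1) / 2; // 28680
    assert_eq!(inner_sum * IN_LOOPS, 14_340_000_000);
    // matches the iterator formulation from the answer
    assert_eq!(inner_sum * IN_LOOPS, (0..CAPACITY).sum::<usize>() * IN_LOOPS);
    println!("{}", inner_sum * IN_LOOPS);
}
```

So the movabs rax, 14340000000 is exactly 28680 * 500000, computed entirely at compile time.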
