Odd results when evaluating benchmark examples from the Rust Book

Question

Benchmark tests from the Rust Book

I am getting odd results from microbenchmarks that can be reproduced with the example bench_xor_1000_ints from the Rust Book chapter on Benchmark tests.

The section Gotcha: optimizations gives general recommendations to return a value from the Bencher::iter closure and/or to use the black_box function. I ended up with these five variants:

Benchmark code

#![feature(test)]

extern crate test;

use test::{Bencher, black_box};

fn xor(x: i32, y: i32) -> i32 { x ^ y }

#[bench]
fn xor_closure_a(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, |a, b| a ^ b)
    });
}

#[bench]
fn xor_closure_b(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, |a, b| a ^ b);
    });
}

#[bench]
fn xor_pointer_a(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, xor)
    });
}

#[bench]
fn xor_pointer_b(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, xor);
    });
}

#[bench] // closure/pointer or semicolon doesn't affect results
fn xor_black_box(b: &mut Bencher) {
    b.iter(|| {
        black_box((0..1000).fold(0, xor));
    });
}

Benchmark naming

prefix xor_closure_ : pass closure |a, b| a ^ b
prefix xor_pointer_ : pass function pointer xor
suffix _a : b.iter closure returns value (note lack of semicolon)
suffix _b : b.iter closure doesn't return value (ends with semicolon)
xor_black_box : throw everything into black_box and hope for the best
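
To make the naming concrete: the only difference between the xor_closure_ and xor_pointer_ variants is what gets passed to fold. A minimal sketch on stable Rust (the helper names fold_with_closure and fold_with_pointer are made up for illustration and are not part of the benchmark) suggests that both forms compute the same value:

```rust
// Named function; passing `xor` to `fold` mirrors the xor_pointer_* variants.
fn xor(x: i32, y: i32) -> i32 {
    x ^ y
}

// Hypothetical helper: folds with an inline closure, as in xor_closure_*.
fn fold_with_closure(n: i32) -> i32 {
    (0..n).fold(0, |a, b| a ^ b)
}

// Hypothetical helper: folds with the named function, as in xor_pointer_*.
fn fold_with_pointer(n: i32) -> i32 {
    (0..n).fold(0, xor)
}
```

For n = 1000 both helpers return 0, because every aligned group of four consecutive integers (4k, 4k+1, 4k+2, 4k+3) XORs to zero. The work being measured is therefore the same in both families; only the way the optimizer treats it may differ.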

Benchmark results

test tests::xor_black_box ... bench:          69 ns/iter (+/- 1)
test tests::xor_closure_a ... bench:          70 ns/iter (+/- 1)
test tests::xor_closure_b ... bench:         921 ns/iter (+/- 5)
test tests::xor_pointer_a ... bench:          60 ns/iter (+/- 1)
test tests::xor_pointer_b ... bench:           0 ns/iter (+/- 0)

Some observations regarding measurements:

Removing the black-boxed iteration counter let n = black_box(1000) and inlining it into the range, (0..1000).fold ..., doesn't affect any of the results
Scaling the iteration counter n scales the measurements accordingly (except for xor_pointer_b, which is optimized to 0)

On the whole, the results agree with the general recommendations from the Rust Book, except for the xor_closure_b benchmark.

More specifically, the results with the suffix _a (those that return a value) seem to agree with the black-boxed xor_black_box, which sounds good to me. And xor_pointer_b being optimized to 0 seems legit, since its closure doesn't return the value. But the result of xor_closure_b is odd.
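
The difference between the _a and _b suffixes is easier to see with a model of what Bencher::iter appears to do: it times a loop that calls the closure and passes whatever the closure returns through black_box. The sketch below is a stand-in built on that assumption, not libtest's actual code; iter_sketch is a hypothetical name, and it uses std::hint::black_box from a recent stable toolchain instead of the nightly test::black_box.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical stand-in for Bencher::iter: runs `inner` `iterations` times
// and returns the elapsed wall-clock time.
fn iter_sketch<T, F>(iterations: u64, mut inner: F) -> Duration
where
    F: FnMut() -> T,
{
    let start = Instant::now();
    for _ in 0..iterations {
        // The closure's return value is hidden from the optimizer here.
        // If the closure ends with a semicolon, T is `()` and nothing
        // computed inside the closure is kept alive by this call.
        black_box(inner());
    }
    start.elapsed()
}
```

With iter_sketch(n, || (0..1000).fold(0, |a, b| a ^ b)) the fold result reaches black_box; with a trailing semicolon only () does, so the compiler is free to delete the fold entirely.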

Assembly code may clarify this question. The question How can I prevent the Rust benchmark library from optimizing away my code? gives a good overview of reading assembly from Rust benchmark tests.

cargo rustc --release -- --emit asm

produces this output:

    .text
    .file   "fpinrust.cgu-0.rs"
    .section    .rodata.cst16,"aM",@progbits,16
    .p2align    4
.LCPI0_0:
    .long   0
    .long   1
    .long   2
    .long   3
.LCPI0_1:
    .long   4
    .long   4
    .long   4
    .long   4
.LCPI0_2:
    .long   8
    .long   8
    .long   8
    .long   8
.LCPI0_3:
    .long   12
    .long   12
    .long   12
    .long   12
.LCPI0_4:
    .long   16
    .long   16
    .long   16
    .long   16
.LCPI0_5:
    .long   20
    .long   20
    .long   20
    .long   20
.LCPI0_6:
    .long   24
    .long   24
    .long   24
    .long   24
.LCPI0_7:
    .long   28
    .long   28
    .long   28
    .long   28
.LCPI0_8:
    .long   32
    .long   32
    .long   32
    .long   32
    .section    .text._ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E,@function
_ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E:
    .cfi_startproc
    pushq   %r14
.Ltmp0:
    .cfi_def_cfa_offset 16
    pushq   %rbx
.Ltmp1:
    .cfi_def_cfa_offset 24
    subq    $40, %rsp
.Ltmp2:
    .cfi_def_cfa_offset 64
.Ltmp3:
    .cfi_offset %rbx, -24
.Ltmp4:
    .cfi_offset %r14, -16
    movq    %rdi, %r14
    leaq    24(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%r14), %r9
    testq   %r9, %r9
    je  .LBB0_15
    xorl    %ecx, %ecx
    leaq    8(%rsp), %r8
    movdqa  .LCPI0_0(%rip), %xmm8
    movdqa  .LCPI0_1(%rip), %xmm15
    movdqa  .LCPI0_2(%rip), %xmm2
    movdqa  .LCPI0_3(%rip), %xmm9
    movdqa  .LCPI0_4(%rip), %xmm10
    movdqa  .LCPI0_5(%rip), %xmm11
    movdqa  .LCPI0_6(%rip), %xmm12
    movdqa  .LCPI0_7(%rip), %xmm13
    movdqa  .LCPI0_8(%rip), %xmm14
    .p2align    4, 0x90
.LBB0_2:
    incq    %rcx
    movl    $1000, 8(%rsp)
    #APP
    #NO_APP
    movl    8(%rsp), %esi
    testl   %esi, %esi
    movl    $0, %edx
    jle .LBB0_14
    xorl    %edx, %edx
    cmpl    $8, %esi
    jae .LBB0_5
    xorl    %edi, %edi
    jmp .LBB0_13
    .p2align    4, 0x90
.LBB0_5:
    movl    %esi, %eax
    andl    $-8, %eax
    movl    $0, %edi
    je  .LBB0_13
    leal    -8(%rax), %edx
    movl    %edx, %edi
    shrl    $3, %edi
    leal    1(%rdi), %ebx
    andl    $3, %ebx
    pxor    %xmm3, %xmm3
    cmpl    $24, %edx
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    jb  .LBB0_9
    leal    -1(%rbx), %edx
    subl    %edi, %edx
    pxor    %xmm3, %xmm3
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    .p2align    4, 0x90
.LBB0_8:
    movdqa  %xmm5, %xmm6
    paddd   %xmm15, %xmm6
    movdqa  %xmm5, %xmm7
    paddd   %xmm2, %xmm7
    pxor    %xmm5, %xmm3
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm4
    paddd   %xmm9, %xmm4
    movdqa  %xmm5, %xmm0
    paddd   %xmm10, %xmm0
    pxor    %xmm7, %xmm3
    movdqa  %xmm5, %xmm7
    paddd   %xmm11, %xmm7
    movdqa  %xmm5, %xmm1
    paddd   %xmm12, %xmm1
    pxor    %xmm4, %xmm7
    pxor    %xmm6, %xmm7
    movdqa  %xmm5, %xmm4
    paddd   %xmm13, %xmm4
    paddd   %xmm14, %xmm5
    pxor    %xmm0, %xmm1
    pxor    %xmm1, %xmm3
    pxor    %xmm7, %xmm4
    addl    $4, %edx
    jne .LBB0_8
.LBB0_9:
    testl   %ebx, %ebx
    je  .LBB0_12
    negl    %ebx
    .p2align    4, 0x90
.LBB0_11:
    movdqa  %xmm5, %xmm0
    pxor    %xmm5, %xmm3
    paddd   %xmm15, %xmm5
    paddd   %xmm2, %xmm0
    pxor    %xmm5, %xmm4
    incl    %ebx
    movdqa  %xmm0, %xmm5
    jne .LBB0_11
.LBB0_12:
    pxor    %xmm4, %xmm3
    pshufd  $78, %xmm3, %xmm0
    pxor    %xmm3, %xmm0
    pshufd  $229, %xmm0, %xmm1
    pxor    %xmm0, %xmm1
    movd    %xmm1, %edx
    cmpl    %eax, %esi
    movl    %eax, %edi
    je  .LBB0_14
    .p2align    4, 0x90
.LBB0_13:
    xorl    %edi, %edx
    leal    1(%rdi), %eax
    cmpl    %eax, %esi
    movl    %eax, %edi
    jne .LBB0_13
.LBB0_14:
    movl    %edx, 8(%rsp)
    #APP
    #NO_APP
    cmpq    %r9, %rcx
    jne .LBB0_2
.LBB0_15:
    leaq    8(%rsp), %rdi
    leaq    24(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    8(%rsp), %rax
    movl    16(%rsp), %ecx
    movq    %rax, 8(%r14)
    movl    %ecx, 16(%r14)
    addq    $40, %rsp
    popq    %rbx
    popq    %r14
    retq
.Lfunc_end0:
    .size   _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E, .Lfunc_end0-_ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E
    .cfi_endproc

    .section    .text._ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E,@function
_ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E:
    .cfi_startproc
    pushq   %r14
.Ltmp5:
    .cfi_def_cfa_offset 16
    pushq   %rbx
.Ltmp6:
    .cfi_def_cfa_offset 24
    subq    $40, %rsp
.Ltmp7:
    .cfi_def_cfa_offset 64
.Ltmp8:
    .cfi_offset %rbx, -24
.Ltmp9:
    .cfi_offset %r14, -16
    movq    %rdi, %r14
    leaq    24(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%r14), %rax
    testq   %rax, %rax
    je  .LBB1_5
    xorl    %ecx, %ecx
    leaq    8(%rsp), %rdx
    .p2align    4, 0x90
.LBB1_2:
    movl    $1000, 8(%rsp)
    #APP
    #NO_APP
    movl    8(%rsp), %esi
    xorl    %ebx, %ebx
    .p2align    4, 0x90
.LBB1_3:
    xorl    %edi, %edi
    cmpl    %esi, %ebx
    setl    %dil
    addl    %ebx, %edi
    cmpl    %esi, %ebx
    movl    %edi, %ebx
    jl  .LBB1_3
    incq    %rcx
    #APP
    #NO_APP
    cmpq    %rax, %rcx
    jne .LBB1_2
.LBB1_5:
    leaq    8(%rsp), %rdi
    leaq    24(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    8(%rsp), %rax
    movl    16(%rsp), %ecx
    movq    %rax, 8(%r14)
    movl    %ecx, 16(%r14)
    addq    $40, %rsp
    popq    %rbx
    popq    %r14
    retq
.Lfunc_end1:
    .size   _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E, .Lfunc_end1-_ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E
    .cfi_endproc

    .section    .rodata.cst16,"aM",@progbits,16
    .p2align    4
.LCPI2_0:
    .long   0
    .long   1
    .long   2
    .long   3
.LCPI2_1:
    .long   4
    .long   4
    .long   4
    .long   4
.LCPI2_2:
    .long   8
    .long   8
    .long   8
    .long   8
.LCPI2_3:
    .long   12
    .long   12
    .long   12
    .long   12
.LCPI2_4:
    .long   16
    .long   16
    .long   16
    .long   16
.LCPI2_5:
    .long   20
    .long   20
    .long   20
    .long   20
.LCPI2_6:
    .long   24
    .long   24
    .long   24
    .long   24
.LCPI2_7:
    .long   28
    .long   28
    .long   28
    .long   28
.LCPI2_8:
    .long   32
    .long   32
    .long   32
    .long   32
    .section    .text._ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E,@function
_ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E:
    .cfi_startproc
    pushq   %r14
.Ltmp10:
    .cfi_def_cfa_offset 16
    pushq   %rbx
.Ltmp11:
    .cfi_def_cfa_offset 24
    subq    $40, %rsp
.Ltmp12:
    .cfi_def_cfa_offset 64
.Ltmp13:
    .cfi_offset %rbx, -24
.Ltmp14:
    .cfi_offset %r14, -16
    movq    %rdi, %r14
    leaq    24(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%r14), %r9
    testq   %r9, %r9
    je  .LBB2_15
    xorl    %ecx, %ecx
    leaq    8(%rsp), %r8
    movdqa  .LCPI2_0(%rip), %xmm8
    movdqa  .LCPI2_1(%rip), %xmm15
    movdqa  .LCPI2_2(%rip), %xmm2
    movdqa  .LCPI2_3(%rip), %xmm9
    movdqa  .LCPI2_4(%rip), %xmm10
    movdqa  .LCPI2_5(%rip), %xmm11
    movdqa  .LCPI2_6(%rip), %xmm12
    movdqa  .LCPI2_7(%rip), %xmm13
    movdqa  .LCPI2_8(%rip), %xmm14
    .p2align    4, 0x90
.LBB2_2:
    incq    %rcx
    movl    $1000, 8(%rsp)
    #APP
    #NO_APP
    movl    8(%rsp), %esi
    testl   %esi, %esi
    movl    $0, %edx
    jle .LBB2_14
    xorl    %edx, %edx
    cmpl    $8, %esi
    jae .LBB2_5
    xorl    %edi, %edi
    jmp .LBB2_13
    .p2align    4, 0x90
.LBB2_5:
    movl    %esi, %eax
    andl    $-8, %eax
    movl    $0, %edi
    je  .LBB2_13
    leal    -8(%rax), %edx
    movl    %edx, %edi
    shrl    $3, %edi
    leal    1(%rdi), %ebx
    andl    $3, %ebx
    pxor    %xmm3, %xmm3
    cmpl    $24, %edx
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    jb  .LBB2_9
    leal    -1(%rbx), %edx
    subl    %edi, %edx
    pxor    %xmm3, %xmm3
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    .p2align    4, 0x90
.LBB2_8:
    movdqa  %xmm5, %xmm6
    paddd   %xmm15, %xmm6
    movdqa  %xmm5, %xmm7
    paddd   %xmm2, %xmm7
    pxor    %xmm5, %xmm3
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm4
    paddd   %xmm9, %xmm4
    movdqa  %xmm5, %xmm0
    paddd   %xmm10, %xmm0
    pxor    %xmm7, %xmm3
    movdqa  %xmm5, %xmm7
    paddd   %xmm11, %xmm7
    movdqa  %xmm5, %xmm1
    paddd   %xmm12, %xmm1
    pxor    %xmm4, %xmm7
    pxor    %xmm6, %xmm7
    movdqa  %xmm5, %xmm4
    paddd   %xmm13, %xmm4
    paddd   %xmm14, %xmm5
    pxor    %xmm0, %xmm1
    pxor    %xmm1, %xmm3
    pxor    %xmm7, %xmm4
    addl    $4, %edx
    jne .LBB2_8
.LBB2_9:
    testl   %ebx, %ebx
    je  .LBB2_12
    negl    %ebx
    .p2align    4, 0x90
.LBB2_11:
    movdqa  %xmm5, %xmm0
    pxor    %xmm5, %xmm3
    paddd   %xmm15, %xmm5
    paddd   %xmm2, %xmm0
    pxor    %xmm5, %xmm4
    incl    %ebx
    movdqa  %xmm0, %xmm5
    jne .LBB2_11
.LBB2_12:
    pxor    %xmm4, %xmm3
    pshufd  $78, %xmm3, %xmm0
    pxor    %xmm3, %xmm0
    pshufd  $229, %xmm0, %xmm1
    pxor    %xmm0, %xmm1
    movd    %xmm1, %edx
    cmpl    %eax, %esi
    movl    %eax, %edi
    je  .LBB2_14
    .p2align    4, 0x90
.LBB2_13:
    xorl    %edi, %edx
    leal    1(%rdi), %eax
    cmpl    %eax, %esi
    movl    %eax, %edi
    jne .LBB2_13
.LBB2_14:
    movl    %edx, 8(%rsp)
    #APP
    #NO_APP
    cmpq    %r9, %rcx
    jne .LBB2_2
.LBB2_15:
    leaq    8(%rsp), %rdi
    leaq    24(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    8(%rsp), %rax
    movl    16(%rsp), %ecx
    movq    %rax, 8(%r14)
    movl    %ecx, 16(%r14)
    addq    $40, %rsp
    popq    %rbx
    popq    %r14
    retq
.Lfunc_end2:
    .size   _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E, .Lfunc_end2-_ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E
    .cfi_endproc

    .section    .text._ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E,@function
_ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E:
    .cfi_startproc
    pushq   %rbx
.Ltmp15:
    .cfi_def_cfa_offset 16
    subq    $32, %rsp
.Ltmp16:
    .cfi_def_cfa_offset 48
.Ltmp17:
    .cfi_offset %rbx, -16
    movq    %rdi, %rbx
    leaq    16(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%rbx), %rax
    testq   %rax, %rax
    je  .LBB3_3
    leaq    (%rsp), %rcx
    .p2align    4, 0x90
.LBB3_2:
    movl    $1000, (%rsp)
    #APP
    #NO_APP
    #APP
    #NO_APP
    decq    %rax
    jne .LBB3_2
.LBB3_3:
    leaq    (%rsp), %rdi
    leaq    16(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    (%rsp), %rax
    movl    8(%rsp), %ecx
    movq    %rax, 8(%rbx)
    movl    %ecx, 16(%rbx)
    addq    $32, %rsp
    popq    %rbx
    retq
.Lfunc_end3:
    .size   _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E, .Lfunc_end3-_ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E
    .cfi_endproc

    .section    .rodata.cst16,"aM",@progbits,16
    .p2align    4
.LCPI4_0:
    .long   0
    .long   1
    .long   2
    .long   3
.LCPI4_1:
    .long   4
    .long   4
    .long   4
    .long   4
.LCPI4_2:
    .long   8
    .long   8
    .long   8
    .long   8
.LCPI4_3:
    .long   12
    .long   12
    .long   12
    .long   12
.LCPI4_4:
    .long   16
    .long   16
    .long   16
    .long   16
.LCPI4_5:
    .long   20
    .long   20
    .long   20
    .long   20
.LCPI4_6:
    .long   24
    .long   24
    .long   24
    .long   24
.LCPI4_7:
    .long   28
    .long   28
    .long   28
    .long   28
.LCPI4_8:
    .long   32
    .long   32
    .long   32
    .long   32
.LCPI4_9:
    .long   36
    .long   36
    .long   36
    .long   36
.LCPI4_10:
    .long   40
    .long   40
    .long   40
    .long   40
    .section    .text._ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E,@function
_ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E:
    .cfi_startproc
    pushq   %rbx
.Ltmp18:
    .cfi_def_cfa_offset 16
    subq    $32, %rsp
.Ltmp19:
    .cfi_def_cfa_offset 48
.Ltmp20:
    .cfi_offset %rbx, -16
    movq    %rdi, %rbx
    leaq    16(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%rbx), %rax
    testq   %rax, %rax
    je  .LBB4_5
    xorl    %ecx, %ecx
    movdqa  .LCPI4_1(%rip), %xmm9
    movdqa  .LCPI4_2(%rip), %xmm10
    movdqa  .LCPI4_3(%rip), %xmm11
    movdqa  .LCPI4_4(%rip), %xmm12
    movdqa  .LCPI4_5(%rip), %xmm13
    movdqa  .LCPI4_6(%rip), %xmm14
    movdqa  .LCPI4_7(%rip), %xmm15
    movdqa  .LCPI4_8(%rip), %xmm0
    movdqa  .LCPI4_9(%rip), %xmm1
    movdqa  .LCPI4_10(%rip), %xmm2
    leaq    (%rsp), %rdx
    .p2align    4, 0x90
.LBB4_2:
    pxor    %xmm3, %xmm3
    movl    $1000, %esi
    pxor    %xmm4, %xmm4
    movdqa  .LCPI4_0(%rip), %xmm5
    .p2align    4, 0x90
.LBB4_3:
    movdqa  %xmm5, %xmm6
    paddd   %xmm9, %xmm6
    movdqa  %xmm5, %xmm7
    paddd   %xmm10, %xmm7
    pxor    %xmm5, %xmm3
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm4
    paddd   %xmm11, %xmm4
    pxor    %xmm7, %xmm3
    movdqa  %xmm5, %xmm7
    paddd   %xmm13, %xmm7
    pxor    %xmm4, %xmm7
    movdqa  %xmm5, %xmm4
    paddd   %xmm12, %xmm4
    pxor    %xmm6, %xmm7
    movdqa  %xmm5, %xmm6
    paddd   %xmm14, %xmm6
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm8
    paddd   %xmm15, %xmm8
    pxor    %xmm6, %xmm3
    movdqa  %xmm5, %xmm4
    paddd   %xmm0, %xmm4
    pxor    %xmm4, %xmm3
    movdqa  %xmm5, %xmm4
    paddd   %xmm1, %xmm4
    pxor    %xmm8, %xmm4
    pxor    %xmm7, %xmm4
    paddd   %xmm2, %xmm5
    addl    $-40, %esi
    jne .LBB4_3
    pxor    %xmm3, %xmm4
    pshufd  $78, %xmm4, %xmm3
    pxor    %xmm4, %xmm3
    pshufd  $229, %xmm3, %xmm4
    pxor    %xmm3, %xmm4
    incq    %rcx
    movd    %xmm4, (%rsp)
    #APP
    #NO_APP
    #APP
    #NO_APP
    cmpq    %rax, %rcx
    jne .LBB4_2
.LBB4_5:
    leaq    (%rsp), %rdi
    leaq    16(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    (%rsp), %rax
    movl    8(%rsp), %ecx
    movq    %rax, 8(%rbx)
    movl    %ecx, 16(%rbx)
    addq    $32, %rsp
    popq    %rbx
    retq
.Lfunc_end4:
    .size   _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E, .Lfunc_end4-_ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E
    .cfi_endproc


    .section    ".note.GNU-stack","",@progbits

Looking at the assembly, xor_closure_a, xor_pointer_a and xor_black_box, which all have roughly the same performance, compile to similar assembly code. That is, by the way, a lot more instructions than the amount of ASM for poor xor_closure_b.

This is where my research ends. I would be glad if someone could explain why this blanket of assembly produced by xor_black_box runs faster than xor_closure_b. Or, to put it differently: why does the version that should be optimized away by the compiler run notably slower, and which benchmark variant should I trust?

Versions

rustc --version
rustc 1.13.0-nightly (378195665 2016-09-08)

cargo --version
cargo 0.13.0-nightly (afaffa1 2016-09-06)

Update

@Francis Gagné did a fantastic job investigating the issue in his answer. The steps I took to reproduce his results are listed below.

Compile benchmarks and emit assembly:

cargo rustc --release -- --test --emit asm

Run the produced executable to verify the previous behavior:

./target/release/deps/xor --bench

running 2 tests
test tests::xor_closure_b ... bench:         925 ns/iter (+/- 9)
test tests::xor_pointer_b ... bench:           0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

Compile the patched assembly output ./target/release/deps/xor.s:

export RUSTLIB=/path/to/lib/rustlib
gcc target/release/deps/xor.s $RUSTLIB/x86_64-unknown-linux-gnu/lib/*.rlib $RUSTLIB/x86_64-unknown-linux-gnu/lib/*.so -pthread -lpthread -lm -ldl

Run updated benchmarks:

./a.out --bench

running 2 tests
test tests::xor_closure_b ... bench:           1 ns/iter (+/- 0)
test tests::xor_pointer_b ... bench:           0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

Answer 1

I'll focus on comparing xor_closure_b and xor_pointer_b, since they should have similar performance (i.e. they should both do nothing).

Update: I made a mistake in my initial analysis, as pointed out by @EOF, so I revised the text below.

First, let's look at the LLVM IR generated for those two functions. (I find LLVM IR easier to read than ASM because it's more structured.)

; Function Attrs: uwtable
define internal void @_ZN3xor13xor_closure_b17hb13913a8d2a27b06E(%"11.test::Bencher"* nocapture dereferenceable(32)) unnamed_addr #0 personality i32 (i32, i32, i64, %"8.unwind::libunwind::_Unwind_Exception"*, %"8.unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
  %dummy.i.i = alloca {}, align 8
  %dummy.i.i.i = alloca i32, align 4
  %start1.i = alloca %"1.std::time::Instant", align 8
  %tmp_ret2.i = alloca %"1.std::time::Duration", align 8
  %1 = bitcast %"1.std::time::Duration"* %tmp_ret2.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %1)
  %2 = bitcast %"1.std::time::Instant"* %start1.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %2)
  call void @_ZN3std4time7Instant3now17h37bccd496c61083dE(%"1.std::time::Instant"* noalias nocapture nonnull sret dereferenceable(16) %start1.i)
  %3 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 0
  %4 = load i64, i64* %3, align 8
  %5 = icmp eq i64 %4, 0
  br i1 %5, label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit, label %bb7.lr.ph.i

bb7.lr.ph.i:                                      ; preds = %entry-block
  %6 = bitcast i32* %dummy.i.i.i to i8*
  %7 = bitcast {}* %dummy.i.i to i8*
  br label %bb7.i

bb7.i:                                            ; preds = %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i", %bb7.lr.ph.i
  %iter.sroa.0.019.i = phi i64 [ 0, %bb7.lr.ph.i ], [ %11, %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i" ]
  call void @llvm.lifetime.start(i64 4, i8* %6) #2
  store i32 1000, i32* %dummy.i.i.i, align 4
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* nonnull %dummy.i.i.i) #2, !srcloc !1
  %8 = load i32, i32* %dummy.i.i.i, align 4
  call void @llvm.lifetime.end(i64 4, i8* %6) #2
  br label %bb7.i.i.i

bb7.i.i.i:                                        ; preds = %bb7.i.i.i, %bb7.i
  %iter.sroa.0.0.i.i.i = phi i32 [ 0, %bb7.i ], [ %iter.sroa.0.1.i.i.i, %bb7.i.i.i ]
  %9 = icmp slt i32 %iter.sroa.0.0.i.i.i, %8
  %10 = zext i1 %9 to i32
  %iter.sroa.0.1.i.i.i = add i32 %10, %iter.sroa.0.0.i.i.i
  br i1 %9, label %bb7.i.i.i, label %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"

"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i": ; preds = %bb7.i.i.i
  %11 = add nuw i64 %iter.sroa.0.019.i, 1
  call void @llvm.lifetime.start(i64 0, i8* %7)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"({}* nonnull %dummy.i.i) #2, !srcloc !1
  call void @llvm.lifetime.end(i64 0, i8* %7)
  %exitcond.i = icmp eq i64 %11, %4
  br i1 %exitcond.i, label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit, label %bb7.i

_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit: ; preds = %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"
  br label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit

_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit:   ; preds = %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit, %entry-block
  call void @_ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE(%"1.std::time::Duration"* noalias nocapture nonnull sret dereferenceable(16) %tmp_ret2.i, %"1.std::time::Instant"* noalias nonnull readonly dereferenceable(16) %start1.i)
  %12 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 0
  %13 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 1
  %14 = load i64, i64* %12, align 8
  %15 = load i32, i32* %13, align 8
  %16 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 0
  store i64 %14, i64* %16, align 8
  %17 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 1
  store i32 %15, i32* %17, align 4
  call void @llvm.lifetime.end(i64 16, i8* %2)
  call void @llvm.lifetime.end(i64 16, i8* %1)
  ret void
}

; Function Attrs: uwtable
define internal void @_ZN3xor13xor_pointer_b17h7ba0f9760d9fd9f8E(%"11.test::Bencher"* nocapture dereferenceable(32)) unnamed_addr #0 personality i32 (i32, i32, i64, %"8.unwind::libunwind::_Unwind_Exception"*, %"8.unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
  %dummy.i.i = alloca {}, align 8
  %dummy.i.i.i = alloca i32, align 4
  %start1.i = alloca %"1.std::time::Instant", align 8
  %tmp_ret2.i = alloca %"1.std::time::Duration", align 8
  %1 = bitcast %"1.std::time::Duration"* %tmp_ret2.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %1)
  %2 = bitcast %"1.std::time::Instant"* %start1.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %2)
  call void @_ZN3std4time7Instant3now17h37bccd496c61083dE(%"1.std::time::Instant"* noalias nocapture nonnull sret dereferenceable(16) %start1.i)
  %3 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 0
  %4 = load i64, i64* %3, align 8
  %5 = icmp eq i64 %4, 0
  br i1 %5, label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit, label %bb7.lr.ph.i

bb7.lr.ph.i:                                      ; preds = %entry-block
  %6 = bitcast i32* %dummy.i.i.i to i8*
  %7 = bitcast {}* %dummy.i.i to i8*
  br label %bb7.i

bb7.i:                                            ; preds = %bb7.i, %bb7.lr.ph.i
  %iter.sroa.0.019.i = phi i64 [ 0, %bb7.lr.ph.i ], [ %8, %bb7.i ]
  %8 = add nuw i64 %iter.sroa.0.019.i, 1
  call void @llvm.lifetime.start(i64 4, i8* %6) #2
  store i32 1000, i32* %dummy.i.i.i, align 4
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* nonnull %dummy.i.i.i) #2, !srcloc !1
  call void @llvm.lifetime.end(i64 4, i8* %6) #2
  call void @llvm.lifetime.start(i64 0, i8* %7)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"({}* nonnull %dummy.i.i) #2, !srcloc !1
  call void @llvm.lifetime.end(i64 0, i8* %7)
  %exitcond.i = icmp eq i64 %8, %4
  br i1 %exitcond.i, label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit, label %bb7.i

_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit: ; preds = %bb7.i
  br label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit

_ZN4test7Bencher4iter17hae343b1316e5897bE.exit:   ; preds = %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit, %entry-block
  call void @_ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE(%"1.std::time::Duration"* noalias nocapture nonnull sret dereferenceable(16) %tmp_ret2.i, %"1.std::time::Instant"* noalias nonnull readonly dereferenceable(16) %start1.i)
  %9 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 0
  %10 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 1
  %11 = load i64, i64* %9, align 8
  %12 = load i32, i32* %10, align 8
  %13 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 0
  store i64 %11, i64* %13, align 8
  %14 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 1
  store i32 %12, i32* %14, align 4
  call void @llvm.lifetime.end(i64 16, i8* %2)
  call void @llvm.lifetime.end(i64 16, i8* %1)
  ret void
}

If we compare the LLVM IR for xor_closure_b and xor_pointer_b, they look pretty similar. However, one difference stands out: the bb7.i.i.i block has been optimized out in xor_pointer_b, but not in xor_closure_b, for some reason. Here's the block:

bb7.i.i.i:                                        ; preds = %bb7.i.i.i, %bb7.i
  %iter.sroa.0.0.i.i.i = phi i32 [ 0, %bb7.i ], [ %iter.sroa.0.1.i.i.i, %bb7.i.i.i ]
  %9 = icmp slt i32 %iter.sroa.0.0.i.i.i, %8
  %10 = zext i1 %9 to i32
  %iter.sroa.0.1.i.i.i = add i32 %10, %iter.sroa.0.0.i.i.i
  br i1 %9, label %bb7.i.i.i, label %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"

And here it is translated to ASM:

.LBB1_3:
    xorl    %edi, %edi
    cmpl    %esi, %ebx
    setl    %dil
    addl    %ebx, %edi
    cmpl    %esi, %ebx
    movl    %edi, %ebx
    jl  .LBB1_3

This is a pretty dumb way to loop from 0 to 1000. Changing the code above to this:

.LBB1_3:
    cmpl    %esi, %ebx
    jge .LBB1_3a
    incl    %ebx
    cmpl    %esi, %ebx
    jl  .LBB1_3
.LBB1_3a:

makes the benchmark for xor_closure_b drop from 781 ns/iter (+/- 19) to 270 ns/iter (+/- 7) on my machine.
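
For readers who prefer Rust to assembly, the two loop shapes can be written out roughly as follows. This is only a reading aid: emitted_loop and edited_loop are hypothetical names, and an optimizing compiler would likely rewrite both functions, so they are not meant to reproduce the timings.

```rust
// Shape of the loop the compiler emitted (.LBB1_3 before the edit):
// the comparison result is materialized as 0 or 1 and added to the counter.
fn emitted_loop(n: i32) -> i32 {
    let mut i = 0;
    loop {
        let more = i < n; // cmpl + setl
        i += more as i32; // addl
        if !more {
            // second cmpl + jl: leave once the counter has reached n
            break;
        }
    }
    i
}

// Shape of the hand-edited loop (.LBB1_3 / .LBB1_3a after the edit):
// compare, increment, compare, branch.
fn edited_loop(n: i32) -> i32 {
    let mut i = 0;
    loop {
        if i >= n {
            break; // cmpl + jge .LBB1_3a
        }
        i += 1; // incl
        if i >= n {
            break; // cmpl + jl .LBB1_3 falls through
        }
    }
    i
}
```

Both functions count up to n and do nothing else, which is all that remains of the fold once its result is discarded.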

I cannot say for sure why the code generated by the compiler is so slow, or why it wasn't optimized away in the first place (as it was in xor_pointer_b)... However, it seems that xor_pointer_a and xor_closure_a are faster still because the generated code is vectorized, which has the effect that the loop performs fewer iterations (i.e. the loop is unrolled), by a factor of 32 (e.g. .LBB0_8, the main loop in xor_closure_a, performs 31 iterations, then the rest is handled after the loop).
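
The iteration count can be checked with a scalar sketch that consumes the range 32 values at a time, the way the vectorized main loop does with SIMD registers. The name xor_in_chunks is hypothetical and the code uses no SIMD itself; it only mirrors the split into a main loop and a tail.

```rust
// XORs all values in 0..n, 32 per main-loop iteration, and returns the
// result together with the number of main-loop iterations performed.
fn xor_in_chunks(n: i32) -> (i32, i32) {
    const WIDTH: i32 = 32;
    let full_chunks = if n > 0 { n / WIDTH } else { 0 };
    let mut acc = 0;
    // Main loop: one iteration per block of 32 values.
    for chunk in 0..full_chunks {
        let base = chunk * WIDTH;
        for lane in 0..WIDTH {
            acc ^= base + lane;
        }
    }
    // Tail: whatever does not fill a whole block.
    for i in (full_chunks * WIDTH)..n {
        acc ^= i;
    }
    (acc, full_chunks)
}
```

For n = 1000 the main loop runs 31 times and covers 992 values, leaving 8 for the tail, which matches the 31 iterations mentioned above.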

For reference, I compiled the edited ASM with this command line:

$ gcc target/release/xor-71758a2519026d86.s ~/.multirust/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/lib{test,term,getopts,rustc_unicode,std,libc,rand,collections,alloc_system,alloc,core,panic_unwind}-411f48d3.rlib -pthread -lpthread -lm -ldl

and I ran it with ./a.out --bench. Also, my CPU is an Intel Core i7-4770K.

Answer 1: score 4, accepted, 2016-09-12 00:00:40
