简体   繁体   English

评估Rust Book的基准示例时出现奇怪的结果

[英]Odd results when evaluating benchmark example from Rust Book

Benchmark tests from Rust Book Rust Book的基准测试

I am getting odd results from microbenchmarks that can be reproduced with the the example bench_xor_1000_ints from the Rust Book chapter on Benchmark tests . 我从bench_xor_1000_ints 基准测试中得到了奇怪的结果,这些结果可以通过“ 基准测试 ”的“锈皮书”一章中的示例bench_xor_1000_ints进行复制。

The section Gotcha: optimizations gives general recommendations to return a value from the Bencher::iter closure and/or to use the black_box function. “获取:优化 ”部分提供了一些一般性建议,以从black_box Bencher::iter闭包中返回一个值和/或使用black_box函数。 I ended up with these five variants: 我最终得到了以下五个变体:

Benchmark code 基准代码

#![feature(test)]

extern crate test;

use test::{Bencher, black_box};

fn xor(x: i32, y: i32) -> i32 { x ^ y }

#[bench]
fn xor_closure_a(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, |a, b| a ^ b)
    });
}

#[bench]
fn xor_closure_b(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, |a, b| a ^ b);
    });
}

#[bench]
fn xor_pointer_a(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, xor)
    });
}

#[bench]
fn xor_pointer_b(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(1000);
        (0..n).fold(0, xor);
    });
}

#[bench] // closure/pointer or semicolon doesn't affect results
fn xor_black_box(b: &mut Bencher) {
    b.iter(|| {
        black_box((0..1000).fold(0, xor));
    });
}

Benchmark naming 基准命名

  • prefix xor_closure_ : pass closure |a, b| a ^ b 前缀xor_closure_ :通过闭包|a, b| a ^ b |a, b| a ^ b
  • prefix xor_pointer_ : pass function pointer xor 前缀xor_pointer_ :传递函数指针xor
  • suffix _a : b.iter closure returns value (note lack of semicolon) 后缀_ab.iter闭包返回值(注意缺少分号)
  • suffix _b : b.iter closure doesn't return value (ends with semicolon) 后缀_bb.iter闭包不返回值(以分号结尾)
  • xor_black_box : throw everything into black_box and hope for the best xor_black_box :将所有内容都放入black_box并希望得到最好的

Benchmark results 基准结果

test tests::xor_black_box ... bench:          69 ns/iter (+/- 1)
test tests::xor_closure_a ... bench:          70 ns/iter (+/- 1)
test tests::xor_closure_b ... bench:         921 ns/iter (+/- 5)
test tests::xor_pointer_a ... bench:          60 ns/iter (+/- 1)
test tests::xor_pointer_b ... bench:           0 ns/iter (+/- 0)

Some observations regarding measurements: 有关测量的一些观察结果:

  • Removal of the black-boxed iteration counter let n = black_box(1000) and inlining it into range (0..1000).fold ... doesn't affect any of the results 删除黑盒迭代计数器, let n = black_box(1000)并将其内联到范围(0..1000).fold ...不影响任何结果
  • Scaling of the iteration counter n scales measurements appropriately (except for xor_pointer_b optimized to 0) 迭代计数器n缩放比例会适当地缩放测量值( xor_pointer_b优化为0除外)

On the whole, results agree with the general recommendations from Rust Book, except for xor_closure_b benchmark. 总体而言,除xor_closure_b基准测试外,结果与Rust Book的一般建议xor_closure_b

More specifically, results with the suffix _a (that return a value) seem to agree with the black-boxed xor_black_box which sounds good to me. 更具体地说,后缀_a (返回值)的结果似乎与黑框xor_black_box一致,这对我来说听起来不错。 And xor_pointer_b optimized to 0 seems legit since its closure doesn't return the value. 并且xor_pointer_b优化为0似乎是合法的,因为其关闭不会返回该值。 But the result of xor_closure_b is odd. 但是xor_closure_b的结果很奇怪。

Assembly code may clarify this question. 汇编代码可以澄清这个问题。 How can I prevent the Rust benchmark library from optimizing away my code? 如何防止Rust基准测试库优化我的代码? gives a good overview on reading assembly from the Rust benchmark tests. 很好地概述了如何从Rust基准测试中读取汇编。

cargo rustc --release -- --emit asm

produces this output : 产生此输出

    .text
    .file   "fpinrust.cgu-0.rs"
    .section    .rodata.cst16,"aM",@progbits,16
    .p2align    4
.LCPI0_0:
    .long   0
    .long   1
    .long   2
    .long   3
.LCPI0_1:
    .long   4
    .long   4
    .long   4
    .long   4
.LCPI0_2:
    .long   8
    .long   8
    .long   8
    .long   8
.LCPI0_3:
    .long   12
    .long   12
    .long   12
    .long   12
.LCPI0_4:
    .long   16
    .long   16
    .long   16
    .long   16
.LCPI0_5:
    .long   20
    .long   20
    .long   20
    .long   20
.LCPI0_6:
    .long   24
    .long   24
    .long   24
    .long   24
.LCPI0_7:
    .long   28
    .long   28
    .long   28
    .long   28
.LCPI0_8:
    .long   32
    .long   32
    .long   32
    .long   32
    .section    .text._ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E,@function
_ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E:
    .cfi_startproc
    pushq   %r14
.Ltmp0:
    .cfi_def_cfa_offset 16
    pushq   %rbx
.Ltmp1:
    .cfi_def_cfa_offset 24
    subq    $40, %rsp
.Ltmp2:
    .cfi_def_cfa_offset 64
.Ltmp3:
    .cfi_offset %rbx, -24
.Ltmp4:
    .cfi_offset %r14, -16
    movq    %rdi, %r14
    leaq    24(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%r14), %r9
    testq   %r9, %r9
    je  .LBB0_15
    xorl    %ecx, %ecx
    leaq    8(%rsp), %r8
    movdqa  .LCPI0_0(%rip), %xmm8
    movdqa  .LCPI0_1(%rip), %xmm15
    movdqa  .LCPI0_2(%rip), %xmm2
    movdqa  .LCPI0_3(%rip), %xmm9
    movdqa  .LCPI0_4(%rip), %xmm10
    movdqa  .LCPI0_5(%rip), %xmm11
    movdqa  .LCPI0_6(%rip), %xmm12
    movdqa  .LCPI0_7(%rip), %xmm13
    movdqa  .LCPI0_8(%rip), %xmm14
    .p2align    4, 0x90
.LBB0_2:
    incq    %rcx
    movl    $1000, 8(%rsp)
    #APP
    #NO_APP
    movl    8(%rsp), %esi
    testl   %esi, %esi
    movl    $0, %edx
    jle .LBB0_14
    xorl    %edx, %edx
    cmpl    $8, %esi
    jae .LBB0_5
    xorl    %edi, %edi
    jmp .LBB0_13
    .p2align    4, 0x90
.LBB0_5:
    movl    %esi, %eax
    andl    $-8, %eax
    movl    $0, %edi
    je  .LBB0_13
    leal    -8(%rax), %edx
    movl    %edx, %edi
    shrl    $3, %edi
    leal    1(%rdi), %ebx
    andl    $3, %ebx
    pxor    %xmm3, %xmm3
    cmpl    $24, %edx
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    jb  .LBB0_9
    leal    -1(%rbx), %edx
    subl    %edi, %edx
    pxor    %xmm3, %xmm3
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    .p2align    4, 0x90
.LBB0_8:
    movdqa  %xmm5, %xmm6
    paddd   %xmm15, %xmm6
    movdqa  %xmm5, %xmm7
    paddd   %xmm2, %xmm7
    pxor    %xmm5, %xmm3
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm4
    paddd   %xmm9, %xmm4
    movdqa  %xmm5, %xmm0
    paddd   %xmm10, %xmm0
    pxor    %xmm7, %xmm3
    movdqa  %xmm5, %xmm7
    paddd   %xmm11, %xmm7
    movdqa  %xmm5, %xmm1
    paddd   %xmm12, %xmm1
    pxor    %xmm4, %xmm7
    pxor    %xmm6, %xmm7
    movdqa  %xmm5, %xmm4
    paddd   %xmm13, %xmm4
    paddd   %xmm14, %xmm5
    pxor    %xmm0, %xmm1
    pxor    %xmm1, %xmm3
    pxor    %xmm7, %xmm4
    addl    $4, %edx
    jne .LBB0_8
.LBB0_9:
    testl   %ebx, %ebx
    je  .LBB0_12
    negl    %ebx
    .p2align    4, 0x90
.LBB0_11:
    movdqa  %xmm5, %xmm0
    pxor    %xmm5, %xmm3
    paddd   %xmm15, %xmm5
    paddd   %xmm2, %xmm0
    pxor    %xmm5, %xmm4
    incl    %ebx
    movdqa  %xmm0, %xmm5
    jne .LBB0_11
.LBB0_12:
    pxor    %xmm4, %xmm3
    pshufd  $78, %xmm3, %xmm0
    pxor    %xmm3, %xmm0
    pshufd  $229, %xmm0, %xmm1
    pxor    %xmm0, %xmm1
    movd    %xmm1, %edx
    cmpl    %eax, %esi
    movl    %eax, %edi
    je  .LBB0_14
    .p2align    4, 0x90
.LBB0_13:
    xorl    %edi, %edx
    leal    1(%rdi), %eax
    cmpl    %eax, %esi
    movl    %eax, %edi
    jne .LBB0_13
.LBB0_14:
    movl    %edx, 8(%rsp)
    #APP
    #NO_APP
    cmpq    %r9, %rcx
    jne .LBB0_2
.LBB0_15:
    leaq    8(%rsp), %rdi
    leaq    24(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    8(%rsp), %rax
    movl    16(%rsp), %ecx
    movq    %rax, 8(%r14)
    movl    %ecx, 16(%r14)
    addq    $40, %rsp
    popq    %rbx
    popq    %r14
    retq
.Lfunc_end0:
    .size   _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E, .Lfunc_end0-_ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E
    .cfi_endproc

    .section    .text._ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E,@function
_ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E:
    .cfi_startproc
    pushq   %r14
.Ltmp5:
    .cfi_def_cfa_offset 16
    pushq   %rbx
.Ltmp6:
    .cfi_def_cfa_offset 24
    subq    $40, %rsp
.Ltmp7:
    .cfi_def_cfa_offset 64
.Ltmp8:
    .cfi_offset %rbx, -24
.Ltmp9:
    .cfi_offset %r14, -16
    movq    %rdi, %r14
    leaq    24(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%r14), %rax
    testq   %rax, %rax
    je  .LBB1_5
    xorl    %ecx, %ecx
    leaq    8(%rsp), %rdx
    .p2align    4, 0x90
.LBB1_2:
    movl    $1000, 8(%rsp)
    #APP
    #NO_APP
    movl    8(%rsp), %esi
    xorl    %ebx, %ebx
    .p2align    4, 0x90
.LBB1_3:
    xorl    %edi, %edi
    cmpl    %esi, %ebx
    setl    %dil
    addl    %ebx, %edi
    cmpl    %esi, %ebx
    movl    %edi, %ebx
    jl  .LBB1_3
    incq    %rcx
    #APP
    #NO_APP
    cmpq    %rax, %rcx
    jne .LBB1_2
.LBB1_5:
    leaq    8(%rsp), %rdi
    leaq    24(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    8(%rsp), %rax
    movl    16(%rsp), %ecx
    movq    %rax, 8(%r14)
    movl    %ecx, 16(%r14)
    addq    $40, %rsp
    popq    %rbx
    popq    %r14
    retq
.Lfunc_end1:
    .size   _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E, .Lfunc_end1-_ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E
    .cfi_endproc

    .section    .rodata.cst16,"aM",@progbits,16
    .p2align    4
.LCPI2_0:
    .long   0
    .long   1
    .long   2
    .long   3
.LCPI2_1:
    .long   4
    .long   4
    .long   4
    .long   4
.LCPI2_2:
    .long   8
    .long   8
    .long   8
    .long   8
.LCPI2_3:
    .long   12
    .long   12
    .long   12
    .long   12
.LCPI2_4:
    .long   16
    .long   16
    .long   16
    .long   16
.LCPI2_5:
    .long   20
    .long   20
    .long   20
    .long   20
.LCPI2_6:
    .long   24
    .long   24
    .long   24
    .long   24
.LCPI2_7:
    .long   28
    .long   28
    .long   28
    .long   28
.LCPI2_8:
    .long   32
    .long   32
    .long   32
    .long   32
    .section    .text._ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E,@function
_ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E:
    .cfi_startproc
    pushq   %r14
.Ltmp10:
    .cfi_def_cfa_offset 16
    pushq   %rbx
.Ltmp11:
    .cfi_def_cfa_offset 24
    subq    $40, %rsp
.Ltmp12:
    .cfi_def_cfa_offset 64
.Ltmp13:
    .cfi_offset %rbx, -24
.Ltmp14:
    .cfi_offset %r14, -16
    movq    %rdi, %r14
    leaq    24(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%r14), %r9
    testq   %r9, %r9
    je  .LBB2_15
    xorl    %ecx, %ecx
    leaq    8(%rsp), %r8
    movdqa  .LCPI2_0(%rip), %xmm8
    movdqa  .LCPI2_1(%rip), %xmm15
    movdqa  .LCPI2_2(%rip), %xmm2
    movdqa  .LCPI2_3(%rip), %xmm9
    movdqa  .LCPI2_4(%rip), %xmm10
    movdqa  .LCPI2_5(%rip), %xmm11
    movdqa  .LCPI2_6(%rip), %xmm12
    movdqa  .LCPI2_7(%rip), %xmm13
    movdqa  .LCPI2_8(%rip), %xmm14
    .p2align    4, 0x90
.LBB2_2:
    incq    %rcx
    movl    $1000, 8(%rsp)
    #APP
    #NO_APP
    movl    8(%rsp), %esi
    testl   %esi, %esi
    movl    $0, %edx
    jle .LBB2_14
    xorl    %edx, %edx
    cmpl    $8, %esi
    jae .LBB2_5
    xorl    %edi, %edi
    jmp .LBB2_13
    .p2align    4, 0x90
.LBB2_5:
    movl    %esi, %eax
    andl    $-8, %eax
    movl    $0, %edi
    je  .LBB2_13
    leal    -8(%rax), %edx
    movl    %edx, %edi
    shrl    $3, %edi
    leal    1(%rdi), %ebx
    andl    $3, %ebx
    pxor    %xmm3, %xmm3
    cmpl    $24, %edx
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    jb  .LBB2_9
    leal    -1(%rbx), %edx
    subl    %edi, %edx
    pxor    %xmm3, %xmm3
    pxor    %xmm4, %xmm4
    movdqa  %xmm8, %xmm5
    .p2align    4, 0x90
.LBB2_8:
    movdqa  %xmm5, %xmm6
    paddd   %xmm15, %xmm6
    movdqa  %xmm5, %xmm7
    paddd   %xmm2, %xmm7
    pxor    %xmm5, %xmm3
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm4
    paddd   %xmm9, %xmm4
    movdqa  %xmm5, %xmm0
    paddd   %xmm10, %xmm0
    pxor    %xmm7, %xmm3
    movdqa  %xmm5, %xmm7
    paddd   %xmm11, %xmm7
    movdqa  %xmm5, %xmm1
    paddd   %xmm12, %xmm1
    pxor    %xmm4, %xmm7
    pxor    %xmm6, %xmm7
    movdqa  %xmm5, %xmm4
    paddd   %xmm13, %xmm4
    paddd   %xmm14, %xmm5
    pxor    %xmm0, %xmm1
    pxor    %xmm1, %xmm3
    pxor    %xmm7, %xmm4
    addl    $4, %edx
    jne .LBB2_8
.LBB2_9:
    testl   %ebx, %ebx
    je  .LBB2_12
    negl    %ebx
    .p2align    4, 0x90
.LBB2_11:
    movdqa  %xmm5, %xmm0
    pxor    %xmm5, %xmm3
    paddd   %xmm15, %xmm5
    paddd   %xmm2, %xmm0
    pxor    %xmm5, %xmm4
    incl    %ebx
    movdqa  %xmm0, %xmm5
    jne .LBB2_11
.LBB2_12:
    pxor    %xmm4, %xmm3
    pshufd  $78, %xmm3, %xmm0
    pxor    %xmm3, %xmm0
    pshufd  $229, %xmm0, %xmm1
    pxor    %xmm0, %xmm1
    movd    %xmm1, %edx
    cmpl    %eax, %esi
    movl    %eax, %edi
    je  .LBB2_14
    .p2align    4, 0x90
.LBB2_13:
    xorl    %edi, %edx
    leal    1(%rdi), %eax
    cmpl    %eax, %esi
    movl    %eax, %edi
    jne .LBB2_13
.LBB2_14:
    movl    %edx, 8(%rsp)
    #APP
    #NO_APP
    cmpq    %r9, %rcx
    jne .LBB2_2
.LBB2_15:
    leaq    8(%rsp), %rdi
    leaq    24(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    8(%rsp), %rax
    movl    16(%rsp), %ecx
    movq    %rax, 8(%r14)
    movl    %ecx, 16(%r14)
    addq    $40, %rsp
    popq    %rbx
    popq    %r14
    retq
.Lfunc_end2:
    .size   _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E, .Lfunc_end2-_ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E
    .cfi_endproc

    .section    .text._ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E,@function
_ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E:
    .cfi_startproc
    pushq   %rbx
.Ltmp15:
    .cfi_def_cfa_offset 16
    subq    $32, %rsp
.Ltmp16:
    .cfi_def_cfa_offset 48
.Ltmp17:
    .cfi_offset %rbx, -16
    movq    %rdi, %rbx
    leaq    16(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%rbx), %rax
    testq   %rax, %rax
    je  .LBB3_3
    leaq    (%rsp), %rcx
    .p2align    4, 0x90
.LBB3_2:
    movl    $1000, (%rsp)
    #APP
    #NO_APP
    #APP
    #NO_APP
    decq    %rax
    jne .LBB3_2
.LBB3_3:
    leaq    (%rsp), %rdi
    leaq    16(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    (%rsp), %rax
    movl    8(%rsp), %ecx
    movq    %rax, 8(%rbx)
    movl    %ecx, 16(%rbx)
    addq    $32, %rsp
    popq    %rbx
    retq
.Lfunc_end3:
    .size   _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E, .Lfunc_end3-_ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E
    .cfi_endproc

    .section    .rodata.cst16,"aM",@progbits,16
    .p2align    4
.LCPI4_0:
    .long   0
    .long   1
    .long   2
    .long   3
.LCPI4_1:
    .long   4
    .long   4
    .long   4
    .long   4
.LCPI4_2:
    .long   8
    .long   8
    .long   8
    .long   8
.LCPI4_3:
    .long   12
    .long   12
    .long   12
    .long   12
.LCPI4_4:
    .long   16
    .long   16
    .long   16
    .long   16
.LCPI4_5:
    .long   20
    .long   20
    .long   20
    .long   20
.LCPI4_6:
    .long   24
    .long   24
    .long   24
    .long   24
.LCPI4_7:
    .long   28
    .long   28
    .long   28
    .long   28
.LCPI4_8:
    .long   32
    .long   32
    .long   32
    .long   32
.LCPI4_9:
    .long   36
    .long   36
    .long   36
    .long   36
.LCPI4_10:
    .long   40
    .long   40
    .long   40
    .long   40
    .section    .text._ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E,"ax",@progbits
    .globl  _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E
    .p2align    4, 0x90
    .type   _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E,@function
_ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E:
    .cfi_startproc
    pushq   %rbx
.Ltmp18:
    .cfi_def_cfa_offset 16
    subq    $32, %rsp
.Ltmp19:
    .cfi_def_cfa_offset 48
.Ltmp20:
    .cfi_offset %rbx, -16
    movq    %rdi, %rbx
    leaq    16(%rsp), %rdi
    callq   _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
    movq    (%rbx), %rax
    testq   %rax, %rax
    je  .LBB4_5
    xorl    %ecx, %ecx
    movdqa  .LCPI4_1(%rip), %xmm9
    movdqa  .LCPI4_2(%rip), %xmm10
    movdqa  .LCPI4_3(%rip), %xmm11
    movdqa  .LCPI4_4(%rip), %xmm12
    movdqa  .LCPI4_5(%rip), %xmm13
    movdqa  .LCPI4_6(%rip), %xmm14
    movdqa  .LCPI4_7(%rip), %xmm15
    movdqa  .LCPI4_8(%rip), %xmm0
    movdqa  .LCPI4_9(%rip), %xmm1
    movdqa  .LCPI4_10(%rip), %xmm2
    leaq    (%rsp), %rdx
    .p2align    4, 0x90
.LBB4_2:
    pxor    %xmm3, %xmm3
    movl    $1000, %esi
    pxor    %xmm4, %xmm4
    movdqa  .LCPI4_0(%rip), %xmm5
    .p2align    4, 0x90
.LBB4_3:
    movdqa  %xmm5, %xmm6
    paddd   %xmm9, %xmm6
    movdqa  %xmm5, %xmm7
    paddd   %xmm10, %xmm7
    pxor    %xmm5, %xmm3
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm4
    paddd   %xmm11, %xmm4
    pxor    %xmm7, %xmm3
    movdqa  %xmm5, %xmm7
    paddd   %xmm13, %xmm7
    pxor    %xmm4, %xmm7
    movdqa  %xmm5, %xmm4
    paddd   %xmm12, %xmm4
    pxor    %xmm6, %xmm7
    movdqa  %xmm5, %xmm6
    paddd   %xmm14, %xmm6
    pxor    %xmm4, %xmm6
    movdqa  %xmm5, %xmm8
    paddd   %xmm15, %xmm8
    pxor    %xmm6, %xmm3
    movdqa  %xmm5, %xmm4
    paddd   %xmm0, %xmm4
    pxor    %xmm4, %xmm3
    movdqa  %xmm5, %xmm4
    paddd   %xmm1, %xmm4
    pxor    %xmm8, %xmm4
    pxor    %xmm7, %xmm4
    paddd   %xmm2, %xmm5
    addl    $-40, %esi
    jne .LBB4_3
    pxor    %xmm3, %xmm4
    pshufd  $78, %xmm4, %xmm3
    pxor    %xmm4, %xmm3
    pshufd  $229, %xmm3, %xmm4
    pxor    %xmm3, %xmm4
    incq    %rcx
    movd    %xmm4, (%rsp)
    #APP
    #NO_APP
    #APP
    #NO_APP
    cmpq    %rax, %rcx
    jne .LBB4_2
.LBB4_5:
    leaq    (%rsp), %rdi
    leaq    16(%rsp), %rsi
    callq   _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
    movq    (%rsp), %rax
    movl    8(%rsp), %ecx
    movq    %rax, 8(%rbx)
    movl    %ecx, 16(%rbx)
    addq    $32, %rsp
    popq    %rbx
    retq
.Lfunc_end4:
    .size   _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E, .Lfunc_end4-_ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E
    .cfi_endproc


    .section    ".note.GNU-stack","",@progbits

Looking at the assembly, xor_closure_a , xor_pointer_a and xor_black_box which all have relatively the same performance, evaluate to similar assembly code. 从汇编的xor_closure_a ,性能相对相同的xor_closure_axor_pointer_axor_black_box评估为相似的汇编代码。 Which is, by the way, a lot more instructions than the amount of ASM for poor xor_closure_b . 顺便说一句,对于较差的xor_closure_b ,这比ASM的数量xor_closure_b

This is where my research ends. 这就是我研究的终点。 I would be glad if someone explains, why this blanket of assembly produced by xor_black_box evaluates faster than xor_closure_b . 如果有人解释,我将很高兴,为什么xor_black_box产生的装配xor_black_boxxor_closure_b评估得更快。 Or put it differently, why the version that should be optimized away by the compiler runs notably slower, and which benchmark variant I should trust? 或换句话说,为什么应该由编译器优化的版本运行速度明显慢,我应该信任哪个基准测试变体?

Versions 版本号

rustc --version
rustc 1.13.0-nightly (378195665 2016-09-08)

cargo --version
cargo 0.13.0-nightly (afaffa1 2016-09-06)

Update 更新资料

@Francis Gagné did a fantastic job to investigate the issue in his answer. @FrancisGagné在回答问题时做得非常出色。 The steps that took me to reproduce his results are listed below. 下面列出了促使我重现他的结果的步骤。

Compile benchmarks and emit assembly: 编译基准并发出汇编:

cargo rustc --release -- --test --emit asm

Run produced executable to verify previous behavior: 运行生成的可执行文件以验证以前的行为:

./target/release/deps/xor --bench

running 2 tests
test tests::xor_closure_b ... bench:         925 ns/iter (+/- 9)
test tests::xor_pointer_b ... bench:           0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

Compile patched assembly output ./target/release/deps/xor.s 编译修补的程序集输出./target/release/deps/xor.s

export RUSTLIB=/path/to/lib/rustlib
gcc target/release/deps/xor.s $RUSTLIB/x86_64-unknown-linux-gnu/lib/*.rlib $RUSTLIB/x86_64-unknown-linux-gnu/lib/*.so -pthread -lpthread -lm -ldl

Run updated benchmarks: 运行更新的基准测试:

./a.out --bench

running 2 tests
test tests::xor_closure_b ... bench:           1 ns/iter (+/- 0)
test tests::xor_pointer_b ... bench:           0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

I'll focus on comparing xor_closure_b and xor_pointer_b , since they should have similar performance (ie they should both do nothing). 我将重点比较xor_closure_bxor_pointer_b ,因为它们应该具有相似的性能(即它们都不应该执行任何操作)。

Update: I made a mistake in my initial analysis, as pointed out by @EOF , so I revised the text below. 更新:正如@EOF指出的那样 ,我在最初的分析中犯了一个错误,因此我修改了以下文本。

First, let's look at the LLVM IR generated for those two functions. 首先,让我们看一下为这两个函数生成的LLVM IR。 (I find LLVM IR easier to read than ASM because it's more structured.) (我发现LLVM IR比ASM更易于阅读,因为它的结构更加合理。)

; Function Attrs: uwtable
define internal void @_ZN3xor13xor_closure_b17hb13913a8d2a27b06E(%"11.test::Bencher"* nocapture dereferenceable(32)) unnamed_addr #0 personality i32 (i32, i32, i64, %"8.unwind::libunwind::_Unwind_Exception"*, %"8.unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
  %dummy.i.i = alloca {}, align 8
  %dummy.i.i.i = alloca i32, align 4
  %start1.i = alloca %"1.std::time::Instant", align 8
  %tmp_ret2.i = alloca %"1.std::time::Duration", align 8
  %1 = bitcast %"1.std::time::Duration"* %tmp_ret2.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %1)
  %2 = bitcast %"1.std::time::Instant"* %start1.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %2)
  call void @_ZN3std4time7Instant3now17h37bccd496c61083dE(%"1.std::time::Instant"* noalias nocapture nonnull sret dereferenceable(16) %start1.i)
  %3 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 0
  %4 = load i64, i64* %3, align 8
  %5 = icmp eq i64 %4, 0
  br i1 %5, label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit, label %bb7.lr.ph.i

bb7.lr.ph.i:                                      ; preds = %entry-block
  %6 = bitcast i32* %dummy.i.i.i to i8*
  %7 = bitcast {}* %dummy.i.i to i8*
  br label %bb7.i

bb7.i:                                            ; preds = %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i", %bb7.lr.ph.i
  %iter.sroa.0.019.i = phi i64 [ 0, %bb7.lr.ph.i ], [ %11, %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i" ]
  call void @llvm.lifetime.start(i64 4, i8* %6) #2
  store i32 1000, i32* %dummy.i.i.i, align 4
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* nonnull %dummy.i.i.i) #2, !srcloc !1
  %8 = load i32, i32* %dummy.i.i.i, align 4
  call void @llvm.lifetime.end(i64 4, i8* %6) #2
  br label %bb7.i.i.i

bb7.i.i.i:                                        ; preds = %bb7.i.i.i, %bb7.i
  %iter.sroa.0.0.i.i.i = phi i32 [ 0, %bb7.i ], [ %iter.sroa.0.1.i.i.i, %bb7.i.i.i ]
  %9 = icmp slt i32 %iter.sroa.0.0.i.i.i, %8
  %10 = zext i1 %9 to i32
  %iter.sroa.0.1.i.i.i = add i32 %10, %iter.sroa.0.0.i.i.i
  br i1 %9, label %bb7.i.i.i, label %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"

"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i": ; preds = %bb7.i.i.i
  %11 = add nuw i64 %iter.sroa.0.019.i, 1
  call void @llvm.lifetime.start(i64 0, i8* %7)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"({}* nonnull %dummy.i.i) #2, !srcloc !1
  call void @llvm.lifetime.end(i64 0, i8* %7)
  %exitcond.i = icmp eq i64 %11, %4
  br i1 %exitcond.i, label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit, label %bb7.i

_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit: ; preds = %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"
  br label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit

_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit:   ; preds = %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit, %entry-block
  call void @_ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE(%"1.std::time::Duration"* noalias nocapture nonnull sret dereferenceable(16) %tmp_ret2.i, %"1.std::time::Instant"* noalias nonnull readonly dereferenceable(16) %start1.i)
  %12 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 0
  %13 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 1
  %14 = load i64, i64* %12, align 8
  %15 = load i32, i32* %13, align 8
  %16 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 0
  store i64 %14, i64* %16, align 8
  %17 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 1
  store i32 %15, i32* %17, align 4
  call void @llvm.lifetime.end(i64 16, i8* %2)
  call void @llvm.lifetime.end(i64 16, i8* %1)
  ret void
}

; Function Attrs: uwtable
define internal void @_ZN3xor13xor_pointer_b17h7ba0f9760d9fd9f8E(%"11.test::Bencher"* nocapture dereferenceable(32)) unnamed_addr #0 personality i32 (i32, i32, i64, %"8.unwind::libunwind::_Unwind_Exception"*, %"8.unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
  %dummy.i.i = alloca {}, align 8
  %dummy.i.i.i = alloca i32, align 4
  %start1.i = alloca %"1.std::time::Instant", align 8
  %tmp_ret2.i = alloca %"1.std::time::Duration", align 8
  %1 = bitcast %"1.std::time::Duration"* %tmp_ret2.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %1)
  %2 = bitcast %"1.std::time::Instant"* %start1.i to i8*
  call void @llvm.lifetime.start(i64 16, i8* %2)
  call void @_ZN3std4time7Instant3now17h37bccd496c61083dE(%"1.std::time::Instant"* noalias nocapture nonnull sret dereferenceable(16) %start1.i)
  %3 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 0
  %4 = load i64, i64* %3, align 8
  %5 = icmp eq i64 %4, 0
  br i1 %5, label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit, label %bb7.lr.ph.i

bb7.lr.ph.i:                                      ; preds = %entry-block
  %6 = bitcast i32* %dummy.i.i.i to i8*
  %7 = bitcast {}* %dummy.i.i to i8*
  br label %bb7.i

bb7.i:                                            ; preds = %bb7.i, %bb7.lr.ph.i
  %iter.sroa.0.019.i = phi i64 [ 0, %bb7.lr.ph.i ], [ %8, %bb7.i ]
  %8 = add nuw i64 %iter.sroa.0.019.i, 1
  call void @llvm.lifetime.start(i64 4, i8* %6) #2
  store i32 1000, i32* %dummy.i.i.i, align 4
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* nonnull %dummy.i.i.i) #2, !srcloc !1
  call void @llvm.lifetime.end(i64 4, i8* %6) #2
  call void @llvm.lifetime.start(i64 0, i8* %7)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"({}* nonnull %dummy.i.i) #2, !srcloc !1
  call void @llvm.lifetime.end(i64 0, i8* %7)
  %exitcond.i = icmp eq i64 %8, %4
  br i1 %exitcond.i, label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit, label %bb7.i

_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit: ; preds = %bb7.i
  br label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit

_ZN4test7Bencher4iter17hae343b1316e5897bE.exit:   ; preds = %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit, %entry-block
  call void @_ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE(%"1.std::time::Duration"* noalias nocapture nonnull sret dereferenceable(16) %tmp_ret2.i, %"1.std::time::Instant"* noalias nonnull readonly dereferenceable(16) %start1.i)
  %9 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 0
  %10 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 1
  %11 = load i64, i64* %9, align 8
  %12 = load i32, i32* %10, align 8
  %13 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 0
  store i64 %11, i64* %13, align 8
  %14 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 1
  store i32 %12, i32* %14, align 4
  call void @llvm.lifetime.end(i64 16, i8* %2)
  call void @llvm.lifetime.end(i64 16, i8* %1)
  ret void
}

If we compare the LLVM IR for xor_closure_b and xor_pointer_b , they look pretty similar. 如果比较xor_closure_bxor_pointer_b的LLVM IR,它们看起来很相似。 However, one difference stands out: the bb7.iii block has been optimized out in xor_pointer_b , but not in xor_closure_b , for some reason. 但是,有一个区别是突出的:出于某种原因, bb7.iii块已在xor_pointer_b进行了优化,但在xor_closure_b未进行优化。 Here's the block: 这是块:

bb7.i.i.i:                                        ; preds = %bb7.i.i.i, %bb7.i
  %iter.sroa.0.0.i.i.i = phi i32 [ 0, %bb7.i ], [ %iter.sroa.0.1.i.i.i, %bb7.i.i.i ]
  %9 = icmp slt i32 %iter.sroa.0.0.i.i.i, %8
  %10 = zext i1 %9 to i32
  %iter.sroa.0.1.i.i.i = add i32 %10, %iter.sroa.0.0.i.i.i
  br i1 %9, label %bb7.i.i.i, label %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"

And here it is translated to ASM: 此处将其翻译为ASM:

.LBB1_3:
    xorl    %edi, %edi
    cmpl    %esi, %ebx
    setl    %dil
    addl    %ebx, %edi
    cmpl    %esi, %ebx
    movl    %edi, %ebx
    jl  .LBB1_3

This is a pretty dumb way to loop from 0 to 1000. Changing the code above to this: 这是一种从0到1000循环的相当愚蠢的方法。将上面的代码更改为此:

.LBB1_3:
    cmpl    %esi, %ebx
    jge .LBB1_3a
    incl    %ebx
    cmpl    %esi, %ebx
    jl  .LBB1_3
.LBB1_3a:

makes the benchmark for xor_closure_b drop from 781 ns/iter (+/- 19) to 270 ns/iter (+/- 7) on my machine. 使xor_closure_b的基准在我的计算机上从781 ns/iter (+/- 19)降至270 ns/iter (+/- 7)

I cannot say for sure why the code generated by the compiler is so slow, or why it wasn't optimized away in the first place (as it was in xor_pointer_b )... However, it seems that xor_pointer_a and xor_closure_a and faster still because the generated code is vectorized, which has the effect that the loop performs fewer iterations (ie the loop is unrolled), by a factor of 32 (eg .LBB0_8 , the main loop in xor_closure_a , performs 31 iterations, then the rest is handled after the loop). 我不能肯定地说为什么编译器生成的代码这么慢,或者为什么没有首先对其进行优化(就像在xor_pointer_b )……但是,似乎xor_pointer_axor_closure_a甚至更快,仍然是因为生成的代码被矢量,其具有循环执行更少的迭代(即,循环被展开)中,由32(例如一个因素的影响.LBB0_8 ,在主回路xor_closure_a ,进行31次重复,则其余部分被后处理循环)。

For reference, I compiled the edited ASM with this command line: 作为参考,我使用以下命令行编译了已编辑的ASM:

$ gcc target/release/xor-71758a2519026d86.s ~/.multirust/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/lib{test,term,getopts,rustc_unicode,std,libc,rand,collections,alloc_system,alloc,core,panic_unwind}-411f48d3.rlib -pthread -lpthread -lm -ldl

and I ran it with ./a.out --bench . 我用./a.out --bench运行它。 Also, my CPU is an Intel Core i7-4770K. 另外,我的CPU是Intel Core i7-4770K。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM