[英]Odd results when evaluating benchmark example from Rust Book
I am getting odd results from microbenchmarks that can be reproduced with the the example bench_xor_1000_ints
from the Rust Book chapter on Benchmark tests . 我从bench_xor_1000_ints
基准测试中得到了奇怪的结果,这些结果可以通过“ 基准测试 ”的“锈皮书”一章中的示例bench_xor_1000_ints
进行复制。
The section Gotcha: optimizations gives general recommendations to return a value from the Bencher::iter
closure and/or to use the black_box
function. “获取:优化 ”部分提供了一些一般性建议,以从black_box
Bencher::iter
闭包中返回一个值和/或使用black_box
函数。 I ended up with these five variants: 我最终得到了以下五个变体:
#![feature(test)]
extern crate test;
use test::{Bencher, black_box};
fn xor(x: i32, y: i32) -> i32 { x ^ y }
#[bench]
fn xor_closure_a(b: &mut Bencher) {
b.iter(|| {
let n = black_box(1000);
(0..n).fold(0, |a, b| a ^ b)
});
}
#[bench]
fn xor_closure_b(b: &mut Bencher) {
b.iter(|| {
let n = black_box(1000);
(0..n).fold(0, |a, b| a ^ b);
});
}
#[bench]
fn xor_pointer_a(b: &mut Bencher) {
b.iter(|| {
let n = black_box(1000);
(0..n).fold(0, xor)
});
}
#[bench]
fn xor_pointer_b(b: &mut Bencher) {
b.iter(|| {
let n = black_box(1000);
(0..n).fold(0, xor);
});
}
#[bench] // closure/pointer or semicolon doesn't affect results
fn xor_black_box(b: &mut Bencher) {
b.iter(|| {
black_box((0..1000).fold(0, xor));
});
}
|a, b| a ^ b
前缀xor_closure_ :通过闭包|a, b| a ^ b
|a, b| a ^ b
xor
前缀xor_pointer_ :传递函数指针xor
b.iter
closure returns value (note lack of semicolon) 后缀_a : b.iter
闭包返回值(注意缺少分号) b.iter
closure doesn't return value (ends with semicolon) 后缀_b : b.iter
闭包不返回值(以分号结尾) black_box
and hope for the best xor_black_box :将所有内容都放入black_box
并希望得到最好的 test tests::xor_black_box ... bench: 69 ns/iter (+/- 1)
test tests::xor_closure_a ... bench: 70 ns/iter (+/- 1)
test tests::xor_closure_b ... bench: 921 ns/iter (+/- 5)
test tests::xor_pointer_a ... bench: 60 ns/iter (+/- 1)
test tests::xor_pointer_b ... bench: 0 ns/iter (+/- 0)
Some observations regarding measurements: 有关测量的一些观察结果:
let n = black_box(1000)
and inlining it into range (0..1000).fold ...
doesn't affect any of the results 删除黑盒迭代计数器, let n = black_box(1000)
并将其内联到范围(0..1000).fold ...
不影响任何结果 n
scales measurements appropriately (except for xor_pointer_b
optimized to 0) 迭代计数器n
缩放比例会适当地缩放测量值( xor_pointer_b
优化为0除外) On the whole, results agree with the general recommendations from Rust Book, except for xor_closure_b
benchmark. 总体而言,除xor_closure_b
基准测试外,结果与Rust Book的一般建议xor_closure_b
。
More specifically, results with the suffix _a (that return a value) seem to agree with the black-boxed xor_black_box which sounds good to me. 更具体地说,后缀_a (返回值)的结果似乎与黑框xor_black_box一致,这对我来说听起来不错。 And xor_pointer_b
optimized to 0 seems legit since its closure doesn't return the value. 并且xor_pointer_b
优化为0似乎是合法的,因为其关闭不会返回该值。 But the result of xor_closure_b
is odd. 但是xor_closure_b
的结果很奇怪。
Assembly code may clarify this question. 汇编代码可以澄清这个问题。 How can I prevent the Rust benchmark library from optimizing away my code? 如何防止Rust基准测试库优化我的代码? gives a good overview on reading assembly from the Rust benchmark tests. 很好地概述了如何从Rust基准测试中读取汇编。
cargo rustc --release -- --emit asm
produces this output : 产生此输出 :
.text
.file "fpinrust.cgu-0.rs"
.section .rodata.cst16,"aM",@progbits,16
.p2align 4
.LCPI0_0:
.long 0
.long 1
.long 2
.long 3
.LCPI0_1:
.long 4
.long 4
.long 4
.long 4
.LCPI0_2:
.long 8
.long 8
.long 8
.long 8
.LCPI0_3:
.long 12
.long 12
.long 12
.long 12
.LCPI0_4:
.long 16
.long 16
.long 16
.long 16
.LCPI0_5:
.long 20
.long 20
.long 20
.long 20
.LCPI0_6:
.long 24
.long 24
.long 24
.long 24
.LCPI0_7:
.long 28
.long 28
.long 28
.long 28
.LCPI0_8:
.long 32
.long 32
.long 32
.long 32
.section .text._ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E,"ax",@progbits
.globl _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E
.p2align 4, 0x90
.type _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E,@function
_ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E:
.cfi_startproc
pushq %r14
.Ltmp0:
.cfi_def_cfa_offset 16
pushq %rbx
.Ltmp1:
.cfi_def_cfa_offset 24
subq $40, %rsp
.Ltmp2:
.cfi_def_cfa_offset 64
.Ltmp3:
.cfi_offset %rbx, -24
.Ltmp4:
.cfi_offset %r14, -16
movq %rdi, %r14
leaq 24(%rsp), %rdi
callq _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
movq (%r14), %r9
testq %r9, %r9
je .LBB0_15
xorl %ecx, %ecx
leaq 8(%rsp), %r8
movdqa .LCPI0_0(%rip), %xmm8
movdqa .LCPI0_1(%rip), %xmm15
movdqa .LCPI0_2(%rip), %xmm2
movdqa .LCPI0_3(%rip), %xmm9
movdqa .LCPI0_4(%rip), %xmm10
movdqa .LCPI0_5(%rip), %xmm11
movdqa .LCPI0_6(%rip), %xmm12
movdqa .LCPI0_7(%rip), %xmm13
movdqa .LCPI0_8(%rip), %xmm14
.p2align 4, 0x90
.LBB0_2:
incq %rcx
movl $1000, 8(%rsp)
#APP
#NO_APP
movl 8(%rsp), %esi
testl %esi, %esi
movl $0, %edx
jle .LBB0_14
xorl %edx, %edx
cmpl $8, %esi
jae .LBB0_5
xorl %edi, %edi
jmp .LBB0_13
.p2align 4, 0x90
.LBB0_5:
movl %esi, %eax
andl $-8, %eax
movl $0, %edi
je .LBB0_13
leal -8(%rax), %edx
movl %edx, %edi
shrl $3, %edi
leal 1(%rdi), %ebx
andl $3, %ebx
pxor %xmm3, %xmm3
cmpl $24, %edx
pxor %xmm4, %xmm4
movdqa %xmm8, %xmm5
jb .LBB0_9
leal -1(%rbx), %edx
subl %edi, %edx
pxor %xmm3, %xmm3
pxor %xmm4, %xmm4
movdqa %xmm8, %xmm5
.p2align 4, 0x90
.LBB0_8:
movdqa %xmm5, %xmm6
paddd %xmm15, %xmm6
movdqa %xmm5, %xmm7
paddd %xmm2, %xmm7
pxor %xmm5, %xmm3
pxor %xmm4, %xmm6
movdqa %xmm5, %xmm4
paddd %xmm9, %xmm4
movdqa %xmm5, %xmm0
paddd %xmm10, %xmm0
pxor %xmm7, %xmm3
movdqa %xmm5, %xmm7
paddd %xmm11, %xmm7
movdqa %xmm5, %xmm1
paddd %xmm12, %xmm1
pxor %xmm4, %xmm7
pxor %xmm6, %xmm7
movdqa %xmm5, %xmm4
paddd %xmm13, %xmm4
paddd %xmm14, %xmm5
pxor %xmm0, %xmm1
pxor %xmm1, %xmm3
pxor %xmm7, %xmm4
addl $4, %edx
jne .LBB0_8
.LBB0_9:
testl %ebx, %ebx
je .LBB0_12
negl %ebx
.p2align 4, 0x90
.LBB0_11:
movdqa %xmm5, %xmm0
pxor %xmm5, %xmm3
paddd %xmm15, %xmm5
paddd %xmm2, %xmm0
pxor %xmm5, %xmm4
incl %ebx
movdqa %xmm0, %xmm5
jne .LBB0_11
.LBB0_12:
pxor %xmm4, %xmm3
pshufd $78, %xmm3, %xmm0
pxor %xmm3, %xmm0
pshufd $229, %xmm0, %xmm1
pxor %xmm0, %xmm1
movd %xmm1, %edx
cmpl %eax, %esi
movl %eax, %edi
je .LBB0_14
.p2align 4, 0x90
.LBB0_13:
xorl %edi, %edx
leal 1(%rdi), %eax
cmpl %eax, %esi
movl %eax, %edi
jne .LBB0_13
.LBB0_14:
movl %edx, 8(%rsp)
#APP
#NO_APP
cmpq %r9, %rcx
jne .LBB0_2
.LBB0_15:
leaq 8(%rsp), %rdi
leaq 24(%rsp), %rsi
callq _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
movq 8(%rsp), %rax
movl 16(%rsp), %ecx
movq %rax, 8(%r14)
movl %ecx, 16(%r14)
addq $40, %rsp
popq %rbx
popq %r14
retq
.Lfunc_end0:
.size _ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E, .Lfunc_end0-_ZN8fpinrust5tests13xor_closure_a17h4df097d1e565a700E
.cfi_endproc
.section .text._ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E,"ax",@progbits
.globl _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E
.p2align 4, 0x90
.type _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E,@function
_ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E:
.cfi_startproc
pushq %r14
.Ltmp5:
.cfi_def_cfa_offset 16
pushq %rbx
.Ltmp6:
.cfi_def_cfa_offset 24
subq $40, %rsp
.Ltmp7:
.cfi_def_cfa_offset 64
.Ltmp8:
.cfi_offset %rbx, -24
.Ltmp9:
.cfi_offset %r14, -16
movq %rdi, %r14
leaq 24(%rsp), %rdi
callq _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
movq (%r14), %rax
testq %rax, %rax
je .LBB1_5
xorl %ecx, %ecx
leaq 8(%rsp), %rdx
.p2align 4, 0x90
.LBB1_2:
movl $1000, 8(%rsp)
#APP
#NO_APP
movl 8(%rsp), %esi
xorl %ebx, %ebx
.p2align 4, 0x90
.LBB1_3:
xorl %edi, %edi
cmpl %esi, %ebx
setl %dil
addl %ebx, %edi
cmpl %esi, %ebx
movl %edi, %ebx
jl .LBB1_3
incq %rcx
#APP
#NO_APP
cmpq %rax, %rcx
jne .LBB1_2
.LBB1_5:
leaq 8(%rsp), %rdi
leaq 24(%rsp), %rsi
callq _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
movq 8(%rsp), %rax
movl 16(%rsp), %ecx
movq %rax, 8(%r14)
movl %ecx, 16(%r14)
addq $40, %rsp
popq %rbx
popq %r14
retq
.Lfunc_end1:
.size _ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E, .Lfunc_end1-_ZN8fpinrust5tests13xor_closure_b17h4bdd5e59e5c19a55E
.cfi_endproc
.section .rodata.cst16,"aM",@progbits,16
.p2align 4
.LCPI2_0:
.long 0
.long 1
.long 2
.long 3
.LCPI2_1:
.long 4
.long 4
.long 4
.long 4
.LCPI2_2:
.long 8
.long 8
.long 8
.long 8
.LCPI2_3:
.long 12
.long 12
.long 12
.long 12
.LCPI2_4:
.long 16
.long 16
.long 16
.long 16
.LCPI2_5:
.long 20
.long 20
.long 20
.long 20
.LCPI2_6:
.long 24
.long 24
.long 24
.long 24
.LCPI2_7:
.long 28
.long 28
.long 28
.long 28
.LCPI2_8:
.long 32
.long 32
.long 32
.long 32
.section .text._ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E,"ax",@progbits
.globl _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E
.p2align 4, 0x90
.type _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E,@function
_ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E:
.cfi_startproc
pushq %r14
.Ltmp10:
.cfi_def_cfa_offset 16
pushq %rbx
.Ltmp11:
.cfi_def_cfa_offset 24
subq $40, %rsp
.Ltmp12:
.cfi_def_cfa_offset 64
.Ltmp13:
.cfi_offset %rbx, -24
.Ltmp14:
.cfi_offset %r14, -16
movq %rdi, %r14
leaq 24(%rsp), %rdi
callq _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
movq (%r14), %r9
testq %r9, %r9
je .LBB2_15
xorl %ecx, %ecx
leaq 8(%rsp), %r8
movdqa .LCPI2_0(%rip), %xmm8
movdqa .LCPI2_1(%rip), %xmm15
movdqa .LCPI2_2(%rip), %xmm2
movdqa .LCPI2_3(%rip), %xmm9
movdqa .LCPI2_4(%rip), %xmm10
movdqa .LCPI2_5(%rip), %xmm11
movdqa .LCPI2_6(%rip), %xmm12
movdqa .LCPI2_7(%rip), %xmm13
movdqa .LCPI2_8(%rip), %xmm14
.p2align 4, 0x90
.LBB2_2:
incq %rcx
movl $1000, 8(%rsp)
#APP
#NO_APP
movl 8(%rsp), %esi
testl %esi, %esi
movl $0, %edx
jle .LBB2_14
xorl %edx, %edx
cmpl $8, %esi
jae .LBB2_5
xorl %edi, %edi
jmp .LBB2_13
.p2align 4, 0x90
.LBB2_5:
movl %esi, %eax
andl $-8, %eax
movl $0, %edi
je .LBB2_13
leal -8(%rax), %edx
movl %edx, %edi
shrl $3, %edi
leal 1(%rdi), %ebx
andl $3, %ebx
pxor %xmm3, %xmm3
cmpl $24, %edx
pxor %xmm4, %xmm4
movdqa %xmm8, %xmm5
jb .LBB2_9
leal -1(%rbx), %edx
subl %edi, %edx
pxor %xmm3, %xmm3
pxor %xmm4, %xmm4
movdqa %xmm8, %xmm5
.p2align 4, 0x90
.LBB2_8:
movdqa %xmm5, %xmm6
paddd %xmm15, %xmm6
movdqa %xmm5, %xmm7
paddd %xmm2, %xmm7
pxor %xmm5, %xmm3
pxor %xmm4, %xmm6
movdqa %xmm5, %xmm4
paddd %xmm9, %xmm4
movdqa %xmm5, %xmm0
paddd %xmm10, %xmm0
pxor %xmm7, %xmm3
movdqa %xmm5, %xmm7
paddd %xmm11, %xmm7
movdqa %xmm5, %xmm1
paddd %xmm12, %xmm1
pxor %xmm4, %xmm7
pxor %xmm6, %xmm7
movdqa %xmm5, %xmm4
paddd %xmm13, %xmm4
paddd %xmm14, %xmm5
pxor %xmm0, %xmm1
pxor %xmm1, %xmm3
pxor %xmm7, %xmm4
addl $4, %edx
jne .LBB2_8
.LBB2_9:
testl %ebx, %ebx
je .LBB2_12
negl %ebx
.p2align 4, 0x90
.LBB2_11:
movdqa %xmm5, %xmm0
pxor %xmm5, %xmm3
paddd %xmm15, %xmm5
paddd %xmm2, %xmm0
pxor %xmm5, %xmm4
incl %ebx
movdqa %xmm0, %xmm5
jne .LBB2_11
.LBB2_12:
pxor %xmm4, %xmm3
pshufd $78, %xmm3, %xmm0
pxor %xmm3, %xmm0
pshufd $229, %xmm0, %xmm1
pxor %xmm0, %xmm1
movd %xmm1, %edx
cmpl %eax, %esi
movl %eax, %edi
je .LBB2_14
.p2align 4, 0x90
.LBB2_13:
xorl %edi, %edx
leal 1(%rdi), %eax
cmpl %eax, %esi
movl %eax, %edi
jne .LBB2_13
.LBB2_14:
movl %edx, 8(%rsp)
#APP
#NO_APP
cmpq %r9, %rcx
jne .LBB2_2
.LBB2_15:
leaq 8(%rsp), %rdi
leaq 24(%rsp), %rsi
callq _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
movq 8(%rsp), %rax
movl 16(%rsp), %ecx
movq %rax, 8(%r14)
movl %ecx, 16(%r14)
addq $40, %rsp
popq %rbx
popq %r14
retq
.Lfunc_end2:
.size _ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E, .Lfunc_end2-_ZN8fpinrust5tests13xor_pointer_a17hca6c00df2b597ce7E
.cfi_endproc
.section .text._ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E,"ax",@progbits
.globl _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E
.p2align 4, 0x90
.type _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E,@function
_ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E:
.cfi_startproc
pushq %rbx
.Ltmp15:
.cfi_def_cfa_offset 16
subq $32, %rsp
.Ltmp16:
.cfi_def_cfa_offset 48
.Ltmp17:
.cfi_offset %rbx, -16
movq %rdi, %rbx
leaq 16(%rsp), %rdi
callq _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
movq (%rbx), %rax
testq %rax, %rax
je .LBB3_3
leaq (%rsp), %rcx
.p2align 4, 0x90
.LBB3_2:
movl $1000, (%rsp)
#APP
#NO_APP
#APP
#NO_APP
decq %rax
jne .LBB3_2
.LBB3_3:
leaq (%rsp), %rdi
leaq 16(%rsp), %rsi
callq _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
movq (%rsp), %rax
movl 8(%rsp), %ecx
movq %rax, 8(%rbx)
movl %ecx, 16(%rbx)
addq $32, %rsp
popq %rbx
retq
.Lfunc_end3:
.size _ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E, .Lfunc_end3-_ZN8fpinrust5tests13xor_pointer_b17h4ef66678d6655ef4E
.cfi_endproc
.section .rodata.cst16,"aM",@progbits,16
.p2align 4
.LCPI4_0:
.long 0
.long 1
.long 2
.long 3
.LCPI4_1:
.long 4
.long 4
.long 4
.long 4
.LCPI4_2:
.long 8
.long 8
.long 8
.long 8
.LCPI4_3:
.long 12
.long 12
.long 12
.long 12
.LCPI4_4:
.long 16
.long 16
.long 16
.long 16
.LCPI4_5:
.long 20
.long 20
.long 20
.long 20
.LCPI4_6:
.long 24
.long 24
.long 24
.long 24
.LCPI4_7:
.long 28
.long 28
.long 28
.long 28
.LCPI4_8:
.long 32
.long 32
.long 32
.long 32
.LCPI4_9:
.long 36
.long 36
.long 36
.long 36
.LCPI4_10:
.long 40
.long 40
.long 40
.long 40
.section .text._ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E,"ax",@progbits
.globl _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E
.p2align 4, 0x90
.type _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E,@function
_ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E:
.cfi_startproc
pushq %rbx
.Ltmp18:
.cfi_def_cfa_offset 16
subq $32, %rsp
.Ltmp19:
.cfi_def_cfa_offset 48
.Ltmp20:
.cfi_offset %rbx, -16
movq %rdi, %rbx
leaq 16(%rsp), %rdi
callq _ZN3std4time7Instant3now17h37bccd496c61083dE@PLT
movq (%rbx), %rax
testq %rax, %rax
je .LBB4_5
xorl %ecx, %ecx
movdqa .LCPI4_1(%rip), %xmm9
movdqa .LCPI4_2(%rip), %xmm10
movdqa .LCPI4_3(%rip), %xmm11
movdqa .LCPI4_4(%rip), %xmm12
movdqa .LCPI4_5(%rip), %xmm13
movdqa .LCPI4_6(%rip), %xmm14
movdqa .LCPI4_7(%rip), %xmm15
movdqa .LCPI4_8(%rip), %xmm0
movdqa .LCPI4_9(%rip), %xmm1
movdqa .LCPI4_10(%rip), %xmm2
leaq (%rsp), %rdx
.p2align 4, 0x90
.LBB4_2:
pxor %xmm3, %xmm3
movl $1000, %esi
pxor %xmm4, %xmm4
movdqa .LCPI4_0(%rip), %xmm5
.p2align 4, 0x90
.LBB4_3:
movdqa %xmm5, %xmm6
paddd %xmm9, %xmm6
movdqa %xmm5, %xmm7
paddd %xmm10, %xmm7
pxor %xmm5, %xmm3
pxor %xmm4, %xmm6
movdqa %xmm5, %xmm4
paddd %xmm11, %xmm4
pxor %xmm7, %xmm3
movdqa %xmm5, %xmm7
paddd %xmm13, %xmm7
pxor %xmm4, %xmm7
movdqa %xmm5, %xmm4
paddd %xmm12, %xmm4
pxor %xmm6, %xmm7
movdqa %xmm5, %xmm6
paddd %xmm14, %xmm6
pxor %xmm4, %xmm6
movdqa %xmm5, %xmm8
paddd %xmm15, %xmm8
pxor %xmm6, %xmm3
movdqa %xmm5, %xmm4
paddd %xmm0, %xmm4
pxor %xmm4, %xmm3
movdqa %xmm5, %xmm4
paddd %xmm1, %xmm4
pxor %xmm8, %xmm4
pxor %xmm7, %xmm4
paddd %xmm2, %xmm5
addl $-40, %esi
jne .LBB4_3
pxor %xmm3, %xmm4
pshufd $78, %xmm4, %xmm3
pxor %xmm4, %xmm3
pshufd $229, %xmm3, %xmm4
pxor %xmm3, %xmm4
incq %rcx
movd %xmm4, (%rsp)
#APP
#NO_APP
#APP
#NO_APP
cmpq %rax, %rcx
jne .LBB4_2
.LBB4_5:
leaq (%rsp), %rdi
leaq 16(%rsp), %rsi
callq _ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE@PLT
movq (%rsp), %rax
movl 8(%rsp), %ecx
movq %rax, 8(%rbx)
movl %ecx, 16(%rbx)
addq $32, %rsp
popq %rbx
retq
.Lfunc_end4:
.size _ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E, .Lfunc_end4-_ZN8fpinrust5tests13xor_black_box17h8af7e6f80fe83dc8E
.cfi_endproc
.section ".note.GNU-stack","",@progbits
Looking at the assembly, xor_closure_a
, xor_pointer_a
and xor_black_box
which all have relatively the same performance, evaluate to similar assembly code. 从汇编的xor_closure_a
,性能相对相同的xor_closure_a
, xor_pointer_a
和xor_black_box
评估为相似的汇编代码。 Which is, by the way, a lot more instructions than the amount of ASM for poor xor_closure_b
. 顺便说一句,对于较差的xor_closure_b
,这比ASM的数量xor_closure_b
。
This is where my research ends. 这就是我研究的终点。 I would be glad if someone explains, why this blanket of assembly produced by xor_black_box
evaluates faster than xor_closure_b
. 如果有人解释,我将很高兴,为什么xor_black_box
产生的装配xor_black_box
比xor_closure_b
评估得更快。 Or put it differently, why the version that should be optimized away by the compiler runs notably slower, and which benchmark variant I should trust? 或换句话说,为什么应该由编译器优化的版本运行速度明显慢,我应该信任哪个基准测试变体?
rustc --version
rustc 1.13.0-nightly (378195665 2016-09-08)
cargo --version
cargo 0.13.0-nightly (afaffa1 2016-09-06)
@Francis Gagné did a fantastic job to investigate the issue in his answer. @FrancisGagné在回答问题时做得非常出色。 The steps that took me to reproduce his results are listed below. 下面列出了促使我重现他的结果的步骤。
Compile benchmarks and emit assembly: 编译基准并发出汇编:
cargo rustc --release -- --test --emit asm
Run produced executable to verify previous behavior: 运行生成的可执行文件以验证以前的行为:
./target/release/deps/xor --bench
running 2 tests
test tests::xor_closure_b ... bench: 925 ns/iter (+/- 9)
test tests::xor_pointer_b ... bench: 0 ns/iter (+/- 0)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
Compile patched assembly output ./target/release/deps/xor.s
编译修补的程序集输出./target/release/deps/xor.s
export RUSTLIB=/path/to/lib/rustlib
gcc target/release/deps/xor.s $RUSTLIB/x86_64-unknown-linux-gnu/lib/*.rlib $RUSTLIB/x86_64-unknown-linux-gnu/lib/*.so -pthread -lpthread -lm -ldl
Run updated benchmarks: 运行更新的基准测试:
./a.out --bench
running 2 tests
test tests::xor_closure_b ... bench: 1 ns/iter (+/- 0)
test tests::xor_pointer_b ... bench: 0 ns/iter (+/- 0)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
I'll focus on comparing xor_closure_b
and xor_pointer_b
, since they should have similar performance (ie they should both do nothing). 我将重点比较xor_closure_b
和xor_pointer_b
,因为它们应该具有相似的性能(即它们都不应该执行任何操作)。
Update: I made a mistake in my initial analysis, as pointed out by @EOF , so I revised the text below. 更新:正如@EOF指出的那样 ,我在最初的分析中犯了一个错误,因此我修改了以下文本。
First, let's look at the LLVM IR generated for those two functions. 首先,让我们看一下为这两个函数生成的LLVM IR。 (I find LLVM IR easier to read than ASM because it's more structured.) (我发现LLVM IR比ASM更易于阅读,因为它的结构更加合理。)
; Function Attrs: uwtable
define internal void @_ZN3xor13xor_closure_b17hb13913a8d2a27b06E(%"11.test::Bencher"* nocapture dereferenceable(32)) unnamed_addr #0 personality i32 (i32, i32, i64, %"8.unwind::libunwind::_Unwind_Exception"*, %"8.unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
%dummy.i.i = alloca {}, align 8
%dummy.i.i.i = alloca i32, align 4
%start1.i = alloca %"1.std::time::Instant", align 8
%tmp_ret2.i = alloca %"1.std::time::Duration", align 8
%1 = bitcast %"1.std::time::Duration"* %tmp_ret2.i to i8*
call void @llvm.lifetime.start(i64 16, i8* %1)
%2 = bitcast %"1.std::time::Instant"* %start1.i to i8*
call void @llvm.lifetime.start(i64 16, i8* %2)
call void @_ZN3std4time7Instant3now17h37bccd496c61083dE(%"1.std::time::Instant"* noalias nocapture nonnull sret dereferenceable(16) %start1.i)
%3 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 0
%4 = load i64, i64* %3, align 8
%5 = icmp eq i64 %4, 0
br i1 %5, label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit, label %bb7.lr.ph.i
bb7.lr.ph.i: ; preds = %entry-block
%6 = bitcast i32* %dummy.i.i.i to i8*
%7 = bitcast {}* %dummy.i.i to i8*
br label %bb7.i
bb7.i: ; preds = %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i", %bb7.lr.ph.i
%iter.sroa.0.019.i = phi i64 [ 0, %bb7.lr.ph.i ], [ %11, %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i" ]
call void @llvm.lifetime.start(i64 4, i8* %6) #2
store i32 1000, i32* %dummy.i.i.i, align 4
call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* nonnull %dummy.i.i.i) #2, !srcloc !1
%8 = load i32, i32* %dummy.i.i.i, align 4
call void @llvm.lifetime.end(i64 4, i8* %6) #2
br label %bb7.i.i.i
bb7.i.i.i: ; preds = %bb7.i.i.i, %bb7.i
%iter.sroa.0.0.i.i.i = phi i32 [ 0, %bb7.i ], [ %iter.sroa.0.1.i.i.i, %bb7.i.i.i ]
%9 = icmp slt i32 %iter.sroa.0.0.i.i.i, %8
%10 = zext i1 %9 to i32
%iter.sroa.0.1.i.i.i = add i32 %10, %iter.sroa.0.0.i.i.i
br i1 %9, label %bb7.i.i.i, label %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"
"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i": ; preds = %bb7.i.i.i
%11 = add nuw i64 %iter.sroa.0.019.i, 1
call void @llvm.lifetime.start(i64 0, i8* %7)
call void asm "", "r,~{dirflag},~{fpsr},~{flags}"({}* nonnull %dummy.i.i) #2, !srcloc !1
call void @llvm.lifetime.end(i64 0, i8* %7)
%exitcond.i = icmp eq i64 %11, %4
br i1 %exitcond.i, label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit, label %bb7.i
_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit: ; preds = %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"
br label %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit
_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit: ; preds = %_ZN4test7Bencher4iter17h0cab611e22e5c5faE.exit.loopexit, %entry-block
call void @_ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE(%"1.std::time::Duration"* noalias nocapture nonnull sret dereferenceable(16) %tmp_ret2.i, %"1.std::time::Instant"* noalias nonnull readonly dereferenceable(16) %start1.i)
%12 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 0
%13 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 1
%14 = load i64, i64* %12, align 8
%15 = load i32, i32* %13, align 8
%16 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 0
store i64 %14, i64* %16, align 8
%17 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 1
store i32 %15, i32* %17, align 4
call void @llvm.lifetime.end(i64 16, i8* %2)
call void @llvm.lifetime.end(i64 16, i8* %1)
ret void
}
; Function Attrs: uwtable
define internal void @_ZN3xor13xor_pointer_b17h7ba0f9760d9fd9f8E(%"11.test::Bencher"* nocapture dereferenceable(32)) unnamed_addr #0 personality i32 (i32, i32, i64, %"8.unwind::libunwind::_Unwind_Exception"*, %"8.unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
%dummy.i.i = alloca {}, align 8
%dummy.i.i.i = alloca i32, align 4
%start1.i = alloca %"1.std::time::Instant", align 8
%tmp_ret2.i = alloca %"1.std::time::Duration", align 8
%1 = bitcast %"1.std::time::Duration"* %tmp_ret2.i to i8*
call void @llvm.lifetime.start(i64 16, i8* %1)
%2 = bitcast %"1.std::time::Instant"* %start1.i to i8*
call void @llvm.lifetime.start(i64 16, i8* %2)
call void @_ZN3std4time7Instant3now17h37bccd496c61083dE(%"1.std::time::Instant"* noalias nocapture nonnull sret dereferenceable(16) %start1.i)
%3 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 0
%4 = load i64, i64* %3, align 8
%5 = icmp eq i64 %4, 0
br i1 %5, label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit, label %bb7.lr.ph.i
bb7.lr.ph.i: ; preds = %entry-block
%6 = bitcast i32* %dummy.i.i.i to i8*
%7 = bitcast {}* %dummy.i.i to i8*
br label %bb7.i
bb7.i: ; preds = %bb7.i, %bb7.lr.ph.i
%iter.sroa.0.019.i = phi i64 [ 0, %bb7.lr.ph.i ], [ %8, %bb7.i ]
%8 = add nuw i64 %iter.sroa.0.019.i, 1
call void @llvm.lifetime.start(i64 4, i8* %6) #2
store i32 1000, i32* %dummy.i.i.i, align 4
call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* nonnull %dummy.i.i.i) #2, !srcloc !1
call void @llvm.lifetime.end(i64 4, i8* %6) #2
call void @llvm.lifetime.start(i64 0, i8* %7)
call void asm "", "r,~{dirflag},~{fpsr},~{flags}"({}* nonnull %dummy.i.i) #2, !srcloc !1
call void @llvm.lifetime.end(i64 0, i8* %7)
%exitcond.i = icmp eq i64 %8, %4
br i1 %exitcond.i, label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit, label %bb7.i
_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit: ; preds = %bb7.i
br label %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit
_ZN4test7Bencher4iter17hae343b1316e5897bE.exit: ; preds = %_ZN4test7Bencher4iter17hae343b1316e5897bE.exit.loopexit, %entry-block
call void @_ZN3std4time7Instant7elapsed17h0b6076720ddfcc2bE(%"1.std::time::Duration"* noalias nocapture nonnull sret dereferenceable(16) %tmp_ret2.i, %"1.std::time::Instant"* noalias nonnull readonly dereferenceable(16) %start1.i)
%9 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 0
%10 = getelementptr inbounds %"1.std::time::Duration", %"1.std::time::Duration"* %tmp_ret2.i, i64 0, i32 1
%11 = load i64, i64* %9, align 8
%12 = load i32, i32* %10, align 8
%13 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 0
store i64 %11, i64* %13, align 8
%14 = getelementptr inbounds %"11.test::Bencher", %"11.test::Bencher"* %0, i64 0, i32 1, i32 1
store i32 %12, i32* %14, align 4
call void @llvm.lifetime.end(i64 16, i8* %2)
call void @llvm.lifetime.end(i64 16, i8* %1)
ret void
}
If we compare the LLVM IR for xor_closure_b
and xor_pointer_b
, they look pretty similar. 如果比较xor_closure_b
和xor_pointer_b
的LLVM IR,它们看起来很相似。 However, one difference stands out: the bb7.iii
block has been optimized out in xor_pointer_b
, but not in xor_closure_b
, for some reason. 但是,有一个区别是突出的:出于某种原因, bb7.iii
块已在xor_pointer_b
进行了优化,但在xor_closure_b
未进行优化。 Here's the block: 这是块:
bb7.i.i.i: ; preds = %bb7.i.i.i, %bb7.i
%iter.sroa.0.0.i.i.i = phi i32 [ 0, %bb7.i ], [ %iter.sroa.0.1.i.i.i, %bb7.i.i.i ]
%9 = icmp slt i32 %iter.sroa.0.0.i.i.i, %8
%10 = zext i1 %9 to i32
%iter.sroa.0.1.i.i.i = add i32 %10, %iter.sroa.0.0.i.i.i
br i1 %9, label %bb7.i.i.i, label %"_ZN3xor13xor_closure_b28_$u7b$$u7b$closure$u7d$$u7d$17hbaf8d82981c57ba0E.exit.i"
And here it is translated to ASM: 此处将其翻译为ASM:
.LBB1_3:
xorl %edi, %edi
cmpl %esi, %ebx
setl %dil
addl %ebx, %edi
cmpl %esi, %ebx
movl %edi, %ebx
jl .LBB1_3
This is a pretty dumb way to loop from 0 to 1000. Changing the code above to this: 这是一种从0到1000循环的相当愚蠢的方法。将上面的代码更改为此:
.LBB1_3:
cmpl %esi, %ebx
jge .LBB1_3a
incl %ebx
cmpl %esi, %ebx
jl .LBB1_3
.LBB1_3a:
makes the benchmark for xor_closure_b
drop from 781 ns/iter (+/- 19)
to 270 ns/iter (+/- 7)
on my machine. 使xor_closure_b
的基准在我的计算机上从781 ns/iter (+/- 19)
降至270 ns/iter (+/- 7)
。
I cannot say for sure why the code generated by the compiler is so slow, or why it wasn't optimized away in the first place (as it was in xor_pointer_b
)... However, it seems that xor_pointer_a
and xor_closure_a
and faster still because the generated code is vectorized, which has the effect that the loop performs fewer iterations (ie the loop is unrolled), by a factor of 32 (eg .LBB0_8
, the main loop in xor_closure_a
, performs 31 iterations, then the rest is handled after the loop). 我不能肯定地说为什么编译器生成的代码这么慢,或者为什么没有首先对其进行优化(就像在xor_pointer_b
)……但是,似乎xor_pointer_a
和xor_closure_a
甚至更快,仍然是因为生成的代码被矢量,其具有循环执行更少的迭代(即,循环被展开)中,由32(例如一个因素的影响.LBB0_8
,在主回路xor_closure_a
,进行31次重复,则其余部分被后处理循环)。
For reference, I compiled the edited ASM with this command line: 作为参考,我使用以下命令行编译了已编辑的ASM:
$ gcc target/release/xor-71758a2519026d86.s ~/.multirust/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/lib{test,term,getopts,rustc_unicode,std,libc,rand,collections,alloc_system,alloc,core,panic_unwind}-411f48d3.rlib -pthread -lpthread -lm -ldl
and I ran it with ./a.out --bench
. 我用./a.out --bench
运行它。 Also, my CPU is an Intel Core i7-4770K. 另外,我的CPU是Intel Core i7-4770K。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.