Here is a very simple C program:
int main()
{
    int n = 0;
    while (n != 1000000000) {
        n += 1;
    }
    return n;
}
I compiled it with Clang and timed it. It ran in 4.711095243692398e-06 seconds (0.000004711095243692398 seconds).
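(The timing method isn't shown in the question. For reference, a minimal in-process harness using POSIX clock_gettime might look like the sketch below, with the loop wrapped in a hypothetical run_loop function; this is an illustration, not the question's actual harness.)

#include <stdio.h>
#include <time.h>

/* hypothetical wrapper around the loop under test */
static int run_loop(void)
{
    int n = 0;
    while (n != 1000000000) {
        n += 1;
    }
    return n;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int n = run_loop();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("n = %d, %.9f seconds\n", n, elapsed);
    return 0;
}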
Next I generated Intel-syntax assembly for the C program with the Godbolt compiler explorer ( https://godbolt.org ), which strips the .cfi directives:
.file "Svx.c"
.intel_syntax noprefix
.text
.globl main
.type main, @function
main:
push rbp
mov rbp, rsp
mov DWORD PTR -4[rbp], 0
jmp .L2
.L3:
add DWORD PTR -4[rbp], 1
.L2:
cmp DWORD PTR -4[rbp], 1000000000
jne .L3
mov eax, DWORD PTR -4[rbp]
pop rbp
ret
.size main, .-main
.ident "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
.section .note.GNU-stack,"",@progbits
I compiled it with GCC and timed it. The result was 1.96 seconds, vastly slower than the Clang version.
Finally, I created my own assembly version:

[BITS 64]
[default rel]
global main:function

section .data align=16
section .text

main:
    xor rax, rax
l_01:
    cmp rax, 1000000000
    je  l_02
    add rax, 5
    jmp l_01
l_02:
    ret
I assembled it with nasm and linked it with ld:

sudo nasm -felf64 Svx.asm
sudo ld -shared Svx.o -o Svx.so

and timed it. It ran in 0.14707629615440965 seconds.
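(Note: ld -shared produces a shared object rather than a normal executable. For comparison, a conventional standalone build, assuming gcc is available to supply the C startup code, would be:)

nasm -felf64 Svx.asm
gcc Svx.o -o Svx
./Svx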
Why does the C version run so fast when its own assembly, recompiled with GCC, runs vastly slower (0.0000047 seconds vs. 1.96 seconds), and my NASM version takes 0.147 seconds? I have a feeling the 0.0000047-second result for the C version is wrong; it seems impossibly fast. This is Clang's assembly output for the program:
    .text
    .intel_syntax noprefix
    .file   "Svx.c"
    .globl  main                    # -- Begin function main
    .p2align 4, 0x90
    .type   main,@function
main:                               # @main
    .cfi_startproc
# %bb.0:
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov     rbp, rsp
    .cfi_def_cfa_register rbp
    mov     dword ptr [rbp - 4], 0
.LBB0_1:                            # =>This Inner Loop Header: Depth=1
    cmp     dword ptr [rbp - 4], 1000000000
    je      .LBB0_3
# %bb.2:                            # in Loop: Header=BB0_1 Depth=1
    mov     eax, dword ptr [rbp - 4]
    add     eax, 1
    mov     dword ptr [rbp - 4], eax
    jmp     .LBB0_1
.LBB0_3:
    mov     eax, dword ptr [rbp - 4]
    pop     rbp
    .cfi_def_cfa rsp, 8
    ret
.Lfunc_end0:
    .size   main, .Lfunc_end0-main
    .cfi_endproc
                                    # -- End function
    .ident  "clang version 8.0.0-3~ubuntu18.04.1 (tags/RELEASE_800/final)"
    .section ".note.GNU-stack","",@progbits
    .addrsig
The listing shows the loop counter is kept on the stack rather than in a register, which is (usually) slower. And 0.0000047 seconds seems impossibly fast for counting to a billion. If that timing is correct, what's the secret? Reverse-engineering the binary doesn't reveal anything, and the Godbolt version is in fact much slower.
Clang simply realizes that this loop runs 1000000000 times, proves that n must equal 1000000000 for the loop to exit, and emits the equivalent of return 1000000000;
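Spelled out as C, the optimized program behaves as if it were just (a restatement of the point above, not compiler output):

int main(void)
{
    return 1000000000;  /* the loop is folded to its final value */
}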
Here is my output with -O3 (which you said you're using):
    .text
    .file   "test.c"
    .globl  main                    # -- Begin function main
    .p2align 4, 0x90
    .type   main,@function
main:                               # @main
    .cfi_startproc
# %bb.0:
    movl    $1000000000, %eax       # imm = 0x3B9ACA00
    retq
.Lfunc_end0:
    .size   main, .Lfunc_end0-main
    .cfi_endproc
                                    # -- End function
    .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
    .section ".note.GNU-stack","",@progbits
    .addrsig
Notice the contents of main:

# %bb.0:
    movl    $1000000000, %eax       # imm = 0x3B9ACA00
    retq
This removes the loop entirely and just returns 1000000000. A trick to get around this is to use volatile:
int main(void)
{
    volatile int n = 0;
    while (n != 1000000000) {
        n += 1;
    }
    return n;
}
Output (again with -O3):
    .text
    .file   "test.c"
    .globl  main                    # -- Begin function main
    .p2align 4, 0x90
    .type   main,@function
main:                               # @main
    .cfi_startproc
# %bb.0:
    movl    $0, -4(%rsp)
    movl    -4(%rsp), %ecx
    movl    -4(%rsp), %eax
    cmpl    $1000000000, %ecx       # imm = 0x3B9ACA00
    je      .LBB0_3
    .p2align 4, 0x90
.LBB0_1:                            # =>This Inner Loop Header: Depth=1
    addl    $1, %eax
    movl    %eax, -4(%rsp)
    movl    -4(%rsp), %ecx
    movl    -4(%rsp), %eax
    cmpl    $1000000000, %ecx       # imm = 0x3B9ACA00
    jne     .LBB0_1
.LBB0_3:
    retq
.Lfunc_end0:
    .size   main, .Lfunc_end0-main
    .cfi_endproc
                                    # -- End function
    .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
    .section ".note.GNU-stack","",@progbits
    .addrsig
You have 3 cases:

1. Optimized compiler output: mov eax, 1000000000 / ret. The time you measured is all overhead. The compiler can easily prove that n must have that value for the loop to exit, and that the loop is not infinite, so it replaces the whole loop with the constant.

2. Anti-optimized (debug-mode) compiler output that keeps n in memory, so the loop-carried dependency chain runs through a store/reload of the counter every iteration. Store-forwarding latency makes that roughly 6 cycles per iteration instead of 1; at around 3 GHz, 10^9 iterations at ~6 cycles each is about 2 seconds, which matches the 1.96 seconds you measured.

3. Hand-written asm that loops (inefficiently) but does at least keep values in registers, so the loop-carried dependency chain (incrementing n) has only 1 cycle of latency. And since you use add rax,5, you're only doing 1/5th of the iterations of an n++ loop. You could think of this as unrolling by 5 and then optimizing 5x n++ into n+=5. You can make this factor as large as you want and make the running time arbitrarily small, until you get to mov eax, 1000000000 like the compiler did (see the sketch right after this list).
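As a sketch of that scaling (this variant is not from the question, just an illustration), the same loop with a step of 10 does 1/10th of the iterations; any step that divides 1000000000 evenly still terminates:

main:
    xor rax, rax            ; n = 0
l_01:
    cmp rax, 1000000000
    je  l_02
    add rax, 10             ; step of 10 => 1/10th the iterations of n++
    jmp l_01
l_02:
    ret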
See the first 2 on Godbolt, where I used clang -O3 and gcc -O0. Note that int n is a 32-bit variable in standard x86-64 ABIs, so there's no need to spend extra code size (REX prefixes) on 64-bit operand-size.
See Why are loops always compiled into "do...while" style (tail jump)? for why efficient simple loops have the conditional branch at the bottom and no unconditional branch. Note that this is what gcc does even at -O0: it enters the loop with a jmp to the loop condition at the bottom. Clang at -O0 makes even more naive code, with the same structure as the C source: a break condition at the top and an unconditional jmp at the bottom, just like your hand-written asm.
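As an illustration (not part of the original answer), the question's NASM loop rotated into that do...while shape, and using 32-bit registers to avoid the REX prefixes mentioned above, might look like:

main:
    xor eax, eax            ; n = 0 (writing eax zeros the upper half of rax)
l_01:
    add eax, 5
    cmp eax, 1000000000
    jne l_01                ; single conditional branch at the bottom, no jmp
    ret                     ; eax = 1000000000 on exit

With the cmp/jne pair macro-fused, a loop of this shape is the form that can plausibly sustain 1 iteration per clock.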
So your asm should be about 6 * 5 = 30 times faster than the anti-optimized C compiler output (roughly 6 cycles per iteration down to 1, times 1/5th the iterations), or half that, 15x, if the CPU can't run your NASM loop at 1 clock cycle per iteration. In practice you measured a factor of 1.96 / 0.147 ≈ 13.3, which is pretty close to 15.
So you probably have an Intel CPU older than Haswell or an AMD older than Ryzen; more recent CPUs have a throughput of 2 branches per clock if at least one of them is not-taken.
Or it could be a Skylake (with the loop buffer disabled by a microcode update) where some front-end effect (like the loop being split across a 64-byte boundary) stopped it from issuing at 1 iteration per clock because it couldn't read from the uop cache fast enough.