Three versions of the same C program, why is the first one so fast?

Question

Here is a very simple C program:

int main()
{
    int n = 0;

    while(n != 1000000000){
        n += 1;
    }

return n;

}

I compiled it with Clang and timed it. It ran in 4.711095243692398e-06 seconds, or 0.000004711095243692398 seconds.

Next I output the C program to Intel syntax assembly language using the Godbolt compiler explorer ( https://godbolt.org ) to remove the .cfi directives:

.file   "Svx.c"
.intel_syntax noprefix
.text
.globl  main
.type   main, @function
main:
    push    rbp
    mov rbp, rsp
    mov DWORD PTR -4[rbp], 0
    jmp .L2
.L3:
    add DWORD PTR -4[rbp], 1
.L2:
    cmp DWORD PTR -4[rbp], 1000000000
    jne .L3
    mov eax, DWORD PTR -4[rbp]
    pop rbp
    ret
    .size   main, .-main
    .ident  "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
    .section    .note.GNU-stack,"",@progbits

I compiled it with GCC and timed it. The result was 1.96 seconds -- vastly slower than the Clang version.

Finally, I created my own assembly version:

[BITS 64]
[default rel]

global main:function

section .data align=16

section .text

main:
xor rax,rax
l_01:
cmp rax,1000000000
je l_02
add rax,5
jmp l_01
l_02:
ret

I compiled it with nasm and linked it with ld :

sudo nasm -felf64 Svx.asm

sudo ld -shared Svx.o -o Svx.so

and timed it. It ran in 0.14707629615440965 seconds.

Why does the C version run so fast if the reverse-compiled version runs vastly slower ( 0.0000047 seconds vs 1.96 seconds) and my NASM version runs in 0.147 seconds? I have a feeling that the result from the C version at 0.0000047 seconds is wrong; it seems impossibly fast. This is its Clang output to assembly language:

    .text
    .intel_syntax noprefix
    .file   "Svx.c"
    .globl  main                     # -- Begin function main
    .p2align    4, 0x90
    .type   main,@function
main:                                    # @main
    .cfi_startproc
# %bb.0:
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov rbp, rsp
    .cfi_def_cfa_register rbp
    mov dword ptr [rbp - 4], 0
.LBB0_1:                                # =>This Inner Loop Header:     Depth=1
    cmp dword ptr [rbp - 4], 1000000000
    je  .LBB0_3
# %bb.2:                                #   in Loop: Header=BB0_1   Depth=1
    mov eax, dword ptr [rbp - 4]
    add eax, 1
    mov dword ptr [rbp - 4], eax
    jmp .LBB0_1
.LBB0_3:
    mov eax, dword ptr [rbp - 4]
    pop rbp
    .cfi_def_cfa rsp, 8
    ret
.Lfunc_end0:
    .size   main, .Lfunc_end0-main
    .cfi_endproc
                                        # -- End function
    .ident  "clang version 8.0.0-3~ubuntu18.04.1 (tags/RELEASE_800/final)"
    .section    ".note.GNU-stack","",@progbits
    .addrsig

The listing shows they are using the stack for variables, not a register, which is (usually) slower.

The speed, at 0.0000047 seconds, seems impossibly fast to count to a billion. If that speed is correct, what's its secret? Reverse engineering doesn't reveal anything and in fact the Godbolt version is much slower.

Answer 1

Clang is just realizing that this loop is run 1000000000 times and doing the equivalent of return 1000000000; .

My output with -O3 as you specified that you're using:

        .text
        .file   "test.c"
        .globl  main                    # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        movl    $1000000000, %eax       # imm = 0x3B9ACA00
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function

        .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

Notice the contents of main :

# %bb.0:
        movl    $1000000000, %eax       # imm = 0x3B9ACA00
        retq

This removes the loop entirely and just returns 1000000000 .

A trick to get around this is to use volatile :

int main(void)
{
    volatile int n = 0;

    while(n != 1000000000) {
        n += 1;
    }

    return n;
}

Output (again with -O3 ):

        .text
        .file   "test.c"
        .globl  main                    # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        movl    $0, -4(%rsp)
        movl    -4(%rsp), %ecx
        movl    -4(%rsp), %eax
        cmpl    $1000000000, %ecx       # imm = 0x3B9ACA00
        je      .LBB0_3
        .p2align        4, 0x90
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        addl    $1, %eax
        movl    %eax, -4(%rsp)
        movl    -4(%rsp), %ecx
        movl    -4(%rsp), %eax
        cmpl    $1000000000, %ecx       # imm = 0x3B9ACA00
        jne     .LBB0_1
.LBB0_3:
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function

        .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

Answer 2

You have 3 cases:

C with optimization enabled: replace the loop with its result: mov eax, 1000000000 / ret . The time you measured is all overhead. The compiler can easily prove that n must have that value for the loop to exit, and that the loop is not infinite.
C with optimization disabled : keep C variables in memory, including the loop counter. The loop bottlenecks on store-forwarding latency at about 6 cycles per iteration on modern x86 CPUs. ( https://agner.org/optimize/ )
Hand-written asm that loops (inefficiently) but does at least keep values in registers. So the loop-carried dependency chain (incrementing n ) only has 1 cycle latency.
And since you use add rax,5 , you're only doing 1/5th of the iterations of an n++ loop. You could think of this as unrolling by 5 and then optimizing 5x n++ into n+=5 . You can make this factor as large as you want and make the running time arbitrarily small, until you get to mov eax, 1000000000 like the compiler did.

See the first 2 on Godbolt where I used clang -O3 and gcc -O0 . Note that int n is a 32-bit variable in standard x86-64 ABIs so there's no need to spend extra code size (REX prefixes) for 64-bit operand-size.

See Why are loops always compiled into "do...while" style (tail jump)? for why efficient simple loops have the conditional branch at the bottom and no unconditional branch. Note that this is what gcc does even at -O0 (entering the loop with a jmp to the loop condition at the bottom).

Clang makes even more naive code at -O0 that has the same structure as the C with a break condition at the top and an unconditional jmp at the bottom, like your hand-written asm.

So your asm should be about 6 * 5 times faster than the anti-optimized C compiler output, or half that if it doesn't manage to run your NASM loop at 1 clock cycle per iteration. In practice you measured a factor of 13.333 which is pretty close to 15.

So you probably have an Intel before Haswell or AMD before Ryzen . More recent CPUs have a throughput of 2 branches per clock if at least one of them is not-taken.

Or a Skylake (loop-buffer disabled by a microcode update) and some front-end effect (like having the loop split across a 64-byte boundary) stopped it from issuing at 1 iter / clock because it couldn't read from uop cache fast enough.

Three versions of the same C program, why is the first one so fast?

Question

2 answers

solution1
2 ACCPTED 2019-09-22 15:36:11

solution2
2 2019-09-22 20:18:43

Three versions of the same C program, why is the first one so fast?

Question

2 answers

solution1 2 ACCPTED 2019-09-22 15:36:11

solution2 2 2019-09-22 20:18:43

solution1
2 ACCPTED 2019-09-22 15:36:11

solution2
2 2019-09-22 20:18:43