
Why is C/C++ slower than Assembly and other low-level languages?

I wrote a program that does nothing, in C++:

void main(void){

}

and in Assembly:

.global _start
.text

_start:
    mov $60, %rax
    xor %rdi, %rdi 
    syscall

I compiled the C code, and assembled and linked the Assembly code. Then I compared the two executables with the time command.

Assembly

time ./Assembly

real    0m0.001s
user    0m0.000s
sys     0m0.000s

C

time ./C

real    0m0.002s
user    0m0.000s
sys     0m0.000s

Assembly is two times faster than C. I disassembled both executables. The Assembly binary contained only the same four instructions I wrote. The C binary contained a lot of extra code linking main to _start. In main itself there were four instructions; three of them exist to make it impossible to access 'local' variables (like function variables) from outside their 'block' (like function blocks):

push %rbp        # save the caller's base pointer
mov  %rsp, %rbp  # point the base pointer at this function's stack frame, where locals are kept
pop  %rbp        # restore the caller's base pointer; this function's frame (and its locals) is gone
retq             # return to the caller

Why is that?

The amount of time required to execute the core of the program you've written is incredibly small. It consists of three or four assembly instructions, and at several gigahertz those take only a couple of nanoseconds to run. That's vastly below the detection threshold of the time command, whose resolution is measured in milliseconds (remember that a millisecond is a million times longer than a nanosecond!). So I would be very careful about judging one program to be "twice as fast" as the other; the resolution of your timer isn't high enough to say that for certain. You might just be seeing noise.

Your question, though, was why there is all this automatically generated code if nothing is going to happen. The answer is "it depends." With no optimization turned on, most compilers generate assembly code that faithfully simulates the program you wrote, possibly doing more work than is necessary. Since most C and C++ functions actually do something and need local variables, a compiler isn't too wrong to emit code at the start and end of every function that sets up the stack and frame pointer to support those variables. With optimization turned up to the max, an optimizing compiler may be smart enough to notice that this isn't necessary and remove that code, but it isn't required to.

In principle, a perfect compiler would always emit the fastest code possible, but it turns out that it's impossible to build a compiler that will always do this (this has to do with things like the undecidability of the halting problem). Therefore, it's somewhat assumed that the code generated will be good - even great - but not optimal. However, it's a tradeoff. Yes, the code might not be as fast as it could possibly be, but by working in languages like C and C++ it's possible to write large and complex programs in a way that's (compared to assembly) easy to read, easy to write, and easy to maintain. We're okay with the slight performance hit because in practice it's not too bad and most optimizing compilers are good enough to make the price negligible (or even negative, if the optimizing compiler finds a better approach to solving a problem than the human!)

To summarize:

  • Your timing mechanism is probably not sufficient to make the conclusions that you're making. You'll need a higher-precision timer than that.

  • Compilers often generate unnecessary code in the interest of simplicity. Optimizing compilers often remove that code, but can't always.

  • We're okay paying the cost of using higher-level languages in terms of raw runtime because of the ease of development. In fact, it might actually be a net win to use a high-level language with a good optimizing compiler, since it offloads the optimization complexity.

All the extra time in the C version is dynamic linker and CRT startup overhead. The asm program is statically linked and just calls exit(2) (the syscall directly, not the glibc wrapper). Of course it's faster, but it's pure startup overhead and tells you nothing about how fast compiler-emitted code that actually does something will run.

I.e., if you wrote some C code that actually did something and compiled it with gcc -O3 -march=native, you'd expect it to be only ~0.001 seconds slower than a statically linked binary with no CRT overhead. (That assumes your hand-written asm and the compiler output were both near-optimal, e.g. if you used the compiler output as a starting point for a hand-optimized version but didn't find anything major. It's usually possible to make some improvements to compiler output, but often only to code size, with little effect on speed.)

If you want to call malloc or printf, then the startup overhead is not useless; it initializes glibc's internal data structures so that library functions don't pay the cost of checking whether that stuff is initialized every time they're called.

In a statically linked hand-written asm program that links against glibc, you need to call __libc_init_first, __dl_tls_setup, and __libc_csu_init, in that order, before you can safely use all libc functions.

Anyway, ideally you can expect a constant time difference from the startup overhead, not a factor of 2 difference.


If you're good at writing optimal asm, you can usually do a better job than the compiler on a local scale, but compilers are really good at global optimizations. Moreover, they do it in seconds of CPU time (very cheap) instead of weeks of human effort (very precious).

It can make sense to hand-craft a critical loop, e.g. as part of a video encoder, but even video encoders (like x264, x265, and libvpx) have most of their logic written in C or C++, and just call asm functions.


The extra push/mov/pop instructions appear because you compiled with optimization disabled, where -fno-omit-frame-pointer is the default and makes a stack frame even for leaf functions. gcc defaults to -fomit-frame-pointer at -O1 and higher on x86 and x86-64 (modern debug-metadata formats mean the frame pointer isn't needed for debugging or for exception-handling stack unwinding).

If you'd told your C compiler to make fast code (-O3), instead of to compile quickly and make dumb code that works well in a debugger (-O0), you would have gotten code like this for main (from the Godbolt compiler explorer):

// this is valid C++ and C99, but C89 doesn't have an implicit return 0 in main.
int main(void) {}

    xor     eax, eax
    ret

To learn more about assembly and how everything works, have a look at some of the links in the tag wiki. Perhaps Programming From the Ground Up would be a good start; it probably explains compilers and dynamic linking.

A much shorter article is A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux, which starts with what you did, and then gets down to having _start overlap with some other ELF headers so the file can be even smaller.

  1. Did you compile with optimizations enabled? If not, then the comparison is invalid.

  2. Did you consider that this is a completely trivial example that will have no real-life performance implications worth writing even a postcard about?

Please write clear maintainable code and (in 99% of cases) leave the optimization to the compiler. Please.

