Consider this simple code:
#include <complex.h>
complex double f(complex double x, complex double y) {
return x/y;
}
In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:
f:
vmulsd xmm4, xmm1, xmm3
vmovapd xmm6, xmm0
vmulsd xmm5, xmm3, xmm3
vmulsd xmm6, xmm6, xmm3
vfmadd231sd xmm4, xmm0, xmm2
vfmadd231sd xmm5, xmm2, xmm2
vfmsub132sd xmm1, xmm6, xmm2
vdivsd xmm0, xmm4, xmm5
vdivsd xmm1, xmm1, xmm5
ret
This makes sense and is easy to understand. However, the Intel C Compiler gives:
f:
fld1 #3.12
vmovsd QWORD PTR [-24+rsp], xmm2 #3.12
fld QWORD PTR [-24+rsp] #3.12
vmovsd QWORD PTR [-24+rsp], xmm3 #3.12
fld st(0) #3.12
fmul st, st(1) #3.12
fld QWORD PTR [-24+rsp] #3.12
fld st(0) #3.12
fmul st, st(1) #3.12
vmovsd QWORD PTR [-24+rsp], xmm0 #3.12
faddp st(2), st #3.12
fxch st(1) #3.12
fdivp st(3), st #3.12
fld QWORD PTR [-24+rsp] #3.12
vmovsd QWORD PTR [-24+rsp], xmm1 #3.12
fld st(0) #3.12
fmul st, st(3) #3.12
fxch st(1) #3.12
fmul st, st(2) #3.12
fld QWORD PTR [-24+rsp] #3.12
fld st(0) #3.12
fmulp st(4), st #3.12
fxch st(3) #3.12
faddp st(2), st #3.12
fxch st(1) #3.12
fmul st, st(4) #3.12
fstp QWORD PTR [-16+rsp] #3.12
fxch st(2) #3.12
fmulp st(1), st #3.12
vmovsd xmm0, QWORD PTR [-16+rsp] #3.12
fsubrp st(1), st #3.12
fmulp st(1), st #3.12
fstp QWORD PTR [-16+rsp] #3.12
vmovsd xmm1, QWORD PTR [-16+rsp] #3.12
ret
Can anyone explain what it is doing and whether it is in fact faster than gcc's approach?
I can't benchmark the code myself as I don't have ICC. The ICC assembly was generated using https://godbolt.org/g/ZXZGy2 .
As requested by the question and some comments, I ran a quick benchmark to compare the performance of the GCC and ICC compilers on this bit of C code.
Hardware setup
The machine used to run the tests has an AMD A8-5550M quad-core APU clocked at 2.1 GHz. Cache sizes are 16k for L1i, 64k for L1d and 2048K for L2.
Experimental setup
I don't own a copy of the ICC compiler, so the assembly code listed in the question was used directly for this benchmark. The two assembly outputs were assembled with NASM. Some minor syntactic changes were required to make the ICC version compatible, but nothing that changes the functionality or affects the performance in any way. A small C wrapper was written to call the two assembly functions and record the timings.
Here is a version of the code similar to the one that was used for this simple benchmark:
#include <stdio.h>
#include <complex.h>
#include <time.h>

/* Assembled separately from the listings in the question. */
extern complex double gcc_f(complex double x, complex double y);
extern complex double icc_f(complex double x, complex double y);

int main(void) {
    struct timespec stop, start;
    complex double z1 = 1.0654575 + 3.0678788768 * I;
    complex double z2 = 2.225 - 8.0 * I;

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (int i = 0; i < 1000000000; ++i) {
        icc_f(z1, z2);
        // gcc_f(z1, z2);
    }
    clock_gettime(CLOCK_MONOTONIC_RAW, &stop);

    /* The elapsed time is a signed value, so it is printed with %ld. */
    printf("Execution took %ldns\n", (long)((stop.tv_sec - start.tv_sec) * 1000000000L + (stop.tv_nsec - start.tv_nsec)));
    return 0;
}
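The program prints the total time for the whole loop; the per-call figures are that total divided by the iteration count. A minimal sketch of that arithmetic, assuming a POSIX struct timespec. The helper names elapsed_ns and ns_per_call are hypothetical and were not part of the benchmark that was actually run:

```c
#include <time.h>

/* Hypothetical helper: nanoseconds elapsed between two timestamps.
   The seconds difference is widened first so the product cannot
   overflow a 32-bit type. */
long long elapsed_ns(struct timespec start, struct timespec stop) {
    return (long long)(stop.tv_sec - start.tv_sec) * 1000000000LL
         + (long long)(stop.tv_nsec - start.tv_nsec);
}

/* Hypothetical helper: average cost of one call, in nanoseconds. */
double ns_per_call(long long total_ns, long long calls) {
    return (double)total_ns / (double)calls;
}
```

With these, the loop's result could be reported as ns_per_call(elapsed_ns(start, stop), 1000000000LL). Note that the figure includes the loop and call overhead, which is the same for both variants.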
Results
Both timings were averaged over a billion executions.
The GCC version took on average 8.8ns per execution.
The ICC version took on average 17.3ns per execution.
Therefore, the GCC output outperforms the ICC output by roughly a factor of two, at least on the hardware described above. GCC uses scalar AVX and FMA instructions on registers here, whereas the ICC output does its arithmetic on the x87 FPU and moves every operand through the stack in memory.
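As for what the two listings are doing: read instruction by instruction, both appear to evaluate the textbook formula (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c^2+d^2). The GCC listing does it in double precision with two divisions. The x87 listing first forms the reciprocal 1/(c^2+d^2) with a single fdivp and then multiplies both parts by it, keeping every intermediate on the x87 stack, presumably to benefit from the wider extended-precision format. The C sketch below restates that reading. The function names are hypothetical, long double is assumed to be the 80-bit x87 type (as on x86-64 Linux), and the sketch does not reproduce the exact rounding of the fused multiply-adds in the GCC output:

```c
#include <complex.h>

/* Reading of the GCC -ffast-math listing: plain formula in double,
   one division per component. */
double complex div_gcc_style(double complex x, double complex y) {
    double a = creal(x), b = cimag(x);
    double c = creal(y), d = cimag(y);
    double den = c * c + d * d;
    double re = (a * c + b * d) / den;
    double im = (b * c - a * d) / den;
    return re + im * I;
}

/* Reading of the ICC x87 listing: same formula, but intermediates are
   held in long double and a single reciprocal replaces two divisions. */
double complex div_icc_style(double complex x, double complex y) {
    long double a = creal(x), b = cimag(x);
    long double c = creal(y), d = cimag(y);
    long double r = 1.0L / (c * c + d * d);
    double re = (double)((a * c + b * d) * r);
    double im = (double)((b * c - a * d) * r);
    return re + im * I;
}
```

For ordinary finite inputs such as the z1 and z2 used in the benchmark, both functions should agree with the built-in x/y to within rounding error.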
As a side note, quite interestingly, if you compile with -Ofast instead of -O3, the ICC version looks more similar to the GCC version:
f:
vunpcklpd xmm4, xmm2, xmm3 #2.54
vunpcklpd xmm6, xmm0, xmm1 #2.54
vunpckhpd xmm5, xmm4, xmm4 #3.12
vmulpd xmm10, xmm4, xmm4 #3.12
vmulpd xmm8, xmm5, xmm6 #3.12
vmovddup xmm9, xmm4 #3.12
vshufpd xmm7, xmm6, xmm6, 1 #3.12
vshufpd xmm11, xmm10, xmm10, 1 #3.12
vfmaddsub213pd xmm9, xmm7, xmm8 #3.12
vaddpd xmm13, xmm10, xmm11 #3.12
vshufpd xmm12, xmm9, xmm9, 1 #3.12
vdivpd xmm0, xmm12, xmm13 #3.12
vunpckhpd xmm1, xmm0, xmm0 #3.12
ret
This alternative ICC version is significantly faster, at 9.0ns per execution on average, but is still slightly behind the GCC version. A difference that small is probably within the noise of the experimental setup.
Add the compiler flag:
-fp-model fast=2
This is the ICC equivalent of -ffast-math. (On godbolt you can check the compiler output by clicking on the warning triangle option.)