Why does clang produce much faster code than gcc for this simple function involving exponentiation?
The following code compiled with clang runs almost 60 times faster than the same code compiled with gcc with identical compiler flags (either -O2 or -O3):
#include <iostream>
#include <math.h>
#include <chrono>
#include <limits>

long double func(int num)
{
    long double i=0;
    long double k=0.7;
    for(int t=1; t<num; t++){
        for(int n=1; n<16; n++){
            i += pow(k,n);
        }
    }
    return i;
}

int main()
{
    volatile auto num = 3000000; // avoid constant folding
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();
    auto i = func(num);
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed = end-start;
    std::cout.precision(std::numeric_limits<long double>::max_digits10);
    std::cout << "Result " << i << std::endl;
    std::cout << "Elapsed time is " << elapsed.count() << std::endl;
    return 0;
}
I have tested this with three gcc versions (4.8.4, 4.9.2, 5.2.1) and two clang versions (3.5.1, 3.6.1). Here are the timings on my machine (for gcc 5.2.1 and clang 3.6.1):
Timing -O3:
gcc: 2.41888s
clang: 0.0396217s
Timing -O2:
gcc: 2.41024s
clang: 0.0395114s
Timing -O1:
gcc: 2.41766s
clang: 2.43113s
So it seems that gcc does not optimize this function at all, even at higher optimization levels. The assembly output of clang is about 100 lines longer than gcc's and I don't think it is necessary to post it here. All I can say is that the gcc assembly output contains a call to pow which does not appear in the clang assembly, presumably because clang optimizes it to a bunch of intrinsic calls.
Since the results are identical (i.e. i = 6966764.74717416727754), the question is:
1. Why can gcc not optimize this function when clang can?
2. Change the value of k to 1.0 and gcc becomes as fast. Is there a floating point arithmetic issue that gcc cannot bypass?
I did try static_casting and turned on the warnings to see if there was any issue with implicit conversions, but found nothing.
Update: For completeness, here are the results for -Ofast:
gcc: 0.00262204s
clang: 0.0013267s
The point is that gcc does not optimize the code at -O2/-O3.
From this godbolt session, clang is able to perform all the pow calculations at compile time. It knows at compile time what the values of k and n are, and it simply constant-folds the calculation:
.LCPI0_0:
.quad 4604480259023595110 # double 0.69999999999999996
.LCPI0_1:
.quad 4602498675187552091 # double 0.48999999999999994
.LCPI0_2:
.quad 4599850558606658239 # double 0.34299999999999992
.LCPI0_3:
.quad 4597818534454788671 # double 0.24009999999999995
.LCPI0_4:
.quad 4595223380205512696 # double 0.16806999999999994
.LCPI0_5:
.quad 4593141924544133109 # double 0.11764899999999996
.LCPI0_6:
.quad 4590598673379842654 # double 0.082354299999999963
.LCPI0_7:
.quad 4588468774839143248 # double 0.057648009999999972
.LCPI0_8:
.quad 4585976388698138603 # double 0.040353606999999979
.LCPI0_9:
.quad 4583799016135705775 # double 0.028247524899999984
.LCPI0_10:
.quad 4581356477717521223 # double 0.019773267429999988
.LCPI0_11:
.quad 4579132580613789641 # double 0.01384128720099999
.LCPI0_12:
.quad 4576738892963968780 # double 0.0096889010406999918
.LCPI0_13:
.quad 4574469401809764420 # double 0.0067822307284899942
.LCPI0_14:
.quad 4572123587912939977 # double 0.0047475615099429958
and it unrolls the inner loop:
.LBB0_2: # %.preheader
faddl .LCPI0_0(%rip)
faddl .LCPI0_1(%rip)
faddl .LCPI0_2(%rip)
faddl .LCPI0_3(%rip)
faddl .LCPI0_4(%rip)
faddl .LCPI0_5(%rip)
faddl .LCPI0_6(%rip)
faddl .LCPI0_7(%rip)
faddl .LCPI0_8(%rip)
faddl .LCPI0_9(%rip)
faddl .LCPI0_10(%rip)
faddl .LCPI0_11(%rip)
faddl .LCPI0_12(%rip)
faddl .LCPI0_13(%rip)
faddl .LCPI0_14(%rip)
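In source terms, the constant folding and unrolling shown in the assembly amount to roughly the following. This is only a hand-written sketch of the effect, not what clang literally generates. The names func_naive, func_folded and powers are made up for illustration, and the table is filled at run time here, whereas clang bakes the 15 values into the binary.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// The original shape: pow is called 15 times per outer iteration.
long double func_naive(int num)
{
    long double i = 0;
    long double k = 0.7;
    for (int t = 1; t < num; t++) {
        for (int n = 1; n < 16; n++) {
            i += std::pow(k, n);
        }
    }
    return i;
}

// Hypothetical hand-folded equivalent: the 15 powers are computed once,
// so each outer iteration is just 15 additions of known values.
long double func_folded(int num)
{
    std::array<long double, 15> powers{};
    long double k = 0.7;
    for (int n = 1; n < 16; n++) {
        powers[n - 1] = std::pow(k, n);
    }

    long double i = 0;
    for (int t = 1; t < num; t++) {
        for (long double p : powers) {
            i += p; // same order of additions as the naive version
        }
    }
    return i;
}
```

Both functions add the same values in the same order, so they should agree up to whatever rounding difference exists between the individual pow results.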
Note that it is using a builtin function (gcc documents theirs here) to calculate pow at compile time, and if we use -fno-builtin it no longer performs this optimization.
If you change k to 1.0 then gcc is able to perform the same optimization:
.L3:
fadd %st, %st(1) #,
addl $1, %eax #, t
cmpl %eax, %edi # t, num
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
fadd %st, %st(1) #,
jne .L3 #,
Although it is a simpler case.
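One way to see why k = 1.0 is simpler: the C standard's floating-point annex specifies that pow(1, y) returns 1 for any y, so no call can ever report an error and every term is the same constant. The sketch below (func_k1 and func_k1_reduced are illustrative names, not from the original code) shows the loop next to what it reduces to. Whether this is exactly the reasoning gcc applies is an assumption on my part.

```cpp
#include <cassert>
#include <cmath>

// Same loop nest as in the question, but with k = 1.0.
long double func_k1(int num)
{
    long double i = 0;
    long double k = 1.0;
    for (int t = 1; t < num; t++) {
        for (int n = 1; n < 16; n++) {
            i += std::pow(k, n); // always exactly 1
        }
    }
    return i;
}

// What the loop nest computes: 15 additions of 1 per outer iteration.
long double func_k1_reduced(int num)
{
    return num > 1 ? 15.0L * (num - 1) : 0.0L;
}
```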
If you change the condition for the inner loop to n < 4 then gcc seems willing to optimize when k = 0.7. As indicated in the comments to the question, if the compiler does not believe unrolling will help, it will likely be conservative about how much unrolling it does, since there is a code size trade-off.
As indicated in the comments, I am using a modified version of the OP's code in the godbolt examples, but it does not change the underlying conclusion.
Note, as indicated in a comment above, that if we use -fno-math-errno, which stops errno from being set, gcc does apply a similar optimization.
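To illustrate why errno matters here: on implementations that report math errors through errno, a call to pow is not a pure computation, because it may write to errno as a side effect. The sketch below uses hypothetical helper names (math_uses_errno, errno_after_bad_pow) and triggers a domain error to make that side effect visible. It says nothing about gcc's internals; it only shows the observable behaviour that -fno-math-errno tells the compiler it may ignore.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>

// True if this implementation reports math errors through errno.
bool math_uses_errno()
{
    return (math_errhandling & MATH_ERRNO) != 0;
}

// Calls pow with a negative base and a non-integer exponent (a domain
// error) and returns the errno value observed afterwards.
int errno_after_bad_pow()
{
    volatile double base = -1.0;     // volatile: keep the call at run time
    volatile double exponent = 0.5;
    errno = 0;
    volatile double result = std::pow(base, exponent); // result is NaN
    (void)result;
    return errno;
}
```

On a platform where math_uses_errno() is true, errno_after_bad_pow() is expected to return EDOM. On platforms that only raise floating-point exceptions, errno may be left untouched.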
In addition to Shafik Yaghmour's answer, I'd like to point out that the reason your use of volatile on the variable num appears to have no effect is that num is read before func is even called. The read can't be optimized away, but the function call can still be optimized away. If you declared the parameter of func to be a reference to volatile, i.e. long double func(volatile int& num), this would prevent the compiler from optimizing away the entire call to func.
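A minimal sketch of that variant follows, keeping the body of func from the question and changing only the parameter type. How much of the loop a given compiler still optimizes is not something this sketch demonstrates.

```cpp
#include <cassert>
#include <cmath>

// Variant of the question's func: num is accessed through a reference to
// volatile, so every evaluation of the loop condition is a volatile read.
long double func(volatile int& num)
{
    long double i = 0;
    long double k = 0.7;
    for (int t = 1; t < num; t++) {   // num is re-read on each iteration
        for (int n = 1; n < 16; n++) {
            i += std::pow(k, n);
        }
    }
    return i;
}
```

In main, the existing declaration volatile auto num = 3000000; can be passed as is, since a volatile int lvalue binds to volatile int&.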