
Why does clang produce much faster code than gcc for this simple function involving exponentiation?

The following code, compiled with clang, runs almost 60 times faster than the same code compiled with gcc using identical compiler flags (either -O2 or -O3):

#include <iostream>
#include <math.h> 
#include <chrono>
#include <limits>

long double func(int num)
{
    long double i=0;
    long double k=0.7;

    for(int t=1; t<num; t++){
      for(int n=1; n<16; n++){
        i += pow(k,n);
      }
    }
    return i;
}


int main()
{
   volatile auto num = 3000000; // avoid constant folding

   std::chrono::time_point<std::chrono::system_clock> start, end;
   start = std::chrono::system_clock::now();

   auto i = func(num);

   end = std::chrono::system_clock::now();
   std::chrono::duration<double> elapsed = end-start;
   std::cout.precision(std::numeric_limits<long double>::max_digits10);
   std::cout << "Result " << i << std::endl;
   std::cout << "Elapsed time is " << elapsed.count() << std::endl;

   return 0;
}

I have tested this with three gcc versions (4.8.4/4.9.2/5.2.1) and two clang versions (3.5.1/3.6.1); here are the timings on my machine (for gcc 5.2.1 and clang 3.6.1):

Timing -O3:

gcc:    2.41888s
clang:  0.0396217s 

Timing -O2:

gcc:    2.41024s
clang:  0.0395114s 

Timing -O1:

gcc:    2.41766s
clang:  2.43113s

So it seems that gcc does not optimize this function at all, even at higher optimization levels. The assembly output of clang is about 100 lines longer than gcc's, and I don't think it is necessary to post it here; all I can say is that the gcc assembly output contains a call to pow which does not appear in the clang assembly, presumably because clang optimizes it into a bunch of intrinsic calls.

Since the results are identical (i.e. i = 6966764.74717416727754), the question is:

  1. Why can gcc not optimize this function when clang can?
  2. Change the value of k to 1.0 and gcc becomes just as fast; is there a floating-point arithmetic issue that gcc cannot bypass?

I did try static_casting and turned on warnings to see if there was any issue with implicit conversions, but there wasn't.

Update: for completeness, here are the results for -Ofast:

gcc:    0.00262204s
clang:  0.0013267s

The point is that gcc does not optimize the code at -O2/-O3.

From this godbolt session, clang is able to perform all the pow calculations at compile time. It knows at compile time what the values of k and n are, so it simply constant-folds the calculation:

.LCPI0_0:
    .quad   4604480259023595110     # double 0.69999999999999996
.LCPI0_1:
    .quad   4602498675187552091     # double 0.48999999999999994
.LCPI0_2:
    .quad   4599850558606658239     # double 0.34299999999999992
.LCPI0_3:
    .quad   4597818534454788671     # double 0.24009999999999995
.LCPI0_4:
    .quad   4595223380205512696     # double 0.16806999999999994
.LCPI0_5:
    .quad   4593141924544133109     # double 0.11764899999999996
.LCPI0_6:
    .quad   4590598673379842654     # double 0.082354299999999963
.LCPI0_7:
    .quad   4588468774839143248     # double 0.057648009999999972
.LCPI0_8:
    .quad   4585976388698138603     # double 0.040353606999999979
.LCPI0_9:
    .quad   4583799016135705775     # double 0.028247524899999984
.LCPI0_10:
    .quad   4581356477717521223     # double 0.019773267429999988
.LCPI0_11:
    .quad   4579132580613789641     # double 0.01384128720099999
.LCPI0_12:
    .quad   4576738892963968780     # double 0.0096889010406999918
.LCPI0_13:
    .quad   4574469401809764420     # double 0.0067822307284899942
.LCPI0_14:
    .quad   4572123587912939977     # double 0.0047475615099429958

and it unrolls the inner loop:

.LBB0_2:                                # %.preheader
    faddl   .LCPI0_0(%rip)
    faddl   .LCPI0_1(%rip)
    faddl   .LCPI0_2(%rip)
    faddl   .LCPI0_3(%rip)
    faddl   .LCPI0_4(%rip)
    faddl   .LCPI0_5(%rip)
    faddl   .LCPI0_6(%rip)
    faddl   .LCPI0_7(%rip)
    faddl   .LCPI0_8(%rip)
    faddl   .LCPI0_9(%rip)
    faddl   .LCPI0_10(%rip)
    faddl   .LCPI0_11(%rip)
    faddl   .LCPI0_12(%rip)
    faddl   .LCPI0_13(%rip)
    faddl   .LCPI0_14(%rip)
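
The effect of that folding can be sketched in C++ terms. The sketch below (function name hypothetical, constants rounded for readability from the .LCPI0_* table above) precomputes the fifteen pow(0.7, n) values, so the inner loop reduces to plain additions, just like the run of faddl instructions clang emits:

```cpp
#include <cmath>

// Sketch: roughly what func looks like after clang folds the pow calls.
// The fifteen constants correspond to pow(0.7, 1) .. pow(0.7, 15)
// evaluated in double precision, as in the constant table above.
long double func_folded(int num)
{
    static const long double p[15] = {
        0.7L,             0.49L,             0.343L,
        0.2401L,          0.16807L,          0.117649L,
        0.0823543L,       0.05764801L,       0.040353607L,
        0.0282475249L,    0.01977326743L,    0.013841287201L,
        0.0096889010407L, 0.00678223072849L, 0.004747561509943L
    };
    long double i = 0;
    for (int t = 1; t < num; t++)
        for (int n = 0; n < 15; n++)  // no pow call left: plain adds
            i += p[n];
    return i;
}
```

Calling func_folded(3000000) should reproduce essentially the same sum as the original func, since each outer iteration just adds the same fifteen constants.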

Note that clang is using a builtin function (gcc documents theirs here) to calculate pow at compile time, and if we use -fno-builtin it no longer performs this optimization.

If you change k to 1.0, then gcc is able to perform the same optimization:

.L3:
    fadd    %st, %st(1) #,
    addl    $1, %eax    #, t
    cmpl    %eax, %edi  # t, num
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    jne .L3 #,

This is a simpler case, though.

If you change the condition of the inner loop to n < 4, then gcc seems willing to optimize even when k = 0.7. As indicated in the comments on the question, if the compiler does not believe unrolling will help, it will likely be conservative about how much unrolling it does, since there is a code-size trade-off.
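
For concreteness, this is a sketch of the reduced variant (function name hypothetical) that gcc is reportedly willing to fold even with k = 0.7:

```cpp
#include <cmath>

// Sketch: same as the original func, but with the inner loop shortened
// to n < 4, i.e. only three pow calls to fold per outer iteration.
long double func_small(int num)
{
    long double i = 0;
    long double k = 0.7;
    for (int t = 1; t < num; t++)
        for (int n = 1; n < 4; n++)
            i += std::pow(k, n);
    return i;
}
```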

As indicated in the comments, I am using a modified version of the OP's code in the godbolt examples, but it does not change the underlying conclusion.

Note, as indicated in a comment above, that if we use -fno-math-errno, which stops errno from being set, gcc does apply a similar optimization.

In addition to Shafik Yaghmour's answer, I'd like to point out that the reason your use of volatile on the variable num appears to have no effect is that num is read before func is even called. The read can't be optimized away, but the function call still can be. If you declared the parameter of func to be a reference to volatile, i.e. long double func(volatile int& num), this would prevent the compiler from optimizing away the entire call to func.
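
A sketch of that change (the body is the OP's original func; only the signature differs):

```cpp
#include <cmath>

// Sketch: taking num by reference-to-volatile forces the compiler to
// assume num may change between reads, so neither the loop bound nor
// the call to func can be folded away at compile time.
long double func(volatile int& num)
{
    long double i = 0;
    long double k = 0.7;
    for (int t = 1; t < num; t++)    // num is re-read on each comparison
        for (int n = 1; n < 16; n++)
            i += std::pow(k, n);
    return i;
}
```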

