
Why does clang produce much faster code than gcc for this simple function involving exponentiation?

The following code, compiled with clang, runs almost 60 times faster than the same code compiled with gcc using identical compiler flags (either -O2 or -O3):

#include <iostream>
#include <math.h> 
#include <chrono>
#include <limits>

long double func(int num)
{
    long double i=0;
    long double k=0.7;

    for(int t=1; t<num; t++){
      for(int n=1; n<16; n++){
        i += pow(k,n);
      }
    }
    return i;
}


int main()
{
   volatile auto num = 3000000; // avoid constant folding

   std::chrono::time_point<std::chrono::system_clock> start, end;
   start = std::chrono::system_clock::now();

   auto i = func(num);

   end = std::chrono::system_clock::now();
   std::chrono::duration<double> elapsed = end-start;
   std::cout.precision(std::numeric_limits<long double>::max_digits10);
   std::cout << "Result " << i << std::endl;
   std::cout << "Elapsed time is " << elapsed.count() << std::endl;

   return 0;
}

I have tested this with three gcc versions (4.8.4/4.9.2/5.2.1) and two clang versions (3.5.1/3.6.1); here are the timings on my machine (for gcc 5.2.1 and clang 3.6.1):

Timing -O3:

gcc:    2.41888s
clang:  0.0396217s 

Timing -O2:

gcc:    2.41024s
clang:  0.0395114s 

Timing -O1:

gcc:    2.41766s
clang:  2.43113s

So it seems that gcc does not optimize this function at all, even at higher optimization levels. The assembly output of clang is about 100 lines longer than gcc's, and I don't think it is necessary to post it here; all I can say is that the gcc assembly output contains a call to pow which does not appear in the clang assembly, presumably because clang optimizes it into a bunch of intrinsic calls.

Since the results are identical (i.e. i = 6966764.74717416727754), the question is:

  1. Why can gcc not optimize this function when clang can?
  2. Change the value of k to 1.0 and gcc becomes just as fast; is there a floating-point arithmetic issue that gcc cannot bypass?

I did try static_casting and turned on warnings to see if there was any issue with implicit conversions, but there wasn't.

Update: for completeness, here are the results for -Ofast:

gcc:    0.00262204s
clang:  0.0013267s

The point is that gcc does not optimize the code at -O2/-O3.

From this godbolt session, clang is able to perform all the pow calculations at compile time. It knows at compile time what the values of k and n are, so it simply constant-folds the calculation:

.LCPI0_0:
    .quad   4604480259023595110     # double 0.69999999999999996
.LCPI0_1:
    .quad   4602498675187552091     # double 0.48999999999999994
.LCPI0_2:
    .quad   4599850558606658239     # double 0.34299999999999992
.LCPI0_3:
    .quad   4597818534454788671     # double 0.24009999999999995
.LCPI0_4:
    .quad   4595223380205512696     # double 0.16806999999999994
.LCPI0_5:
    .quad   4593141924544133109     # double 0.11764899999999996
.LCPI0_6:
    .quad   4590598673379842654     # double 0.082354299999999963
.LCPI0_7:
    .quad   4588468774839143248     # double 0.057648009999999972
.LCPI0_8:
    .quad   4585976388698138603     # double 0.040353606999999979
.LCPI0_9:
    .quad   4583799016135705775     # double 0.028247524899999984
.LCPI0_10:
    .quad   4581356477717521223     # double 0.019773267429999988
.LCPI0_11:
    .quad   4579132580613789641     # double 0.01384128720099999
.LCPI0_12:
    .quad   4576738892963968780     # double 0.0096889010406999918
.LCPI0_13:
    .quad   4574469401809764420     # double 0.0067822307284899942
.LCPI0_14:
    .quad   4572123587912939977     # double 0.0047475615099429958

and it unrolls the inner loop:

.LBB0_2:                                # %.preheader
    faddl   .LCPI0_0(%rip)
    faddl   .LCPI0_1(%rip)
    faddl   .LCPI0_2(%rip)
    faddl   .LCPI0_3(%rip)
    faddl   .LCPI0_4(%rip)
    faddl   .LCPI0_5(%rip)
    faddl   .LCPI0_6(%rip)
    faddl   .LCPI0_7(%rip)
    faddl   .LCPI0_8(%rip)
    faddl   .LCPI0_9(%rip)
    faddl   .LCPI0_10(%rip)
    faddl   .LCPI0_11(%rip)
    faddl   .LCPI0_12(%rip)
    faddl   .LCPI0_13(%rip)
    faddl   .LCPI0_14(%rip)
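
The effect of that folding can be sketched in C++ terms. The sketch below (function name hypothetical, constants rounded for readability from the .LCPI0_* table above) precomputes the fifteen pow(0.7, n) values, so the inner loop reduces to plain additions, just like the run of faddl instructions clang emits:

```cpp
#include <cmath>

// Sketch: roughly what func looks like after clang folds the pow calls.
// The fifteen constants correspond to pow(0.7, 1) .. pow(0.7, 15)
// evaluated in double precision, as in the constant table above.
long double func_folded(int num)
{
    static const long double p[15] = {
        0.7L,             0.49L,             0.343L,
        0.2401L,          0.16807L,          0.117649L,
        0.0823543L,       0.05764801L,       0.040353607L,
        0.0282475249L,    0.01977326743L,    0.013841287201L,
        0.0096889010407L, 0.00678223072849L, 0.004747561509943L
    };
    long double i = 0;
    for (int t = 1; t < num; t++)
        for (int n = 0; n < 15; n++)  // no pow call left: plain adds
            i += p[n];
    return i;
}
```

Calling func_folded(3000000) should reproduce essentially the same sum as the original func, since each outer iteration just adds the same fifteen constants.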

Note that clang is using a builtin function (gcc documents theirs here) to calculate pow at compile time, and if we use -fno-builtin it no longer performs this optimization.

If you change k to 1.0, then gcc is able to perform the same optimization:

.L3:
    fadd    %st, %st(1) #,
    addl    $1, %eax    #, t
    cmpl    %eax, %edi  # t, num
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    fadd    %st, %st(1) #,
    jne .L3 #,

This is a simpler case, though.

If you change the condition of the inner loop to n < 4, then gcc seems willing to optimize even when k = 0.7. As indicated in the comments on the question, if the compiler does not believe unrolling will help, it will likely be conservative about how much unrolling it does, since there is a code-size trade-off.
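
For concreteness, this is a sketch of the reduced variant (function name hypothetical) that gcc is reportedly willing to fold even with k = 0.7:

```cpp
#include <cmath>

// Sketch: same as the original func, but with the inner loop shortened
// to n < 4, i.e. only three pow calls to fold per outer iteration.
long double func_small(int num)
{
    long double i = 0;
    long double k = 0.7;
    for (int t = 1; t < num; t++)
        for (int n = 1; n < 4; n++)
            i += std::pow(k, n);
    return i;
}
```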

As indicated in the comments, I am using a modified version of the OP's code in the godbolt examples, but it does not change the underlying conclusion.

Note, as indicated in a comment above, that if we use -fno-math-errno, which stops errno from being set, gcc does apply a similar optimization.

In addition to Shafik Yaghmour's answer, I'd like to point out that the reason your use of volatile on the variable num appears to have no effect is that num is read before func is even called. The read can't be optimized away, but the function call still can be. If you declared the parameter of func to be a reference to volatile, i.e. long double func(volatile int& num), this would prevent the compiler from optimizing away the entire call to func.
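
A sketch of that change (the body is the OP's original func; only the signature differs):

```cpp
#include <cmath>

// Sketch: taking num by reference-to-volatile forces the compiler to
// assume num may change between reads, so neither the loop bound nor
// the call to func can be folded away at compile time.
long double func(volatile int& num)
{
    long double i = 0;
    long double k = 0.7;
    for (int t = 1; t < num; t++)    // num is re-read on each comparison
        for (int n = 1; n < 16; n++)
            i += std::pow(k, n);
    return i;
}
```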

