Why floating-point division speed differs with and without the -O3 optimization flag (C++/C)

Question

I was trying to measure the speed difference between single-precision and double-precision division in C++.

Here is the simple code that I have written.

#include <iostream>
#include <time.h>

int main(int argc, char *argv[])
{

  float     f_x = 45672.0;
  float     f_y = 67783.0;
  double    d_x = 45672.0;
  double    d_y = 67783.0;

  float     f_answer;
  double    d_answer;

  clock_t   start,stop;
  int       N = 200000000; //2*10^8


  start = clock();
  for (int i = 0; i < N; ++i)
  {
    f_answer = f_x/f_y;
  }
  stop = clock();
  std::cout<<"Single Precision:"<< (stop-start)/(double)CLOCKS_PER_SEC<<"    "<<f_answer <<std::endl;


  start = clock();
  for (int i = 0; i < N; ++i)
  {
    d_answer = d_x/d_y;
  }
  stop = clock();
  std::cout<<"Double precision:" <<(stop-start)/(double)CLOCKS_PER_SEC<<"   "<< d_answer<<std::endl;

  return 0;
}

When I compiled the code without optimization, as g++ test.cpp, I got the following output:

Desktop: ./a.out
Single precision:8.06    0.673797
Double precision:12.68   0.673797

But if I compile it with g++ -O3 test.cpp, I get:

Desktop: ./a.out
Single precision:0    0.673797
Double precision:0   0.673797

How did I get such a drastic performance increase? The time shown in the second case is 0 because of the low resolution of the clock() function. Did the compiler somehow detect that each for loop iteration is independent of the previous iterations?

Answer 1

Probably because the compiler optimised the loop away to a single iteration. It may even have done the division at compile time.

Check the assembly of your executable to be sure (using e.g. objdump).

Answer 2

Looking at the assembly that you get from g++ -O3 -S, it's quite apparent that the loops and all of your floating-point calculations (aside from those involving the time) were optimized out of existence:

        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB970:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        pushq   %rbx
        .cfi_def_cfa_offset 24
        .cfi_offset 3, -24
        subq    $24, %rsp
        .cfi_def_cfa_offset 48
        call    clock
        movq    %rax, %rbx
        call    clock
        movq    %rax, %rbp
        movl    $.LC0, %esi
        movl    std::cout, %edi
        subq    %rbx, %rbp
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)

See the two calls to clock, one right after the other? And before those, only some stack maintenance instructions. Yep, those loops are completely gone.

You only use f_answer or d_answer to print out an answer that can be trivially calculated at compile time, and the compiler can see that. There's no point in even having them. And if there's no point in having them, there's no point in having f_x, f_y, d_x, or d_y either. All gone.
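That the quotient is a compile-time constant can be checked directly. The following is a small sketch, assuming a C++11 (or later) compiler; the names f_quotient and d_quotient are illustrative and do not appear in the original program.

```cpp
// Both operands are literals, so each quotient is a constant expression
// that the compiler evaluates during compilation (no division at run time).
constexpr float  f_quotient = 45672.0f / 67783.0f;
constexpr double d_quotient = 45672.0  / 67783.0;

// static_assert is checked at compile time, which shows the values are
// already known before the program ever runs.
static_assert(f_quotient > 0.6737f && f_quotient < 0.6738f,
              "single-precision quotient is known at compile time");
static_assert(d_quotient > 0.6737 && d_quotient < 0.6738,
              "double-precision quotient is known at compile time");
```

If this compiles, the optimizer has everything it needs to replace the original loops with the constant they would have produced.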

To solve this, you need to have each iteration of the loop depend on the results from the last iteration. Here is my solution to this problem. I use the complex template to do some of the calculations involved in computing the Mandelbrot set:

#include <iostream>
#include <time.h>
#include <complex>

int main(int argc, char *argv[])
{
   using ::std::complex;
   using ::std::cout;

   const complex<float> f_coord(0.1, 0.1);
   const complex<double> d_coord(0.1, 0.1);

   complex<float> f_answer(0, 0);
   complex<double> d_answer(0, 0);

   clock_t   start, stop;
   const unsigned int N = 200000000; //2*10^8

   start = clock();
   for (unsigned int i = 0; i < N; ++i)
   {
      f_answer = (f_answer * f_answer) + f_coord;
   }
   stop = clock();
   cout << "Single Precision: " << (stop-start)/(double)CLOCKS_PER_SEC
        << "    " << f_answer << '\n';


   start = clock();
   for (unsigned int i = 0; i < N; ++i)
   {
      d_answer = (d_answer * d_answer) + d_coord;
   }
   stop = clock();
   cout << "Double precision: " <<(stop-start)/(double)CLOCKS_PER_SEC
        << "   " << d_answer << '\n';

   return 0;
}
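Since the question was about division and the program above mostly exercises multiplication and addition, the same loop-carried dependency idea can be sketched with a division in the loop body. This is only a sketch under assumptions: the recurrence x = a / (x + b) and the helper name chained_division are illustrative choices, not part of the answer. The starting value is passed in as an argument so that it need not be a compile-time constant.

```cpp
#include <ctime>

// Hypothetical helper: runs n dependent divisions of type T and returns the
// elapsed CPU time in seconds. Each iteration needs the previous value of x,
// so the divisions form a chain rather than n identical, independent results.
template <typename T>
double chained_division(unsigned int n, T start_value, T& result)
{
    const T a = 45672.0;
    const T b = 67783.0;
    T x = start_value;

    const std::clock_t start = std::clock();
    for (unsigned int i = 0; i < n; ++i)
    {
        x = a / (x + b);   // one division per iteration, fed by the last result
    }
    const std::clock_t stop = std::clock();

    result = x;            // hand the value back so the loop has a visible effect
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Calling chained_division<float> and chained_division<double> with the same n from main and printing the returned times and results gives the two numbers to compare. Because every division waits for the previous one, this loop tends to reflect division latency (plus one addition) rather than throughput. It is still worth checking the generated assembly, since nothing in the language strictly forbids the compiler from rearranging the work around the clock calls.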

Answer 3

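If you add the volatile qualifier to the definitions of the float and double variables, the compiler will not optimize away the unused calculations.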
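A minimal sketch of this suggestion, assuming the operands and the result are all declared volatile so that every iteration has to load both inputs and store the quotient. The helper name time_division and its template parameter are illustrative, not from the original post.

```cpp
#include <ctime>

// Hypothetical helper: times n divisions of type T and returns the elapsed
// CPU time in seconds. volatile obliges the compiler to re-read x and y and
// to write answer on every iteration, so the loop is not folded away.
template <typename T>
double time_division(unsigned int n, T& result)
{
    volatile T x = 45672.0;
    volatile T y = 67783.0;
    volatile T answer = 0;

    const std::clock_t start = std::clock();
    for (unsigned int i = 0; i < n; ++i)
    {
        answer = x / y;
    }
    const std::clock_t stop = std::clock();

    result = answer;   // copy the last quotient out for printing
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_division<float> and time_division<double> with n = 200000000 from main reproduces the original experiment in a form that should survive -O3. Two caveats: the volatile loads and stores are part of what gets timed, and the iterations remain independent of each other, so the processor may overlap successive divisions.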

3 answers

Answer 1
7 2011-11-14 19:42:16

Answer 2
5 Accepted 2011-11-14 20:21:00

Answer 3
1 2011-11-14 20:33:39
