Reason for different speeds of floating point division with and without opt.flag -O3 (C++/C)

Question

I was trying to measure the speed difference between single precision division and double precision division in C++.

Here is the simple code that I have written.

#include <iostream>
#include <time.h>

int main(int argc, char *argv[])
{

  float     f_x = 45672.0;
  float     f_y = 67783.0;
  double    d_x = 45672.0;
  double    d_y = 67783.0;

  float     f_answer;
  double    d_answer;

  clock_t   start,stop;
  int       N = 200000000; //2*10^8


  start = clock();
  for (int i = 0; i < N; ++i)
  {
    f_answer = f_x/f_y;
  }
  stop = clock();
  std::cout<<"Single Precision:"<< (stop-start)/(double)CLOCKS_PER_SEC<<"    "<<f_answer <<std::endl;


  start = clock();
  for (int i = 0; i < N; ++i)
  {
    d_answer = d_x/d_y;
  }
  stop = clock();
  std::cout<<"Double precision:" <<(stop-start)/(double)CLOCKS_PER_SEC<<"   "<< d_answer<<std::endl;

  return 0;
}

When I compiled the code without optimization as g++ test.cpp, I got the following output:

Desktop: ./a.out
Single precision:8.06    0.673797
Double precision:12.68   0.673797

But if I compile it with g++ -O3 test.cpp, then I get:

Desktop: ./a.out
Single precision:0    0.673797
Double precision:0   0.673797

How did I get such a drastic performance increase? The time being shown in the second case is 0 because of the low resolution of the clock() function. Did the compiler somehow detect that each for loop iteration is independent of the previous iterations?

Answer 1

Probably because the compiler optimised the loop away to a single iteration. It may even have done the division at compile-time.

Check the disassembly of your executable to be sure (e.g. with objdump).

Answer 2

Looking at the assembly that you get from g++ -O3 -S, it's quite apparent that the loops and all of your floating point calculations (aside from those involving the time) were optimized out of existence:

        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB970:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        pushq   %rbx
        .cfi_def_cfa_offset 24
        .cfi_offset 3, -24
        subq    $24, %rsp
        .cfi_def_cfa_offset 48
        call    clock
        movq    %rax, %rbx
        call    clock
        movq    %rax, %rbp
        movl    $.LC0, %esi
        movl    std::cout, %edi
        subq    %rbx, %rbp
        call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)

See the two calls to clock, one right after the other? And before those, only some stack maintenance instructions. Yep, those loops are completely gone.

You only use f_answer or d_answer to print out an answer that can be trivially calculated at compile time, and the compiler can see that. There's no point in even having them. And if there's no point in having them, there's no point in having f_x, f_y, d_x, or d_y either. All gone.
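As a small illustration of "trivially calculated at compile time", the sketch below makes that folding explicit with constexpr (C++11 or later). The names folded_quotient and print_folded_quotient are made up for this example. An optimizer is not obliged to fold the division, but at -O3 it typically arrives at the same result without being asked.

```cpp
#include <iostream>

// Hypothetical helper: both operands are compile-time constants,
// so the compiler can evaluate the quotient itself.
constexpr float folded_quotient()
{
    return 45672.0f / 67783.0f;
}

// Initializing a constexpr variable forces compile-time evaluation,
// so no division is left to execute at run time.
constexpr float f_answer = folded_quotient();
static_assert(f_answer > 0.67f && f_answer < 0.68f, "unexpected quotient");

// Prints the precomputed value; intended to be called from main().
void print_folded_quotient()
{
    std::cout << "Single precision: " << f_answer << '\n';
}
```

Once the answer is a constant, a loop that recomputes it 2*10^8 times has no observable effect, which is why the compiler is free to delete it.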

To solve this, you need to have each iteration of the loop depend on the results from the last iteration. Here is my solution to this problem. I use the complex template to do some of the calculations involved in computing the Mandelbrot set:

#include <iostream>
#include <time.h>
#include <complex>

int main(int argc, char *argv[])
{
   using ::std::complex;
   using ::std::cout;

   const complex<float> f_coord(0.1, 0.1);
   const complex<double> d_coord(0.1, 0.1);

   complex<float> f_answer(0, 0);
   complex<double> d_answer(0, 0);

   clock_t   start, stop;
   const unsigned int N = 200000000; //2*10^8

   start = clock();
   for (unsigned int i = 0; i < N; ++i)
   {
      f_answer = (f_answer * f_answer) + f_coord;
   }
   stop = clock();
   cout << "Single Precision: " << (stop-start)/(double)CLOCKS_PER_SEC
        << "    " << f_answer << '\n';


   start = clock();
   for (unsigned int i = 0; i < N; ++i)
   {
      d_answer = (d_answer * d_answer) + d_coord;
   }
   stop = clock();
   cout << "Double precision: " <<(stop-start)/(double)CLOCKS_PER_SEC
        << "   " << d_answer << '\n';

   return 0;
}
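The same dependency idea can be applied to division itself, which is what the question set out to measure. This is only a sketch: chained_divide and time_chained_divide are hypothetical helpers, the "+ 1" is an arbitrary choice to keep the value from collapsing toward zero, and std::chrono::steady_clock is swapped in for clock() to get a finer timer.

```cpp
#include <chrono>
#include <iostream>

// Hypothetical helper: each iteration divides the previous result,
// so the loop forms a dependency chain that cannot simply be dropped.
template <typename T>
T chained_divide(T seed, T divisor, unsigned int n)
{
    T x = seed;
    for (unsigned int i = 0; i < n; ++i)
    {
        x = x / divisor + T(1);
    }
    return x;
}

// Times n chained divisions and prints the elapsed seconds and the result.
template <typename T>
void time_chained_divide(const char *label, unsigned int n)
{
    const auto start = std::chrono::steady_clock::now();
    const T result = chained_divide(T(45672.0), T(67783.0), n);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    std::cout << label << elapsed.count() << "    " << result << '\n';
}
```

From main, this could be called as time_chained_divide<float>("Single precision: ", 200000000) and again with double. Because each division waits for the previous one, the figure reflects the latency of a divide plus an add rather than division throughput. A compiler could in principle still precompute a loop whose inputs are all constants, so reading the operands at run time is the more robust choice.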

Answer 3

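If you add the volatile qualifier to the definitions of the float and double variables, the compiler will not optimize away the unused calculations.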
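A minimal sketch of that suggestion, assuming the loop from the question is wrapped in a hypothetical helper named divide_volatile. The result variable is declared volatile as well, so every store counts as an observable effect and the division has to be carried out on each iteration.

```cpp
// Hypothetical helper mirroring the loop from the question, with every
// floating-point variable declared volatile.
template <typename T>
T divide_volatile(T x_init, T y_init, unsigned int n)
{
    volatile T x = x_init;
    volatile T y = y_init;
    volatile T answer = T(0);
    for (unsigned int i = 0; i < n; ++i)
    {
        answer = x / y;  // two volatile reads and one volatile write per iteration
    }
    return answer;
}
```

Calling divide_volatile<float>(45672.0f, 67783.0f, 200000000) between two timer reads keeps the loop alive at -O3. The volatile accesses add loads and stores to every iteration, so the measured time includes that memory traffic and not just the division.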

3 answers

Answer 1: 7, 2011-11-14 19:42:16
Answer 2: 5, ACCEPTED, 2011-11-14 20:21:00
Answer 3: 1, 2011-11-14 20:33:39
