Reason for different speeds of floating point division with and without optimization flag -O3 (C++/C)
I was trying to measure the speed difference of single precision division vs double precision division in C++.
Here is the simple code that I have written.
#include <iostream>
#include <time.h>
int main(int argc, char *argv[])
{
float f_x = 45672.0;
float f_y = 67783.0;
double d_x = 45672.0;
double d_y = 67783.0;
float f_answer;
double d_answer;
clock_t start,stop;
int N = 200000000; //2*10^8
start = clock();
for (int i = 0; i < N; ++i)
{
f_answer = f_x/f_y;
}
stop = clock();
std::cout<<"Single Precision:"<< (stop-start)/(double)CLOCKS_PER_SEC<<" "<<f_answer <<std::endl;
start = clock();
for (int i = 0; i < N; ++i)
{
d_answer = d_x/d_y;
}
stop = clock();
std::cout<<"Double precision:" <<(stop-start)/(double)CLOCKS_PER_SEC<<" "<< d_answer<<std::endl;
return 0;
}
When I compiled the code without optimization as g++ test.cpp, I got the following output:
Desktop: ./a.out
Single precision:8.06 0.673797
Double precision:12.68 0.673797
But if I compile this with g++ -O3 test.cpp, then I get:
Desktop: ./a.out
Single precision:0 0.673797
Double precision:0 0.673797
How did I get such a drastic performance increase? The time being shown in the second case is 0 because of the low resolution of the clock() function. Did the compiler somehow detect that each for loop iteration is independent of the previous iterations?
Probably because the compiler optimised the loop away to a single iteration. It may even have done the division at compile time.
Check the assembly of your executable to be sure (using e.g. objdump).
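To illustrate the compile-time possibility, here is a minimal sketch, assuming a C++11 compiler. The function names are made up for this example and the operands are the literals from the question. Because both operands are constants, the quotient is a constant expression, and the static_assert lines only compile if the compiler can evaluate the division itself.

```cpp
// Hypothetical helpers (not from the original post): the operands are the
// literal values used in the question, so each quotient is a constant expression.
constexpr float folded_float_quotient() { return 45672.0f / 67783.0f; }
constexpr double folded_double_quotient() { return 45672.0 / 67783.0; }

// These checks are evaluated by the compiler, not at run time. They show
// that the division can be carried out before the program ever runs.
static_assert(folded_float_quotient() > 0.6737f && folded_float_quotient() < 0.6739f,
              "float quotient evaluated at compile time");
static_assert(folded_double_quotient() > 0.6737 && folded_double_quotient() < 0.6739,
              "double quotient evaluated at compile time");
```

This does not prove what g++ -O3 did with the original program, only that nothing in the language forces the division to happen at run time. The assembly is still the authoritative check.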
Looking at the assembly that you get from g++ -O3 -S, it's quite apparent that the loops and all of your floating point calculations (aside from those involving the time) were optimized out of existence:
.section .text.startup,"ax",@progbits
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB970:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
pushq %rbx
.cfi_def_cfa_offset 24
.cfi_offset 3, -24
subq $24, %rsp
.cfi_def_cfa_offset 48
call clock
movq %rax, %rbx
call clock
movq %rax, %rbp
movl $.LC0, %esi
movl std::cout, %edi
subq %rbx, %rbp
call std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
See the two calls to clock, one right after the other? And before those, only some stack maintenance instructions. Yep, those loops are completely gone.
You only use f_answer or d_answer to print out an answer that can be trivially calculated at compile time, and the compiler can see that. There's no point in even having them. And if there's no point in having them, there's no point in having f_x, f_y, d_x, or d_y either. All gone.
To solve this, you need to have each iteration of the loop depend on the results from the last iteration. Here is my solution to this problem. I use the complex template to do some of the calculations involved in computing the Mandelbrot set:
#include <iostream>
#include <time.h>
#include <complex>
int main(int argc, char *argv[])
{
using ::std::complex;
using ::std::cout;
const complex<float> f_coord(0.1, 0.1);
const complex<double> d_coord(0.1, 0.1);
complex<float> f_answer(0, 0);
complex<double> d_answer(0, 0);
clock_t start, stop;
const unsigned int N = 200000000; //2*10^8
start = clock();
for (unsigned int i = 0; i < N; ++i)
{
f_answer = (f_answer * f_answer) + f_coord;
}
stop = clock();
cout << "Single Precision: " << (stop-start)/(double)CLOCKS_PER_SEC
<< " " << f_answer << '\n';
start = clock();
for (unsigned int i = 0; i < N; ++i)
{
d_answer = (d_answer * d_answer) + d_coord;
}
stop = clock();
cout << "Double precision: " <<(stop-start)/(double)CLOCKS_PER_SEC
<< " " << d_answer << '\n';
return 0;
}
If you add the volatile qualifier to the definitions of the floats and doubles, the compiler will not optimize away the unused calculations.
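A minimal sketch of the volatile approach, assuming the same operand values as in the question. The helper names time_float_division and time_double_division are hypothetical, introduced only for this example. Each helper runs n divisions and returns the elapsed CPU time in seconds as measured by clock().

```cpp
#include <ctime>

// Hypothetical helper: times n single-precision divisions.
// Returns elapsed CPU seconds and stores the last quotient in result.
double time_float_division(int n, float& result)
{
    // volatile makes every read of x and y and every write of answer an
    // observable access, so the compiler has to perform the division on
    // each iteration instead of folding it or deleting the loop.
    volatile float x = 45672.0f;
    volatile float y = 67783.0f;
    volatile float answer = 0.0f;

    const std::clock_t start = std::clock();
    for (int i = 0; i < n; ++i)
    {
        answer = x / y;
    }
    const std::clock_t stop = std::clock();

    result = answer;
    return (stop - start) / static_cast<double>(CLOCKS_PER_SEC);
}

// Same measurement for double precision.
double time_double_division(int n, double& result)
{
    volatile double x = 45672.0;
    volatile double y = 67783.0;
    volatile double answer = 0.0;

    const std::clock_t start = std::clock();
    for (int i = 0; i < n; ++i)
    {
        answer = x / y;
    }
    const std::clock_t stop = std::clock();

    result = answer;
    return (stop - start) / static_cast<double>(CLOCKS_PER_SEC);
}
```

Calling each helper with n = 200000000 from main and printing the returned time reproduces the original experiment. Note that the volatile accesses add loads and stores to every iteration, so the timings include that memory traffic and are not a pure measure of division speed.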