
Optimizing a floating point division and conversion operation

I have the following formula:

float mean = (r+b+g)/3/255.0f;

I want to speed it up. The following preconditions hold:

0<= mean <= 1  and 0 <= r,g,b <= 255 and r, g, b are unsigned chars

So if I try to use the fact that >> 8 is like dividing by 256, I end up with something like:

float mean = (float)(((r+b+g)/3) >> 8);

This will always return 0. Is there a way to skip the costly float division and still end up with a mean between 0 and 1?
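To see why the shifted version always yields 0, here is a minimal sketch of the expression from the question wrapped in a function. The name shifted_mean is hypothetical and only used for illustration.

```c
/* Hypothetical helper reproducing the shifted expression from the question. */
float shifted_mean(unsigned char r, unsigned char g, unsigned char b)
{
    /* r, g and b are promoted to int, so the sum is at most 765
       and the integer average is at most 255. */
    int avg = (r + g + b) / 3;

    /* Any value in 0..255 shifted right by 8 bits is 0, because the
       shift discards exactly the bits that hold the value. */
    return (float)(avg >> 8);
}
```

The shift is applied to an integer, so the fractional part that the mean needs is thrown away before the conversion to float ever happens.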

Pre-convert your divisions into a single constant you can multiply by:

a / 3 / 255

is the same as

a * (1 / (3 * 255))

so pre-compute:

const float AVERAGE_SCALE_FACTOR = 1.f / (3.f * 255.f);

then just do

float mean = (r + g + b) * AVERAGE_SCALE_FACTOR;

since multiplying is generally a lot faster than dividing.

You are obviously comparing the mean to something else that is also between 0 and 1. How about multiplying that something by 255 instead?
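One way to read this suggestion, as a rough sketch: if the mean is only ever compared against a threshold in [0, 1], the threshold can be scaled once by 3 * 255 = 765 and each pixel compared using integer arithmetic only. The function names scale_threshold and sum_exceeds are hypothetical, and the assumption that the mean is used only in a comparison is not confirmed by the question.

```c
/* Assumption: threshold lies in [0, 1]. Called once, outside any loop. */
int scale_threshold(float threshold)
{
    /* 765 = 3 * 255. Truncation toward zero equals floor for
       non-negative values, which is all that the comparison needs. */
    return (int)(threshold * 765.0f);
}

/* Returns 1 when (r + g + b) / 765 > threshold, using no floating point.
   For an integer sum s and x >= 0, s > x holds exactly when s > floor(x). */
int sum_exceeds(unsigned char r, unsigned char g, unsigned char b,
                int scaled_threshold)
{
    return (r + g + b) > scaled_threshold;
}
```

With this arrangement the only floating point multiply happens once per threshold rather than once per pixel.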

Let's find out what a real compiler actually does with this code, shall we? I like mingw gcc 4.3 (x86). I used "gcc test.c -O2 -S -c -Wall".

This function:

float calc_mean(unsigned char r, unsigned char g, unsigned char b)
{
    return (r+b+g)/3/255.0f;
}

generates this object code (function entry and exit code removed for clarity; I hope the comments I added are roughly correct):

movzbl 12(%ebp), %edx    ; edx = g
movzbl 8(%ebp), %eax     ; eax = r
addl %eax, %edx          ; edx = r + g
movzbl 16(%ebp), %eax    ; eax = b
addl %eax, %edx          ; edx = r + g + b
movl $1431655766, %eax   ; eax = 0x55555556, the magic constant for dividing by 3
imull %edx               ; edx:eax = eax * edx; the high half (edx) is sum / 3
flds LC0                 ; load a float constant (presumably 255.0f) onto the FPU stack
pushl %edx               ; put edx on the stack
fidivrl (%esp)           ; st(0) = integer at top of stack / st(0)

Whereas this function:

float calc_mean2(unsigned char r, unsigned char g, unsigned char b)
{
    const float AVERAGE_SCALE_FACTOR = 1.f / (3.f * 255.f);
    return (r+b+g) * AVERAGE_SCALE_FACTOR;
}

generates this:

movzbl 12(%ebp), %eax
movzbl 8(%ebp), %edx
addl %edx, %eax
movzbl 16(%ebp), %edx
addl %edx, %eax
flds LC2
pushl %eax
fimull (%esp)

As you can see, the second function is better: it needs no integer divide-by-3 sequence and it multiplies instead of dividing. Compiling with -freciprocal-math converts the fidivrl from the first function into an fimull, which ought to be an improvement. But the second function is still better.

However, if you consider that a modern desktop CPU has something like an 18-stage pipeline and that it is capable of executing several of these instructions per cycle, you can see that the performance of these functions will be dominated by stalls due to data dependencies. Hopefully your program has this code snippet inlined, with some loop unrolling.
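As a sketch of what "inlined in a loop" could look like, the per-pixel computation can be marked static inline and called over a whole buffer, which gives the compiler the chance to inline and unroll it. The names pixel_mean and mean_image, and the interleaved r, g, b byte layout, are assumptions made for illustration; nothing in the question says how the pixels are stored.

```c
#include <stddef.h>

/* Per-pixel mean, written so the compiler can inline it into callers.
   The scale factor is a compile-time constant, 1 / 765. */
static inline float pixel_mean(unsigned char r, unsigned char g,
                               unsigned char b)
{
    return (float)(r + g + b) * (1.0f / (3.0f * 255.0f));
}

/* Assumption: rgb holds pixel_count pixels as interleaved r, g, b bytes,
   and out has room for pixel_count floats. */
void mean_image(const unsigned char *rgb, float *out, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        const unsigned char *p = rgb + 3 * i;
        out[i] = pixel_mean(p[0], p[1], p[2]);
    }
}
```

Each iteration is independent of the others, so an optimizing compiler is free to overlap work from neighbouring pixels instead of waiting on a single dependency chain. Whether it actually unrolls or vectorizes the loop depends on the compiler and flags.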

Considering such a small code fragment in isolation isn't ideal. It's a bit like driving a car with binoculars glued to your eye sockets. Zoom out, man!

As shown by Andrew, the original function is not optimized at all. The compiler couldn't, because you were dividing the sum first by an integer and then by a float. The integer division truncates, so that's not the same as multiplying by the aforementioned average scale factor. If you change (r+g+b)/3/255.0f into (r+g+b)/3.0f/255.0f, the compiler might optimize it to use fimull automatically.
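The difference between the two expressions can be checked directly. This is a small sketch with hypothetical function names; it only illustrates that the integer division truncates before the float division happens.

```c
/* Original form: (r + g + b) / 3 is an integer division and truncates. */
float mean_int_div(unsigned char r, unsigned char g, unsigned char b)
{
    return (r + g + b) / 3 / 255.0f;
}

/* Changed form: dividing by 3.0f keeps the fractional part. */
float mean_float_div(unsigned char r, unsigned char g, unsigned char b)
{
    return (r + g + b) / 3.0f / 255.0f;
}
```

For r = 1, g = 1, b = 0 the sum is 2, so mean_int_div computes 2 / 3 = 0 in integers and returns 0, while mean_float_div returns a small positive value. The two agree (up to rounding) only when the sum is a multiple of 3.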

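It's very common to optimize operations like this for a specific platform, rather than as an algorithm or in portable C. The VirtualDub blog is well worth reading for tips on how it's done in software targeting the x86 and x64 architectures, and it has several entries on optimizing pixel averages.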
