简体   繁体   中英

Optimizing a floating point division and conversion operation

I have the following formula

float mean = (r+b+g)/3/255.0f;

I want to speed it up. There are the following preconditions

0<= mean <= 1  and 0 <= r,g,b <= 255 and r, g, b are unsigned chars

so if I try to use the fact that >> 8 is like dividing by 256 and I use something like

float mean = (float)(((r+b+g)/3) >> 8);

this will always return 0. Is there a way to skip the costly float division and still end up with a mean between 0 and 1?

Pre-convert your divisions into a multiplicable constant:

a / 3 / 255

is the same as

a * (1 / (3 * 255))

so pre-compute:

const float AVERAGE_SCALE_FACTOR = 1.f / (3.f * 255.f)

then just do

float mean = (r + g + b) * AVERAGE_SCALE_FACTOR;

since multiplying is generally a lot faster than dividing.

你明显将平均值与其他东西进行比较,也就是在0和1之间。你怎么把这个东西乘以255呢?

Lets find out what a real compiler actually does with this code shall we? I like mingw gcc 4.3 (x86). I used "gcc test.c -O2 -S -c -Wall"

This function:

float calc_mean(unsigned char r, unsigned char g, unsigned char b)
{
    return (r+b+g)/3/255.0f;
}

generates this object code (function entry and exit code removed for clarity. I hope the comments I added are roughly correct):

movzbl 12(%ebp), %edx    ; edx = g
 movzbl 8(%ebp), %eax     ; eax = r
 addl %eax, %edx        ; edx = eax + edx
 movzbl 16(%ebp), %eax    ; eax = b
 addl %eax, %edx        ; edx = eax + edx
 movl $1431655766, %eax ; 
 imull %edx              ; edx *= a const
 flds LC0               ; put a const in the floating point reg
 pushl %edx              ; put edx on the stack
 fidivrl (%esp)            ; float reg /= top of stack

Whereas this function:

float calc_mean2(unsigned char r, unsigned char g, unsigned char b)
{
    const float AVERAGE_SCALE_FACTOR = 1.f / (3.f * 255.f);
    return (r+b+g) * AVERAGE_SCALE_FACTOR;
}

generates this:

movzbl 12(%ebp), %eax    
 movzbl 8(%ebp), %edx
 addl %edx, %eax
 movzbl 16(%ebp), %edx
 addl %edx, %eax
 flds LC2
 pushl %eax
 fimull (%esp)

As you can see, the second function is better. Compiling with -freciprocal-math converts the fidivrl from the first function into an fimull, which ought to be an improvement. But the second function is still better.

However, if you consider that a modern desktop CPU has something like an 18 stage pipeline and that it is capable of executing several of these instructions per cycle, you can see that the performance of these functions will be dominated by stalls due to data dependencies. Hopefully your program has this code snippet inlined and with some loop unrolling.

Considering such a small code fragment in isolation isn't ideal. It's a bit like driving a car with binoculars glued to your eye sockets. Zoom out man!

As shown by Andrew, the original function is not optimized at all. The compiler couldn't because you were dividing the sum first by an integer and then by a float. That's not the same as multiplying by the aforementioned average scale factor. If you would change (r+g+b)/3/255.0f into (r+g+b)/3.0f/255.0f, the compiler might optimize it to use fimull automatically.

为平台优化此类操作是非常常见的,而不是作为算法或可移植的C. 虚拟配音博客非常值得阅读有关如何在针对x86和x64架构的软件中完成的提示,并且有几个关于优化像素平均值的条目1 2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM