
How to optimize simple gaussian filter for performance?

I am trying to write an Android app which needs to calculate Gaussian and Laplacian pyramids for multiple full-resolution images. I wrote this in C++ with the NDK; the most critical part of the code is applying the Gaussian filter to the images, and I apply this filter horizontally and vertically.

The filter is (0.0625, 0.25, 0.375, 0.25, 0.0625). Since I am working with integers, I am calculating (1, 4, 6, 4, 1)/16:

dst[index] = ( src[index-2] + src[index-1]*4 + src[index]*6+src[index+1]*4+src[index+2])/16;

I have made a few simple optimizations, however it is still working slower than expected, and I was wondering if there are any other optimization options that I am missing.

PS: I should mention that I have tried to write this filter part with inline ARM assembly, however it gave 2x slower results.

//horizontal filter
for(unsigned y = 0; y < height; y++) {
    for(unsigned x = 2; x < width-2; x++) {
        int index = y*width+x;
        dst[index].r = (src[index-2].r + src[index+2].r + (src[index-1].r + src[index+1].r)*4 + src[index].r*6)>>4;
        dst[index].g = (src[index-2].g + src[index+2].g + (src[index-1].g + src[index+1].g)*4 + src[index].g*6)>>4;
        dst[index].b = (src[index-2].b + src[index+2].b + (src[index-1].b + src[index+1].b)*4 + src[index].b*6)>>4;
    }
}
//vertical filter
for(unsigned y = 2; y < height-2; y++) {
    for(unsigned x = 0; x < width; x++) {
        int index = y*width+x;
        dst[index].r = (src[index-2*width].r + src[index+2*width].r + (src[index-width].r + src[index+width].r)*4 + src[index].r*6)>>4;
        dst[index].g = (src[index-2*width].g + src[index+2*width].g + (src[index-width].g + src[index+width].g)*4 + src[index].g*6)>>4;
        dst[index].b = (src[index-2*width].b + src[index+2*width].b + (src[index-width].b + src[index+width].b)*4 + src[index].b*6)>>4;
    }
}

The index multiplication can be factored out of the inner loop, since the multiplication only needs to happen when y changes:

for (unsigned y = 0; y < height; y++)
{
    int index = y * width;   // computed once per row
    for (unsigned int x = 2; x < width - 2; x++)
    {
        // ... use index + x instead of recomputing y*width + x each iteration
    }
}

You may gain some speed by loading the values into local variables before you use them. This lets the processor pull them into registers and cache:

for (unsigned x = ...  
{  
    register YOUR_DATA_TYPE a, b, c, d, e;
    a = src[index - 2].r;
    b = src[index - 1].r;
    c = src[index + 0].r; // The " + 0" is to show a pattern.
    d = src[index + 1].r;
    e = src[index + 2].r;
    dest[index].r = (a + e + (b + d) * 4 + c * 6) >> 4;
    // ...  

Another trick is to "cache" the values of src so that only one new value is loaded each iteration, because the value at src[index+2] will be used up to 5 times.

So here is an example of these concepts:

//horizontal filter
for(unsigned y = 0; y < height;  y++)
{
    int index = y*width;
    register YOUR_DATA_TYPE a, b, c, d, e;
    a = src[index + 0].r;
    b = src[index + 1].r;
    c = src[index + 2].r;
    d = src[index + 3].r;
    for(unsigned x = 2; x < width-2;  x++)
    {
        e = src[index + x + 2].r;   // only one new sample is loaded per output pixel
        dest[index + x].r = (a + e + (b + d) * 4 + c * 6) >> 4;
        a = b;                      // slide the 5-sample window one pixel to the right
        b = c;
        c = d;
        d = e;
    }
}

I'm not sure how your compiler would optimize all this, but I tend to work with pointers. Assuming your struct is 3 bytes... you can start with pointers in the right places (the left edge of the filter window for the source, and the output pixel for the destination), and just move them along using constant byte offsets. I've also put an optional OpenMP directive on the outer loop, as this can also improve things.

#pragma omp parallel for
for(unsigned y = 0; y < height;  y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex+2];
    const unsigned char * spos = (const unsigned char*)&src[rowindex];
    const unsigned char *end = (const unsigned char*)&src[rowindex+width-4];

    // pixels are 3 bytes apart, so the five taps of one channel sit at
    // byte offsets 0, 3, 6, 9 and 12 within the window
    for( ; spos != end;  spos++, dpos++) {
        *dpos = (spos[0] + spos[12] + ((spos[3] + spos[9])<<2) + spos[6]*6) >> 4;
    }
}

Similarly for the vertical loop:

const int scanwidth = width * 3;
const int row1 = scanwidth;
const int row2 = row1+scanwidth;
const int row3 = row2+scanwidth;
const int row4 = row3+scanwidth;

#pragma omp parallel for
for(unsigned y = 2;  y < height-2;  y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex];
    const unsigned char * spos = (const unsigned char*)&src[rowindex] - row2;  // start two rows above
    const unsigned char *end = spos + scanwidth;

    for( ; spos != end;  spos++, dpos++) {
        *dpos = (spos[0] + spos[row4] + ((spos[row1] + spos[row3])<<2) + spos[row2]*6) >> 4;
    }
}

This is how I do convolutions, anyway. It sacrifices readability a little, and I've never tried measuring the difference; I just tend to write them that way from the outset. See if that gives you a speed-up. The OpenMP directive definitely will if you have a multicore machine, and the pointer arithmetic might.

I like the comment about using SSE for these operations.
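Since the target here is Android on ARM rather than x86, the same idea maps onto NEON intrinsics instead of SSE. Below is a minimal sketch of the horizontal pass for one 8-bit channel plane; it assumes the image has been deinterleaved into separate per-channel planes, and the function name and bounds handling are illustrative only (borders and widths not divisible by 8 would still need scalar code).

#include <arm_neon.h>
#include <stdint.h>

// Horizontal 1-4-6-4-1 pass over one 8-bit channel plane, 8 output pixels per iteration.
static void gauss5_row_u8_neon(const uint8_t *src, uint8_t *dst, unsigned width)
{
    for (unsigned x = 2; x + 10 <= width; x += 8) {
        // load the five taps and widen to 16 bit so the sum (at most 16*255) cannot overflow
        uint16x8_t m2 = vmovl_u8(vld1_u8(src + x - 2));
        uint16x8_t m1 = vmovl_u8(vld1_u8(src + x - 1));
        uint16x8_t c  = vmovl_u8(vld1_u8(src + x));
        uint16x8_t p1 = vmovl_u8(vld1_u8(src + x + 1));
        uint16x8_t p2 = vmovl_u8(vld1_u8(src + x + 2));

        uint16x8_t sum = vaddq_u16(m2, p2);                        // 1*(a+e)
        sum = vaddq_u16(sum, vshlq_n_u16(vaddq_u16(m1, p1), 2));   // + 4*(b+d)
        sum = vaddq_u16(sum, vmulq_n_u16(c, 6));                   // + 6*c

        vst1_u8(dst + x, vshrn_n_u16(sum, 4));                     // >>4 and narrow back to 8 bit
    }
}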

Some of the more obvious optimizations exploit the symmetry of the kernel:

a=*src++;  b=*src++;  c=*src++;  d=*src++;  e=*src++; // init the 5-sample window

LOOP (n/5) times:
z=(a+e)+((b+d)<<2)+c*6;  *dst++=z>>4;  // then reuse the local variables
a=*src++;
z=(b+a)+((c+e)<<2)+d*6;  *dst++=z>>4;  // each sample is loaded from memory only once...
b=*src++;
z=(c+b)+((d+a)<<2)+e*6;  *dst++=z>>4;
c=*src++;
z=(d+c)+((e+b)<<2)+a*6;  *dst++=z>>4;
d=*src++;
z=(e+d)+((a+c)<<2)+b*6;  *dst++=z>>4;
e=*src++;

The second thing is that one can perform multiple additions using a single integer. When the values to be filtered are unsigned, one can fit two channels into a single 32-bit integer (or four channels into a 64-bit integer); it's the poor man's SIMD.

a=  0x[0011][0034]  <-- two values packed into one word
b=  0x[0031][008a]
----------------------
sum    0042  00be
>>4    0004  200b   <-- low bits of the high field leak in, so mask them off
mask   00ff  00ff
-------------------
       0004  000b   <-- result

(The simulated SIMD above shows one addition followed by a shift right by 4.)
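For concreteness, the same packed arithmetic in plain C (a small illustration of the diagram above, with made-up example values, not part of the original answer):

unsigned int a = (0x0011u << 16) | 0x0034u;    // two 8-bit channels in one 32-bit word
unsigned int b = (0x0031u << 16) | 0x008au;
unsigned int sum = a + b;                      // 0x004200be: both channels added at once
unsigned int res = (sum >> 4) & 0x00ff00ffu;   // 0x0004000b: shift once, then mask per channel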

Here's a kernel that calculates 3 rgb operations in parallel (easy to modify for 6 rgb operations on 64-bit architectures...):

#define MASK (255+(255<<10)+(255<<20))   // three 8-bit fields, 2 guard bits between them
#define KERNEL(a,b,c,d,e) { \
 a=((a+e+(c<<1))>>2) & MASK; a=(a+b+c+d)>>2 & MASK; *DATA++ = a; a=DATA[4]; }
// two-step evaluation of (a + 4b + 6c + 4d + e)/16; the result is stored in place
// and the next input sample is fetched into the variable that is no longer needed

void calc_5_rgbs(unsigned int *DATA)
{
   register unsigned int a = DATA[0], b=DATA[1], c=DATA[2], d=DATA[3], e=DATA[4];
   KERNEL(a,b,c,d,e);
   KERNEL(b,c,d,e,a);
   KERNEL(c,d,e,a,b);
   KERNEL(d,e,a,b,c);
   KERNEL(e,a,b,c,d);
}

Works best on ARM and on 64-bit IA with 16 registers... It needs heavy assembler optimization to overcome the register shortage on 32-bit IA (e.g. using ebp as a GPR), and just because of that it's an in-place algorithm...

There are just 2 guard bits between every 8 bits of data, which is just enough to get exactly the same result as the plain integer calculation: the largest intermediate value per field, a+e+2*c, is at most 4*255 = 1020, which still fits in the 10 bits each field occupies.
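For reference, a minimal sketch of how one pixel could be packed into and out of that 10-bit-spaced layout (these helpers are my own illustration of the MASK layout above, not part of the original answer):

static inline unsigned int pack_rgb(unsigned char r, unsigned char g, unsigned char b)
{
    // fields at bits 0, 10 and 20, matching MASK = 255 + (255<<10) + (255<<20)
    return (unsigned int)r | ((unsigned int)g << 10) | ((unsigned int)b << 20);
}

static inline void unpack_rgb(unsigned int v, unsigned char *r, unsigned char *g, unsigned char *b)
{
    *r = v & 255;
    *g = (v >> 10) & 255;
    *b = (v >> 20) & 255;
}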

And BTW: it's faster to just run through the array byte by byte than by r,g,b elements:

unsigned char *s=(unsigned char *) source_array;
unsigned char *d=(unsigned char *) dest_array;
// with 3 bytes per pixel, same-channel neighbours are 3, 6, 9 and 12 bytes apart
for (int j=0; j<3*N-12; j++)
    d[j]=(s[j]+s[j+12]+s[j+6]*6+(s[j+3]+s[j+9])*4)>>4;
