
Low performance – Patch Match: Image Processing on GPU (CUDA)

I have a performance problem: my CPU and GPU performance are almost the same.

The problem I am dealing with is patch match. I have two matrices, and I want to find where the maximum similarity between the big matrix and the small one occurs.

The matrices have binary values, 0/1 (black and white).

When I check a match between the small matrix and the big one on an i5 CPU, it takes 30 ms (using multithreading).

When I check the same match on a GeForce GT 730, it takes 33 ms.

I would expect the GPU to be faster by at least an order of magnitude. I am pretty disappointed with my current results.

I have two matrices:

1) Big – 300,000 elements (300 rows, 1000 columns)

2) Small – 50,000 elements (50 rows, 1000 columns)

The comparison is done by dividing the big matrix into 250 sub-matrices, comparing each one to the small matrix, and then finding the highest similarity.

The similarity criterion is the number of positions that are black in both matrices (the small one and the big sub-matrix), divided by the number of black pixels in the big sub-matrix.

I did the last task using the following CUDA code:

__global__ void matCompare_cuda(uint8_t *D_SUB, uint8_t *D_SMALL, float *D_RSLTS, unsigned int step, int numOfIndentations, int SUB_size, int SMALL_size)
{
    int i = 0, j = 0, success = 0, sumZero = 0;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    int LoopIndex = tid * step;

    if (tid < numOfIndentations)
    {
        for (j = 0; j < SMALL_size; j++)
        {
            i = j + LoopIndex;
            if (D_SUB[i] == 0)
            {
                sumZero++;
                if (D_SMALL[j] == 0)
                    success++;
            }
        }
        if (success > 0 && sumZero > 500)
            D_RSLTS[tid] = 100 * ((float)success / sumZero);
    }
}

The kernel launch:

int numOfIndentations = 300 - 50;  // (big.rows) - (small.rows)

int numBlock = 16;
// Note: 250 / 16 truncates to 15, so only 16 * 15 = 240 threads would be
// launched and the last 10 indentations never evaluated; round up instead.
int threadNumber = (numOfIndentations + numBlock - 1) / numBlock;

matCompare_cuda<<< numBlock, threadNumber >>>(D_SUB, D_SMALL, D_RSLTS, step, numOfIndentations, SUB_size, SMALL_size);

The CPU code:

for (i = 0; i < pixelNum; i++)
{
    if (SUB[i] == 0)
    {
        sumDots = sumDots + 1;
        if (SMALL->Image[i] == 0)
        {
            success = success + 1;
        }
    }
}

if (success > 0 && sumDots > 500)
    RSLT = ((float)success / sumDots) * 100;

Do you see any improvements that can be made to the GPU code?

A few things. Try to avoid the ifs if possible. You can write here:

sumZero += (1 - D_SUB[i])
success += (1 - D_SUB[i]) * (1 - D_SMALL[j])

However, I don't think you're going to see a huge difference here. I see two reasons.

One is that there's overhead in invoking CUDA. The data needs to be copied to the graphics card and back, which eats into some of the speedup you get. I'm not sure how much it is, but since the run time is so short it could play a role. I hope you didn't time the compilation of the kernel and other one-time things (take them out by running the code in a loop and ignoring the first few iterations).

Second, your big matrix is too small and your small matrix is too big. Because the small matrix is so big (1000 columns), I'm guessing it plays really well with the CPU cache lines. If the small matrix were smaller, you would have to go to the next line more often, which would increase the chances of breaking the cache line. The GPU uses rectangles for caching, so it wouldn't be a problem. If the big matrix were bigger, you would also increase the amount of computation required, so the GPU would start to get ahead.
