
CUDA Optimization

I developed a pincushion distortion correction using CUDA to support real-time processing: more than 40 fps for 3680*2456 image sequences.

But it takes 130 ms if I use CUDA (nVIDIA GeForce GT 610, 2 GB DDR3).

It takes only 60 ms if I use the CPU with OpenMP (Core i7 3.4 GHz, quad-core).

Please tell me what I can do to speed it up. Thanks.

The full source can be downloaded here: https://drive.google.com/file/d/0B9SEJgsu0G6QX2FpMnRja0o5STA/view?usp=sharing and https://drive.google.com/file/d/0B9SEJgsu0G6QOGNPMmVQLWpSb2c/view?usp=sharing

The code is as follows.

__global__
void undistort(int N, float k, int width, int height, int depth, int pitch, float R, float L, unsigned char* in_bits, unsigned char* out_bits)
{
    // Get the Index of the Array from GPU Grid/Block/Thread Index and Dimension.
    int i, j;
    i = blockIdx.y * blockDim.y + threadIdx.y;
    j = blockIdx.x * blockDim.x + threadIdx.x;

    // If Out of Array
    if (i >= height || j >= width)
    {
        return;
    }

    // Calculating Undistortion Equation.
    // On the CPU we used fast approximations of atan and sqrt, which made it about 2 times faster.
    // On the GPU the built-in functions are fast enough, so no approximation is needed.

    int cx = width  * 0.5;
    int cy = height * 0.5;

    int xt = j - cx;
    int yt = i - cy;

    float distance = sqrt((float)(xt*xt + yt*yt));
    float r = distance*k / R;

    float theta = 1;
    if (r == 0)
        theta = 1;
    else
        theta = atan(r)/r;

    theta = theta*L;

    float tx = theta*xt + cx;
    float ty = theta*yt + cy;

    // When we correct the frame, its size will be greater than Original.
    // So We should Crop it.
    if (tx < 0)
        tx = 0;
    if (tx >= width)
        tx = width - 1;
    if (ty < 0)
        ty = 0;
    if (ty >= height)
        ty = height - 1;

    // Output the Result.
    int ux = (int)(tx);
    int uy = (int)(ty);

    tx = tx - ux;
    ty = ty - uy;

    unsigned char *p = (unsigned char*)out_bits + i*pitch + j*depth;
    unsigned char *q00 = (unsigned char*)in_bits + uy*pitch + ux*depth;
    unsigned char *q01 = q00 + depth;
    unsigned char *q10 = q00 + pitch;
    unsigned char *q11 = q10 + depth;

    unsigned char newVal[4] = {0};
    for (int k = 0; k < depth; k++)
    {
        newVal[k] = (q00[k]*(1-tx)*(1-ty) + q01[k]*tx*(1-ty) + q10[k]*(1-tx)*ty + q11[k]*tx*ty);
        memcpy(p + k, &newVal[k], 1);
    }

}

void wideframe_correction(char* bits, int width, int height, int depth)
{
    // Find the device.
    // Initialize the nVIDIA Device.
    cudaSetDevice(0);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);

    // This works for Calculating GPU Time.
    cudaProfilerStart();

    // This works for Measuring Total Time
    long int dwTime = clock();

    // Setting Distortion Parameters
    // Note that multiplying by 0.5 is faster than dividing by 2.
    int cx = (int)(width * 0.5);
    int cy = (int)(height * 0.5);
    float k = -0.73f;
    float R = sqrt((float)(cx*cx + cy*cy));

    // Set the Radius of the Result.
    float L = (float)(width<height ? width:height);
    L = L/2.0f;
    L = L/R;
    L = L*L*L*0.3333f;
    L = 1.0f/(1-L);

    // Create the GPU Memory Pointers.
    unsigned char* d_img_in = NULL;
    unsigned char* d_img_out = NULL;

    // Allocate the GPU Memory2D with pitch for fast performance.
    size_t pitch;
    cudaMallocPitch( (void**) &d_img_in, &pitch, width*depth, height );
    cudaMallocPitch( (void**) &d_img_out, &pitch, width*depth, height );
    _tprintf(_T("\nPitch : %d\n"), pitch);

    // Copy RAM data to VRAM.
    cudaMemcpy2D( d_img_in, pitch, 
            bits, width*depth, width*depth, height, 
            cudaMemcpyHostToDevice );
    cudaMemcpy2D( d_img_out, pitch, 
            bits, width*depth, width*depth, height, 
            cudaMemcpyHostToDevice );

    // Create Variables for Timing
    cudaEvent_t startEvent, stopEvent;
    cudaError_t err = cudaEventCreate(&startEvent, 0);
    assert( err == cudaSuccess );
    err = cudaEventCreate(&stopEvent, 0);
    assert( err == cudaSuccess );

    // Execution of the version using global memory
    float elapsedTime;
    cudaEventRecord(startEvent);

    // Process image
    dim3 dGrid(width / BLOCK_WIDTH + 1, height / BLOCK_HEIGHT + 1);
    dim3 dBlock(BLOCK_WIDTH, BLOCK_HEIGHT);

    undistort<<< dGrid, dBlock >>> (width*height, k,  width, height, depth, pitch, R, L, d_img_in, d_img_out);

    cudaThreadSynchronize();
    cudaEventRecord(stopEvent);
    cudaEventSynchronize( stopEvent );

    // Estimate the GPU Time.
    cudaEventElapsedTime( &elapsedTime, startEvent, stopEvent);

    // Calculate the Total Time.
    dwTime = clock() - dwTime;

    // Save Image data from VRAM to RAM
    cudaMemcpy2D( bits, width*depth, 
        d_img_out, pitch, width*depth, height,
        cudaMemcpyDeviceToHost );

    _tprintf(_T("GPU Processing Time(ms) : %d\n"), (int)elapsedTime);
    _tprintf(_T("VRAM Memory Read/Write Time(ms) : %d\n"), dwTime - (int)elapsedTime);
    _tprintf(_T("Total Time(ms) : %d\n"), dwTime );

    // Free GPU Memory
    cudaFree(d_img_in);
    cudaFree(d_img_out);
    cudaProfilerStop();
    cudaDeviceReset();
}

I haven't read the source code, but there are some things you cannot get around.

Your GPU has nearly the same performance as your CPU:

Adapt the following figures to your actual GPU/CPU models.

Specification | GPU          | CPU
--------------|--------------|----------
Bandwidth     | 14.4 GB/s    | 25.6 GB/s
GFLOPS        | 155 (FMA)    | 135

We can conclude that for memory-bound kernels your GPU will never be faster than your CPU.
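
As a rough illustration of what "memory-bound" means here, a back-of-envelope lower bound for the kernel alone can be computed from the frame size and the quoted bandwidth (a sketch under assumed numbers: 3 bytes per pixel, one full read plus one full write in device memory, and the 14.4 GB/s figure above; the PCIe copies done by cudaMemcpy2D come on top of this):

#include <cstdio>

int main()
{
    const double width  = 3680.0;
    const double height = 2456.0;
    const double bytes_per_pixel = 3.0;      // assumption: 24-bit frames
    const double bandwidth = 14.4e9;         // GT 610 DDR3, bytes per second

    // One full read of the source plus one full write of the result.
    double traffic = width * height * bytes_per_pixel * 2.0;
    printf("Memory-bound lower bound for the kernel: %.1f ms\n",
           traffic / bandwidth * 1e3);       // roughly 3.8 ms
    return 0;
}

So the kernel itself is unlikely to account for the 130 ms; on a card with this little bandwidth, the host-device transfers and launch/context overheads are likely to dominate the total.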

GPU information can be found here: http://www.nvidia.fr/object/geforce-gt-610-fr.html#pdpContent=2

CPU information can be found here: http://ark.intel.com/products/75123/Intel-Core-i7-4770K-Processor-8M-Cache-up-to-3_90-GHz?q=Intel%20Core%20i7%204770K

and here: http://www.ocaholic.ch/modules/smartsection/item.php?page=6&itemid=1005

One does not simply optimize code just by looking at the source. First of all, you should use the NVIDIA Visual Profiler https://developer.nvidia.com/nvidia-visual-profiler and see which part of your GPU code is the one taking too much time. You might wish to write a unit test first, however, just to be sure that only the part of your project under investigation is being measured.

Additionally, you can use Callgrind http://valgrind.org/docs/manual/cl-manual.html to test your CPU code's performance.

In general, it is not very surprising that your "optimized" GPU code is slower than the "not optimized" one. CUDA cores are usually several times slower than CPU cores, and you have to introduce a lot of parallelism to see a significant speed-up.

EDIT, in response to your comment:

As a unit testing framework I strongly recommend GoogleTest. Here you can learn how to use it. Apart from its obvious functionality (code testing), it allows you to run only specific methods from your class interfaces for performance analysis.
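
As a minimal sketch of what such a test could look like (assumptions: the question's wideframe_correction is visible to the test, and the zero-filled synthetic frame is just a placeholder for a real fixture with known pixel values):

#include <gtest/gtest.h>
#include <vector>

// Declaration mirroring the function from the question.
void wideframe_correction(char* bits, int width, int height, int depth);

// Runs only the correction step on one synthetic frame, so a profiler
// (or a simple timer around this test) measures nothing but that code path.
TEST(WideframeCorrection, ProcessesFullFrame)
{
    const int width = 3680, height = 2456, depth = 3;
    std::vector<char> frame(static_cast<size_t>(width) * height * depth, 0);

    wideframe_correction(frame.data(), width, height, depth);

    SUCCEED();  // with a real fixture, assert on expected pixel values here
}

Built with gtest_main, the binary can then be restricted to this single test while profiling, e.g. with --gtest_filter=WideframeCorrection.*.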

In general, the NVIDIA profiler is just a tool that runs your code and tells you how much time each of your kernels consumes. Please look at their documentation.

By "lot of parallelism" I meant: on your processor you can run 8 x 3.4GHz threads, your GPU has one SM (streaming multiprocessor) with 810MHz clock, lets say 1024 threads per SM (I do not have exact data, but you can run deviceQuery Nvidia script to know the exact parameters), therefore if your GPU code can run (3.4*8)/0.81 = 33 computations in parallel, you will achieve exactly nothing. 我所说的“大量并行性”是指:在您的处理器上,您可以运行8个3.4GHz线程,您的GPU拥有一个SM(流式多处理器),时钟频率为810MHz,每个SM可以有1024个线程(我没有确切的数据,但是您可以运行deviceQuery Nvidia脚本以了解确切的参数),因此,如果您的GPU代码可以并行运行(3.4 * 8)/0.81 = 33个计算,您将一无所获。 Execution time of your CPU and GPU code will be the same (neglecting L-cache GPU memory copying, which is expensive). 您的CPU和GPU代码的执行时间将相同(忽略L-cache GPU内存复制,这很昂贵)。 Conclusion: your GPU code should be able to compute at least ~ 40 operations in parallel to introduce any speed-up. 结论:您的GPU代码应至少能够并行​​计算约40个操作,以实现任何提速。 On the other hand, lets say that you are able to fully use your GPU potential and you can keep all 1024 on your SM busy all the time. 另一方面,可以说您可以充分利用GPU的潜力,并且可以使SM上的所有1024始终保持忙碌状态。 In that case your code will run only (0.81*1024)/(8*3.4) = 30 times faster (approximately, remember that we neglect GPU L-cache operations), which is impossible in most cases, because usually you are not able to parallelize your serial code with such efficiency. 在这种情况下,您的代码将仅以(0.81 * 1024)/(8 * 3.4)= 30倍的速度运行(大约,请记住,我们忽略了GPU L-cache操作),这在大多数情况下是不可能的,因为通常您无法以这种效率并行化您的串行代码。

Wish you good luck with your research!

Yes, put nvprof to good use; it's a great tool.

What I could see from your code:

1. Consider using linear thread blocks instead of flat blocks; it could save some integer operations.
2. Manual correction of image borders and/or thread indices leads to massive divergence and/or hurts coalescing. Consider using texture fetches and/or pre-padded data.
3. memcpy of a single value from inside the kernel is generally a bad idea.
4. Try to minimize type conversions.
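
A rough sketch of what points 1 and 3 could look like for this kernel (assumptions: one thread per output pixel and the same parameters as undistort; the bilinear interpolation is replaced by a plain nearest-neighbour copy purely to keep the sketch short, while the distortion math is unchanged):

__global__
void undistort_linear(float k, int width, int height, int depth, int pitch,
                      float R, float L,
                      const unsigned char* __restrict__ in_bits,
                      unsigned char* __restrict__ out_bits)
{
    // Linear (1D) indexing: one thread per pixel.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= width * height)
        return;

    int i = idx / width;        // row
    int j = idx - i * width;    // column

    float cx = width  * 0.5f;
    float cy = height * 0.5f;
    float xt = j - cx;
    float yt = i - cy;

    float r     = sqrtf(xt * xt + yt * yt) * k / R;
    float theta = (r == 0.0f ? 1.0f : atanf(r) / r) * L;

    // Clamp the source coordinates to the image.
    int ux = min(max(__float2int_rd(theta * xt + cx), 0), width  - 1);
    int uy = min(max(__float2int_rd(theta * yt + cy), 0), height - 1);

    const unsigned char* q = in_bits  + uy * pitch + ux * depth;
    unsigned char*       p = out_bits + i  * pitch + j  * depth;

    for (int c = 0; c < depth; ++c)
        p[c] = q[c];            // plain store instead of per-byte memcpy
}

It would be launched with something like undistort_linear<<<(width*height + 255) / 256, 256>>>(k, width, height, depth, (int)pitch, R, L, d_img_in, d_img_out). Texture fetches would additionally handle the border clamping and the interpolation in hardware.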
