
It's slower to calculate integral image using CUDA than CPU code

I am implementing an integral-image calculation module using CUDA to improve performance, but it runs slower than the CPU module. Please let me know what I did wrong. The CUDA kernels and host code follow. Also, a second problem: in the kernel SumH, using texture memory is slower than global memory. imageTexture was defined as below.

texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);

// Kernels to scan the image horizontally and vertically.

// Each thread handles a horizontal band of rows and serially accumulates
// running sums (and sums of squares) along each row.
// Note: __int64 is the MSVC 64-bit integer type (long long elsewhere).
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
    int nStartY, nEndY, nIdx;
    // Thread 0 starts at row 1 so the zeroed border row is skipped.
    if (!threadIdx.x)
        nStartY = 1;
    else
        nStartY = (int)(threadIdx.x * rVSpan);
    nEndY = (int)((threadIdx.x + 1) * rVSpan);

    for (int i = nStartY; i < nEndY; i++)
    {
        for (int j = 1; j < nWidth; j++)
        {
            nIdx = i * nWidth + j;
            // The integral image is one row/column larger than the source,
            // so (nIdx - nWidth - i) maps back into the source image.
            pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
            //pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
            //pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
        }
    }
}
// Each thread handles a vertical band of columns and accumulates the
// row sums downwards, completing the 2D integral image.
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
    int nStartX, nEndX, nIdx;
    if (!threadIdx.x)
        nStartX = 1;
    else
        nStartX = (int)(threadIdx.x * rHSpan);
    nEndX = (int)((threadIdx.x + 1) * rHSpan);

    for (int i = 1; i < nHeight; i++)
    {
        for (int j = nStartX; j < nEndX; j++)
        {
            nIdx = i * nWidth + j;
            pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
        }
    }
}

// Host code

    int nW = image_width;
    int nH = image_height;
    unsigned char* pbImage;
    int* pnIntImage;
    __int64* pn64SqrIntImage;
    cudaMallocManaged(&pbImage, nH * nW);
    // assign image gray values to pbImage
    cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
    cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));

    // Each kernel is limited to a single block of at most 1024 threads;
    // rHSpan/rVSpan are the number of columns/rows each thread must cover.
    float rHSpan, rVSpan;
    int nHThreadNum, nVThreadNum;
    if (nW + 1 <= 1024)
    {
        rHSpan = 1;
        nVThreadNum = nW + 1;
    }
    else
    {
        rHSpan = (float)(nW + 1) / 1024;
        nVThreadNum = 1024;
    }
    if (nH + 1 <= 1024)
    {
        rVSpan = 1;
        nHThreadNum = nH + 1;
    }
    else
    {
        rVSpan = (float)(nH + 1) / 1024;
        nHThreadNum = 1024;
    }

    SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
    cudaDeviceSynchronize();
    SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
    cudaDeviceSynchronize();

Regarding the code that is currently in the question, there are two things I'd like to mention: launch parameters and timing methodology.

1) Launch parameters

When you launch a kernel there are two main arguments that specify the number of threads you are launching. These go between the <<< and >>> markers and are the number of blocks in the grid and the number of threads per block, as follows:

foo <<< numBlocks, numThreadsPerBlock >>> (args);

For a single kernel to be efficient on a current GPU, you can use the rule of thumb that numBlocks * numThreadsPerBlock should be at least 10,000, i.e. 10,000 pieces of work. This is a rule of thumb, so you may get good results with only 5,000 threads (it varies with GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need as a minimum. You are running 1024 threads. This is almost certainly not enough (hint: the loops inside your kernels look like scan primitives, and these can be done in parallel).
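
The per-row running sums in SumH are exactly an inclusive prefix sum (scan), for which well-known parallel implementations exist. A minimal, hedged sketch using Thrust, assuming the 8-bit pixel values have already been widened into the rows of the int array (nHeight/nWidth here match the integral-image dimensions used above):

#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// Sketch only: one parallel inclusive scan per row replaces the serial
// inner loop of SumH. Issuing one scan per row is itself suboptimal for
// small rows, but it illustrates the scan primitive.
void rowScans(int* pnIntImage, int nHeight, int nWidth)
{
    for (int row = 0; row < nHeight; ++row)
    {
        thrust::device_ptr<int> p(pnIntImage + row * nWidth);
        thrust::inclusive_scan(p, p + nWidth, p);  // in-place prefix sum
    }
}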

Further to this there are a few other things to consider. 除此之外,还需要考虑其他一些事项。

  • The number of blocks should be large compared to the number of SMs on your GPU. A Kepler K40 has 15 SMs, and to avoid a significant tail effect you'd probably want at least ~100 blocks on that GPU. Other GPUs have fewer SMs, but you haven't specified which you have, so I can't be more specific.
  • The number of threads per block should not be too small. Only so many blocks can be resident on each SM, so if your blocks are too small you will use the GPU suboptimally. Furthermore, on newer GPUs up to four warps can receive instructions on an SM simultaneously, so it is often a good idea to make block sizes a multiple of 128 (a sizing sketch follows this list).
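
Putting those two points together, a typical sizing pattern looks like the following sketch (myKernel and the one-thread-per-pixel mapping are hypothetical, not the question's kernels):

// Hypothetical sizing for a kernel that assigns one thread per pixel.
int nPixels = nW * nH;
int threadsPerBlock = 256;  // a multiple of 128
int numBlocks = (nPixels + threadsPerBlock - 1) / threadsPerBlock;  // round up
myKernel<<<numBlocks, threadsPerBlock>>>(/* args */);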

2) Timing

I'm not going to go into much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay; if that falls inside your timed region, you will see erroneously large runtimes for code intended to represent a much larger application. Similarly, data transfer between the CPU and GPU takes time. In a real application you may do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.

If you want accurate timings you must make your example more representative of the final code, or you must be sure that you are only timing the regions that will be repeated.
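
One way to do that is with CUDA events, which measure only the enclosed GPU work. A minimal sketch reusing the names from the host code above; the untimed warm-up launch is there purely to absorb the one-time initialisation cost:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Warm-up launch: pays the one-time context-initialisation cost
// so it does not pollute the measurement below.
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();

cudaEventRecord(start);
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // kernel time only, in milliseconds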

When working with CUDA there are a few things you should keep in mind.

  1. Copying from host memory to device memory is 'slow'. When you copy data from the host to the device you should do as many calculations as possible (do all the work) before copying the results back to the host.
  2. On the device there are several kinds of memory: global, shared, and registers. In speed they rank roughly global < shared < registers (registers are fastest). Note that what CUDA calls 'local' memory is per-thread spill space that physically lives in global memory, so despite the name it is not fast.
  3. Reading from consecutive memory blocks is faster than random access. When working with an array of structures you may want to transpose it into a structure of arrays (a sketch follows this list).
  4. You can always consult the CUDA Visual Profiler to show you the bottleneck of your program.
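
To illustrate point 3, a small sketch of the two layouts (the Pixel types are made up for illustration):

// Array of structures: the bytes for one pixel are adjacent, so when
// thread i reads pixels[i].r, neighbouring threads touch addresses
// 3 bytes apart and the hardware cannot coalesce the loads well.
struct PixelAoS { unsigned char r, g, b; };

// Structure of arrays: thread i reads r[i], so neighbouring threads
// touch consecutive bytes and the loads coalesce into few transactions.
struct PixelSoA
{
    unsigned char* r;
    unsigned char* g;
    unsigned char* b;
};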

The above-mentioned GTX 750 has 512 CUDA cores (these are the same as the shader units, just driven in a different mode). http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2

The task of creating integral images can only be partially parallelised, because every value in the results array depends on a large number of its predecessors. Furthermore, there is only a tiny amount of maths per memory transfer, so it is not ALU power but the unavoidable memory transfers that are likely to be the bottleneck. Such an accelerator might provide some speedup, but not a thrilling one, because the task itself does not allow it.

If you computed multiple variations of integral images on the same input data, you would be much more likely to see the "thrill", thanks to far more parallelism and a higher amount of maths per byte. But that would be a different task.

As a wild guess from a Google search, others have already fiddled with this: https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&cad=rja&uact=8&ved=0CD8QFjAKahUKEwjjnoabw8bIAhXFvhQKHbUpA1Y&url=http%3A%2F%2Fdspace.mit.edu%2Fopenaccess-disseminate%2F1721.1%2F71883&usg=AFQjCNHBbOEB_OHAzLZI9__lXO_7FPqdqA

The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess.

You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.

Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.

When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just as if the CPU did the processing. Worse, it all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.

For CUDA to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably already nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help, unless the input data is already in the GPU's memory so the GPU can do the processing without any extra copying.
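
To put rough, assumed numbers on that: a 1920x1080 8-bit image is about 2 MB of input, but the int and __int64 result arrays are roughly 8 MB and 17 MB. Moving ~27 MB over a PCIe link at an assumed ~6 GB/s costs around 4.5 ms before any arithmetic happens, whereas a CPU can stream the 2 MB input from RAM in a small fraction of a millisecond.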
