It's slower to calculate an integral image using CUDA than CPU code
I am implementing an integral image calculation module using CUDA to improve performance, but it is slower than the CPU module. Please let me know what I did wrong. The CUDA kernels and host code follow.
There is also another problem: in the kernel SumH, using texture memory is slower than global memory. imageTexture was defined as below.
texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);
// kernels to scan the image horizontally and vertically
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
    int nStartY, nEndY, nIdx;
    if (!threadIdx.x)
        nStartY = 1;
    else
        nStartY = (int)(threadIdx.x * rVSpan);
    nEndY = (int)((threadIdx.x + 1) * rVSpan);
    for (int i = nStartY; i < nEndY; i++)
    {
        for (int j = 1; j < nWidth; j++)
        {
            nIdx = i * nWidth + j;
            pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
            //pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
            //pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
        }
    }
}
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
    int nStartX, nEndX, nIdx;
    if (!threadIdx.x)
        nStartX = 1;
    else
        nStartX = (int)(threadIdx.x * rHSpan);
    nEndX = (int)((threadIdx.x + 1) * rHSpan);
    for (int i = 1; i < nHeight; i++)
    {
        for (int j = nStartX; j < nEndX; j++)
        {
            nIdx = i * nWidth + j;
            pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
        }
    }
}
// host code
int nW = image_width;
int nH = image_height;
unsigned char* pbImage;
int* pnIntImage;
__int64* pn64SqrIntImage;
cudaMallocManaged(&pbImage, nH * nW);
// assign image gray values to pbimage
cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));
float rHSpan, rVSpan;
int nHThreadNum, nVThreadNum;
if (nW + 1 <= 1024)
{
rHSpan = 1;
nVThreadNum = nW + 1;
}
else
{
rHSpan = (float)(nW + 1) / 1024;
nVThreadNum = 1024;
}
if (nH + 1 <= 1024)
{
rVSpan = 1;
nHThreadNum = nH + 1;
}
else
{
rVSpan = (float)(nH + 1) / 1024;
nHThreadNum = 1024;
}
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();
SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
cudaDeviceSynchronize();
Regarding the code that is currently in the question, there are two things I'd like to mention: launch parameters and timing methodology.
When you launch a kernel there are two main arguments that specify the number of threads you are launching. These go between the <<< and >>> markers, and are the number of blocks in the grid and the number of threads per block, as follows:
foo <<< numBlocks, numThreadsPerBlock >>> (args);
For a single kernel to be efficient on a current GPU, a rule of thumb is that numBlocks * numThreadsPerBlock should be at least 10,000, i.e. 10,000 pieces of work. This is only a rule of thumb, so you may get good results with 5,000 threads (it varies with GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need to be looking at as a minimum. You are running 1024 threads. This is almost certainly not enough (hint: the loops inside your kernels look like scan primitives, and these can be done in parallel).
Further to this, there are a few other things to consider. I'm not going to go into much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay; if this falls within your timed region, you will see erroneously large runtimes for a benchmark that is meant to represent a much larger code. Similarly, data transfer between the CPU and GPU takes time. In a real application you may only do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.
If you want to get accurate timings you must make your example more representative of the final code, or you must be sure that you are only timing the regions that will be repeated.
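One way to do that is to bracket just the kernel with CUDA events, after a warm-up launch that absorbs the one-time initialisation cost. A sketch against the question's SumH launch, with error checking omitted:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Warm-up launch: absorbs context creation and one-off initialisation.
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();

cudaEventRecord(start);
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // kernel time only, excluding init and transfers
```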
When working with CUDA there are a few things you should keep in mind.
The above-mentioned GTX750 has 512 CUDA cores (these are the same as the shader units, just driven in a different mode): http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2
The task of creating integral images is only partially parallelizable, as every value in the results array depends on a large number of its predecessors. Furthermore, there is only a tiny amount of arithmetic per memory transfer, so the ALUs will be underused and the unavoidable memory transfers are likely to be the bottleneck. Such an accelerator might provide some speed-up, but not a thrilling one, because the task itself does not allow it.
If you were to compute multiple variations of integral images on the same input data, you would be much more likely to see the "thrill", due to much higher parallelism and a larger amount of math ops, but that would be a different task.
As a wild guess from a Google search, others have already fiddled with this: https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&cad=rja&uact=8&ved=0CD8QFjAKahUKEwjjnoabw8bIAhXFvhQKHbUpA1Y&url=http%3A%2F%2Fdspace.mit.edu%2Fopenaccess-disseminate%2F1721.1%2F71883&usg=AFQjCNHBbOEB_OHAzLZI9__lXO_7FPqdqA
The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess. You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.
Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.
When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just as if the CPU did the processing. Worse, it then all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.
For CUDA to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help unless the input data is already in the GPU's memory, so the GPU can do the processing without any extra copying.