CUDA threads and blocks are not working as expected
I am writing an image-processing algorithm in CUDA using Visual Studio 2010, and I have run into a problem with CUDA threads and blocks. My sample C and CUDA code is below. The C code works fine, but the CUDA code does not produce the correct output.
My C code:
#include <iostream>
using std::cout;

void checkGpuBlockValue(unsigned int *a, unsigned int *b, int length)
{
    for (int i = 0; i < length; i++) {
        b[i] = a[i] + i;
    }
}

int main()
{
    const int range = 1000;
    unsigned int *a = new unsigned int[range];
    unsigned int *b = new unsigned int[range];
    for (int i = 0; i < range; i++)
    {
        a[i] = i;
    }
    checkGpuBlockValue(a, b, range);
    for (int j = 0; j < range; j++)
    {
        cout << "b[" << j << "] = " << b[j] << std::endl;
    }
}
Output:
b[0] = 0
b[1] = 2
b[2] = 4
b[3] = 6
b[4] = 8
.
.
.
.
.
b[996] = 1992
b[997] = 1994
b[998] = 1996
b[999] = 1998
This works fine.
My CUDA code (not working correctly) is:
#include <cmath>
#include <iostream>
#include <cuda_runtime.h>
using std::cout;

__global__
void checkGpuBlockValue(unsigned int *a, unsigned int *b, int length)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < length) {
        b[i] = a[i] + i;
    }
}

int main()
{
    const int range = 1000;
    unsigned int *a = new unsigned int[range];
    unsigned int *b = new unsigned int[range];
    unsigned int *dev_a;
    unsigned int *dev_b;
    for (int i = 0; i < range; i++)
    {
        a[i] = i;
    }
    cudaMalloc((void**)&dev_a, range * sizeof(unsigned int));
    cudaMalloc((void**)&dev_b, range * sizeof(unsigned int));
    cudaMemcpy(dev_a, a, range, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, a, range, cudaMemcpyHostToDevice);
    static const int BLOCK_WIDTH = 8;
    // 1024 is the maximum number of threads per block for modern GPUs.
    int x = static_cast<int>(ceilf(static_cast<float>(range) / BLOCK_WIDTH));
    const dim3 grid(x, 1);
    const dim3 block(BLOCK_WIDTH, 1);
    checkGpuBlockValue<<<grid, block>>>(dev_a, dev_b, range);
    cudaDeviceSynchronize();
    cudaMemcpy(b, dev_b, range, cudaMemcpyDeviceToHost);
    for (int j = 0; j < range; j++)
    {
        cout << "b[" << j << "] = " << b[j] << std::endl;
    }
    cudaFree(dev_a);
    cudaFree(dev_b);
}
Output:
b[0] = 0
b[1] = 2
b[2] = 4
b[3] = 6
.
.
.
.
.
b[242] = 484
b[243] = 486
b[244] = 488
b[245] = 490
b[246] = 492
b[247] = 494
b[248] = 496
b[249] = 498
b[250] = 3452816845
b[251] = 3452816845
b[252] = 3452816845
b[253] = 3452816845
b[254] = 3452816845
b[255] = 3452816845
b[256] = 3452816845
.
.
.
.
.
.
b[996] = 3452816845
b[997] = 3452816845
b[998] = 3452816845
b[999] = 3452816845
In my code I put the values 0 to 999 into a, then add each element's index (0 to 999) to the corresponding element of a and store the result in b. The code works for indices 0 to 249, but from index 250 onward it gives wrong values.
What am I doing wrong here? Please advise.
Just from looking at your code, it looks like the problem is in these lines:
cudaMemcpy(dev_a, a, range, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, a, range, cudaMemcpyHostToDevice);
....
....
cudaMemcpy(b, dev_b, range, cudaMemcpyDeviceToHost);
They should be:
cudaMemcpy(dev_a, a, range* sizeof(unsigned int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, a, range* sizeof(unsigned int), cudaMemcpyHostToDevice);
....
....
cudaMemcpy(b, dev_b, range * sizeof(unsigned int), cudaMemcpyDeviceToHost);
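The reason is that the third argument of cudaMemcpy is a size in bytes, not an element count. Passing range copies only 1000 bytes, which is 250 values when unsigned int is 4 bytes, and that matches the output being correct up to b[249]. The repeated value 3452816845 is 0xCDCDCDCD, which is most likely the fill pattern the Visual Studio debug runtime writes into freshly allocated heap memory, so b[250] onward was probably never written at all. Below is a minimal host-only sketch of the same effect using std::memcpy, which also counts bytes. The helper elementsCopied and the constant kUntouched are hypothetical names made up for this illustration, not part of CUDA.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for "never written" memory. 0xCDCDCDCD equals 3452816845,
// the value printed from b[250] onward in the question.
const unsigned int kUntouched = 0xCDCDCDCDu;

// Hypothetical helper: copies byteCount BYTES of a 0..range-1 ramp into a
// destination pre-filled with kUntouched, then returns how many leading
// elements of the destination match the source.
// byteCount must not exceed range * sizeof(unsigned int).
std::size_t elementsCopied(std::size_t range, std::size_t byteCount) {
    std::vector<unsigned int> src(range), dst(range, kUntouched);
    for (std::size_t i = 0; i < range; ++i) {
        src[i] = static_cast<unsigned int>(i);
    }
    // Like cudaMemcpy, std::memcpy takes a byte count.
    std::memcpy(dst.data(), src.data(), byteCount);

    std::size_t n = 0;
    while (n < range && dst[n] == src[n]) {
        ++n;
    }
    return n;
}
```

Calling elementsCopied(1000, 1000) reproduces the bug and reports only the first quarter of the array as copied on a platform with 4-byte unsigned int, while elementsCopied(1000, 1000 * sizeof(unsigned int)) reports all 1000 elements.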
I just checked by modifying your code, and it works as you expected. However, I strongly recommend adding proper error checking as good programming practice.
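Since the question title suspects the threads and blocks, it may help to confirm that the launch configuration itself is fine. Here is a rough host-side sketch, assuming a 1-D grid like the one in the question, that replays the kernel's index arithmetic on the CPU and checks that every element is visited exactly once. The function names blocksFor and coversExactlyOnce are made up for this illustration. The integer ceiling division is a common alternative to the ceilf call in the question, not something the original code uses.

```cpp
#include <vector>

// Integer ceiling division: number of blocks needed to cover n elements.
int blocksFor(int n, int blockWidth) {
    return (n + blockWidth - 1) / blockWidth;
}

// Replays i = blockIdx.x * blockDim.x + threadIdx.x on the host and returns
// true if every index in [0, n) is visited exactly once.
bool coversExactlyOnce(int n, int blockWidth) {
    std::vector<int> hits(n, 0);
    const int gridWidth = blocksFor(n, blockWidth);
    for (int block = 0; block < gridWidth; ++block) {
        for (int thread = 0; thread < blockWidth; ++thread) {
            const int i = block * blockWidth + thread;
            if (i < n) {          // same bounds guard as the kernel
                ++hits[i];
            }
        }
    }
    for (int i = 0; i < n; ++i) {
        if (hits[i] != 1) {
            return false;
        }
    }
    return true;
}
```

With range = 1000 and BLOCK_WIDTH = 8 this gives 125 blocks and full coverage, which supports the diagnosis that the wrong values come from the copy sizes rather than from the thread and block setup.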