CUDA kernel not called by all blocks
I got strange behavior trying to run a simple vector addition. If I run the code below with the printf call in place, everything works fine and I get the expected result, 5050.
Now, if I remove the printf call, only the first block executes and I get 2080, which is the expected result for the sum of 1 to 64.
Does anyone know what's happening here?
Thanks in advance for your help.
vecSum.cu:
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <math.h>
#define BLOCK_SIZE 64
__global__
void vecSumKernel(int N, float *d_v, float *d_out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int t = threadIdx.x;
    printf("Processing block #: %i\n", blockIdx.x);

    __shared__ float partialSum[BLOCK_SIZE];
    if(idx < N)
        partialSum[t] = d_v[idx];
    else
        partialSum[t] = 0;

    for(unsigned int stride=1; stride < BLOCK_SIZE; stride *= 2)
    {
        __syncthreads();
        if(t % (2*stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    __syncthreads();

    *d_out += partialSum[0];
}
void vecSum_wrapper(int N, float *v, float &out, cudaDeviceProp devProp)
{
    float *d_v;
    float *d_out;
    size_t size = N*sizeof(float);

    cudaMalloc(&d_v, size);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_v, v, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, &out, sizeof(float), cudaMemcpyHostToDevice);

    int nbrBlocks = ceil((float)N / (float)BLOCK_SIZE);
    vecSumKernel<<<nbrBlocks, BLOCK_SIZE>>>(N, d_v, d_out);
    cudaDeviceSynchronize();

    cudaMemcpy(&out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_v);
    cudaFree(d_out);
}
main.cpp:
int main()
{
    ...
    int N = 100;
    float *vec = new float[N];
    for(int i=0; i < N; ++i)
        vec[i] = i + 1;

    std::chrono::time_point<timer> start = timer::now();
    float result = 0;
    vecSum_wrapper(N, vec, result, devProp);
    std::cout << "Operation executed in " << std::chrono::duration_cast<chrono>(timer::now() - start).count() << " ms \n";
    std::cout << "Result: " << result << '\n';

    delete[] vec;
    return 0;
}
It seems that the last line of your kernel, *d_out += partialSum[0], exposes a concurrency issue: as you surely know, __syncthreads does not synchronize blocks, so every thread of every block performs this unsynchronized read-modify-write on the same global memory location. atomicAdd may solve this concurrency issue.

As for why it works better with printf, I would assume that printf requires some synchronization, so the blocks do not reach this last instruction at the same time, but I have nothing to prove this.
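As a minimal sketch (assuming the same BLOCK_SIZE and reduction scheme as in the question), the final update could look like this, with only thread 0 of each block issuing a single atomicAdd instead of every thread doing a racy +=:

```cuda
__global__
void vecSumKernelAtomic(int N, float *d_v, float *d_out)
{
    unsigned int t = threadIdx.x;
    int idx = blockDim.x * blockIdx.x + t;

    // Load this block's slice into shared memory, padding with zeros.
    __shared__ float partialSum[BLOCK_SIZE];
    partialSum[t] = (idx < N) ? d_v[idx] : 0.0f;

    // Same tree reduction as before; __syncthreads only synchronizes
    // the threads of this block.
    for(unsigned int stride = 1; stride < BLOCK_SIZE; stride *= 2)
    {
        __syncthreads();
        if(t % (2*stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }

    // One atomic update per block: atomicAdd serializes the
    // cross-block accumulation into *d_out.
    if(t == 0)
        atomicAdd(d_out, partialSum[0]);
}
```

Note the `if(t == 0)` guard: even with atomicAdd, letting all 64 threads of a block add partialSum[0] would accumulate each block's partial sum 64 times.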