
CUDA atomicAdd failed

The following CUDA kernel is supposed to perform image-slice addition for a 3D image, i.e., collapse the 3D volume along one dimension and produce a single 2D image through pixel-wise addition. The image_in data pointer has size 128 * 128 * 128 and was obtained from an ITK::Image using the function GetOutputBuffer(). After reading the ITK documentation, I think we can safely assume that the data pointer points to a contiguous segment of the image data, without padding. The image_out is just a 2D image of size 128 * 128, also produced from an ITK::Image. I include the information about the images only for completeness; the question is really about CUDA atomics and might be very elementary. The code computes the thread id first and projects it into the range 128 * 128, which means all pixels on the same line along the dimension we collapse will map to the same idx. Then, using this idx, atomicAdd is used to update image_out.

__global__ void add_slices(int* image_in, int* image_out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // global index into the 3D volume
    int idx = tid % (128 * 128);                      // position within a single 128x128 slice
    int temp = image_in[tid];

    // Threads from different slices share the same idx,
    // so the accumulation must be atomic.
    atomicAdd( &image_out[idx], temp );
}
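(As a side note for readers: the launch below happens to tile the volume exactly, but if the grid did not, a bounds check would keep stray threads from reading past the buffer. A minimal sketch under that assumption; the kernel name and the n_voxels/slice_size parameters are my own, not from the original code:)

```cuda
__global__ void add_slices_checked(const int* image_in, int* image_out,
                                   int n_voxels, int slice_size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= n_voxels) return;  // guard against a partial last block
    atomicAdd(&image_out[tid % slice_size], image_in[tid]);
}
```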

I initialized image_out as follows; I tried two approaches, with similar results:

int* image_out = new int[128 * 128];
for (int i = 0; i < 128 * 128; ++i) {
    image_out[i] = 0;  // assign image_out to zeros
}

and the other using the ITK interface:

out_image->SetRegions(region2d);
out_image->Allocate();
out_image->FillBuffer(0);
// Obtain the data buffer
int* image_out = out_image->GetOutputBuffer();

Then I set up CUDA as follows:

unsigned int size_in = 128 * 128 * 128;
unsigned int size_out = 128 * 128;
int *dev_in;
int *dev_out;
cudaMalloc( (void**)&dev_in, size_in * sizeof(int) );
cudaMalloc( (void**)&dev_out, size_out * sizeof(int));
cudaMemcpy( dev_in, image_in, size_in * sizeof(int), cudaMemcpyHostToDevice );
add_slices<<<size_in/64, 64 >>>(dev_in, dev_out);
cudaMemcpy( image_out, dev_out, size_out * sizeof(int), cudaMemcpyDeviceToHost);
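(Editor's note: none of the runtime calls above are checked for errors, which makes intermittent failures like this harder to diagnose. A minimal sketch of an error-checking wrapper; the CHECK_CUDA macro name is my own, not part of the original code:)

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage around the calls above:
//   CHECK_CUDA( cudaMalloc((void**)&dev_in, size_in * sizeof(int)) );
//   add_slices<<<size_in / 64, 64>>>(dev_in, dev_out);
//   CHECK_CUDA( cudaGetLastError() );       // catches launch-configuration errors
//   CHECK_CUDA( cudaDeviceSynchronize() );  // catches asynchronous kernel errors
```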

Is there any problem with the above code? The reason I am seeking help here is the frustration that the above code only occasionally produces the right result (maybe once every 50 runs; I swear I have seen the correct result at least twice), while the rest of the time it just produces garbage. Does the issue come from the atomicAdd() function? Initially my image type was double, for which my target architecture does not support atomicAdd(double*, double), so I used the workaround provided by Nvidia:

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread updated the value meanwhile
    return __longlong_as_double(old);
}

Then, just for testing, I switched all my images to int, and the situation was still the same: garbage most of the time, with a correct result once in a blue moon.

Do I need to turn on some compile flag? I am using CMake to build the project, with

find_package(CUDA QUIET REQUIRED)

for the CUDA support. This is how I set up CUDA_NVCC_FLAGS:

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -arch=sm_30")

Maybe I missed something?

Any suggestion will be greatly appreciated, and I will update the question if more info about the code is needed.

So it turns out that the solution to this problem is adding the following line to initialize the memory pointed to by dev_out:

cudaMemcpy( dev_out, image_out, size_out * sizeof(int), cudaMemcpyHostToDevice );
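(Equivalently, since the buffer only needs to start at zero, it can be zeroed directly on the device, which avoids the extra host-to-device copy. A sketch:)

```cuda
// Zero the device-side output buffer before launching the kernel,
// instead of copying a zeroed host buffer over.
cudaMemset(dev_out, 0, size_out * sizeof(int));
```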

I forgot to initialize it because I was thinking of it as an output variable, and I had only initialized it on the host.

As talonmies said, it has nothing to do with atomicAdd at all. Both the int and double versions of atomicAdd work perfectly. Just remember to initialize your variables on the device.
