No result obtained from calculation

I am trying to convolve an image using CUDA, but I cannot get a result. cuda-gdb does not work properly on my system, so I cannot tell what is happening inside the CUDA kernel. The CUDA kernel I am using is the following:

__global__
void
convolve_component_EXTEND_kern(const JSAMPLE *data, // image data
                           ssize_t data_width, // image width
                           ssize_t data_height, // image height
                           const float *kern, // convolution kernel data
                           ssize_t kern_w_f, // convolution kernel has a width of 2 * kern_w_f + 1
                           ssize_t kern_h_f, // convolution_kernel has a height of 2 * kern_h_f + 1
                           JSAMPLE *res) // array to store the result
{
ssize_t i = ::blockIdx.x * ::blockDim.x + ::threadIdx.x;
ssize_t j = ::blockIdx.y * ::blockDim.y + ::threadIdx.y;

float value = 0;

for (ssize_t m = 0; m < 2 * kern_w_f + 1; m++) {
    for (ssize_t n = 0; n < 2 * kern_h_f + 1; n++) {
            ssize_t x = i + m - kern_w_f; // column index for this contribution to convolution sum for (i, j)
            ssize_t y = j + n - kern_h_f; // row index for ...
            x = x < 0 ? 0 : (x >= data_width ? data_width - 1 : x);
            y = y < 0 ? 0 : (y >= data_height ? data_height - 1 : y);
            value += ((float) data[data_width * y + x]) * kern[(2 * kern_w_f + 1) * n + m];
    }
}

res[data_width * j + i] = (JSAMPLE) value;
}
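
With no working debugger, one way to check the kernel's arithmetic is to compare its output against a plain CPU version of the same loop. The following is a minimal sketch, not part of the original code: the name convolve_extend_ref is hypothetical, Sample stands in for JSAMPLE under the assumption that samples are 8-bit, and std::ptrdiff_t stands in for ssize_t.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Sample = unsigned char;  // assumption: stands in for JSAMPLE (8-bit samples)

// CPU reference for the clamp-to-edge ("EXTEND") convolution done by the kernel.
// Like the kernel, it does not saturate: sums outside [0, 255] are not handled.
std::vector<Sample> convolve_extend_ref(const std::vector<Sample>& data,
                                        std::ptrdiff_t width,
                                        std::ptrdiff_t height,
                                        const std::vector<float>& kern,
                                        std::ptrdiff_t kern_w_f,
                                        std::ptrdiff_t kern_h_f)
{
    const std::ptrdiff_t kw = 2 * kern_w_f + 1;  // kernel width
    const std::ptrdiff_t kh = 2 * kern_h_f + 1;  // kernel height
    assert(data.size() == static_cast<std::size_t>(width * height));
    assert(kern.size() == static_cast<std::size_t>(kw * kh));

    std::vector<Sample> res(data.size());
    for (std::ptrdiff_t j = 0; j < height; j++) {
        for (std::ptrdiff_t i = 0; i < width; i++) {
            float value = 0.0f;
            for (std::ptrdiff_t m = 0; m < kw; m++) {
                for (std::ptrdiff_t n = 0; n < kh; n++) {
                    std::ptrdiff_t x = i + m - kern_w_f;  // column of this contribution
                    std::ptrdiff_t y = j + n - kern_h_f;  // row of this contribution
                    // Clamp out-of-range coordinates to the nearest edge pixel.
                    x = x < 0 ? 0 : (x >= width ? width - 1 : x);
                    y = y < 0 ? 0 : (y >= height ? height - 1 : y);
                    value += static_cast<float>(data[width * y + x]) * kern[kw * n + m];
                }
            }
            res[width * j + i] = static_cast<Sample>(value);
        }
    }
    return res;
}
```

For example, a 3-wide, 1-high kernel with weights {1, 0, 0} applied to the single row {10, 20, 30} picks the left neighbour of each pixel and should give {10, 10, 20}, with the first pixel coming from the clamped edge.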

and I am invoking it in this function:

void
convolve_component_EXTEND_cuda(const JSAMPLE *data,
                           ssize_t data_width,
                           ssize_t data_height,
                           const float *kern,
                           ssize_t kern_w_f,
                           ssize_t kern_h_f,
                           JSAMPLE *res)
{
JSAMPLE *d_data;
cudaMallocManaged(&d_data,
                  data_width * data_height * sizeof(JSAMPLE));
cudaMemcpy(d_data,
           data,
           data_width * data_height * sizeof(JSAMPLE),
           cudaMemcpyHostToDevice);

float *d_kern;
cudaMallocManaged(&d_kern,
                  (2 * kern_w_f + 1) * (2 * kern_h_f + 1) * sizeof(float));
cudaMemcpy(d_kern,
           kern,
           (2 * kern_w_f + 1) * (2 * kern_h_f + 1) * sizeof(float),
           cudaMemcpyHostToDevice);

JSAMPLE *d_res;
cudaMallocManaged(&d_res,
                  data_width * data_height * sizeof(JSAMPLE));

dim3 threadsPerBlock(16, 16);  // can be adjusted to 32, 32 (1024 threads per block is the maximum)
dim3 numBlocks(data_width / threadsPerBlock.x,
               data_height / threadsPerBlock.y);
convolve_component_EXTEND_kern<<<numBlocks, threadsPerBlock>>>(d_data,
                                                               data_width,
                                                               data_height,
                                                               d_kern,
                                                               kern_w_f,
                                                               kern_h_f,
                                                               d_res);

cudaDeviceSynchronize();

cudaMemcpy(d_res,
           res,
           data_width * data_height * sizeof(JSAMPLE),
           cudaMemcpyDeviceToHost);
cudaFree(d_data);
cudaFree(d_kern);
cudaFree(d_res);
}
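
Separately from the problem asked about here, the grid size above uses integer division, which rounds down: a width of 33 with 16 threads per block gives 2 blocks, so the last column is never processed, and a dimension smaller than 16 gives zero blocks, which is not a valid launch configuration. A common remedy is to round up instead. The helper below is a minimal sketch with a hypothetical name, blocks_for, using std::ptrdiff_t in place of ssize_t.

```cpp
#include <cassert>
#include <cstddef>

// Smallest number of blocks such that blocks * block_size >= n (ceiling division).
std::ptrdiff_t blocks_for(std::ptrdiff_t n, std::ptrdiff_t block_size)
{
    assert(n > 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

If the grid were sized as blocks_for(data_width, threadsPerBlock.x) by blocks_for(data_height, threadsPerBlock.y), some threads would fall outside the image, so the kernel would also need a guard such as if (i >= data_width || j >= data_height) return; before it writes to res.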

In this context, the image data is contained in the array called data in such a way that the pixel at (i, j) is accessed by indexing into the array at data_width * j + i. The kernel data is in the array called kern, which has a width of 2 * kern_w_f + 1 and a height of 2 * kern_h_f + 1. The element at (i, j) is accessed by indexing into the kern array at (2 * kern_w_f + 1) * j + i, just like the data array. The array res is used to store the result of the convolution, and is allocated using calloc() before being passed to the function.

When I invoke the second function on an image's data, all the image's pixels are set to 0 instead of the convolution being applied. Can anyone please point out the problem?

Just after launching the kernel and synchronizing, you try to copy the result back to the res array, but the source and destination arguments are swapped:

cudaDeviceSynchronize();

cudaMemcpy(d_res,
       res,
       data_width * data_height * sizeof(JSAMPLE),
       cudaMemcpyDeviceToHost); 

This should be:

cudaDeviceSynchronize();

cudaMemcpy(res,
       d_res,
       data_width * data_height * sizeof(JSAMPLE),
       cudaMemcpyDeviceToHost);

because the first argument of cudaMemcpy is the destination pointer:

cudaError_t cudaMemcpy  ( void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
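
This also explains the all-zero output: res comes from calloc(), and with the arguments swapped nothing is ever written into it, so it keeps its initial zeros. The host-only sketch below illustrates the effect with std::memcpy, which uses the same destination-first argument order. It is an analogy rather than CUDA code, and copy_back_demo and its vectors are hypothetical stand-ins for the real buffers.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Returns what "res" holds after the copy-back step, for either argument order.
std::vector<unsigned char> copy_back_demo(bool swapped)
{
    std::vector<unsigned char> d_res = {7, 8, 9};     // stand-in for the kernel's output
    std::vector<unsigned char> res(d_res.size(), 0);  // zero-filled, as calloc() leaves it
    assert(res.size() == d_res.size());

    if (swapped) {
        // Destination is d_res, so res is only read and never written.
        std::memcpy(d_res.data(), res.data(), res.size());
    } else {
        // Destination first, source second: res receives the result.
        std::memcpy(res.data(), d_res.data(), res.size());
    }
    return res;
}
```

Calling copy_back_demo(true) leaves res as all zeros, matching the symptom in the question, while copy_back_demo(false) returns the computed values.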
