Computation gives no result

Question

I am trying to convolve an image using CUDA, but I cannot get a result. cuda-gdb does not work properly on my system, so I cannot tell what is happening inside the CUDA kernel. The CUDA kernel I am using is the following:

__global__
void
convolve_component_EXTEND_kern(const JSAMPLE *data, // image data
                           ssize_t data_width, // image width
                           ssize_t data_height, // image height
                           const float *kern, // convolution kernel data
                           ssize_t kern_w_f, // convolution kernel has a width of 2 * kern_w_f + 1
                           ssize_t kern_h_f, // convolution_kernel has a height of 2 * kern_h_f + 1
                           JSAMPLE *res) // array to store the result
{
ssize_t i = ::blockIdx.x * ::blockDim.x + ::threadIdx.x;
ssize_t j = ::blockIdx.y * ::blockDim.y + ::threadIdx.y;

float value = 0;

for (ssize_t m = 0; m < 2 * kern_w_f + 1; m++) {
    for (ssize_t n = 0; n < 2 * kern_h_f + 1; n++) {
            ssize_t x = i + m - kern_w_f; // column index for this contribution to convolution sum for (i, j)
            ssize_t y = j + n - kern_h_f; // row index for ...
            x = x < 0 ? 0 : (x >= data_width ? data_width - 1 : x);
            y = y < 0 ? 0 : (y >= data_height ? data_height - 1 : y);
            value += ((float) data[data_width * y + x]) * kern[(2 * kern_w_f + 1) * n + m];
    }
}

res[data_width * j + i] = (JSAMPLE) value;
}

and I am invoking it in this function:

void
convolve_component_EXTEND_cuda(const JSAMPLE *data,
                           ssize_t data_width,
                           ssize_t data_height,
                           const float *kern,
                           ssize_t kern_w_f,
                           ssize_t kern_h_f,
                           JSAMPLE *res)
{
JSAMPLE *d_data;
cudaMallocManaged(&d_data,
                  data_width * data_height * sizeof(JSAMPLE));
cudaMemcpy(d_data,
           data,
           data_width * data_height * sizeof(JSAMPLE),
           cudaMemcpyHostToDevice);

float *d_kern;
cudaMallocManaged(&d_kern,
                  (2 * kern_w_f + 1) * (2 * kern_h_f + 1) * sizeof(float));
cudaMemcpy(d_kern,
           kern,
           (2 * kern_w_f + 1) * (2 * kern_h_f + 1) * sizeof(float),
           cudaMemcpyHostToDevice);

JSAMPLE *d_res;
cudaMallocManaged(&d_res,
                  data_width * data_height * sizeof(JSAMPLE));

dim3 threadsPerBlock(16, 16);  // can be adjusted to 32, 32 (1024 threads per block is the maximum)
dim3 numBlocks(data_width / threadsPerBlock.x,
               data_height / threadsPerBlock.y);
convolve_component_EXTEND_kern<<<numBlocks, threadsPerBlock>>>(d_data,
                                                               data_width,
                                                               data_height,
                                                               d_kern,
                                                               kern_w_f,
                                                               kern_h_f,
                                                               d_res);

cudaDeviceSynchronize();

cudaMemcpy(d_res,
           res,
           data_width * data_height * sizeof(JSAMPLE),
           cudaMemcpyDeviceToHost);
cudaFree(d_data);
cudaFree(d_kern);
cudaFree(d_res);
}

In this context, the image data is stored in the array data, such that the pixel at (i, j) is accessed at index data_width * j + i. The kernel data is in the array kern, which has a width of 2 * kern_w_f + 1 and a height of 2 * kern_h_f + 1. The element at (i, j) is accessed at index (2 * kern_w_f + 1) * j + i, just like the data array. The array res stores the result of the convolution and is allocated with calloc() before being passed to the function.

When I invoke the second function on an image's data, all of the image's pixels become 0 instead of the convolution being applied. Can anyone point out the problem?

Answer 1

Just after launching the kernel and performing the convolution, you try to copy your data back to the res array:

cudaDeviceSynchronize();

cudaMemcpy(d_res,
       res,
       data_width * data_height * sizeof(JSAMPLE),
       cudaMemcpyDeviceToHost);

This should be:

cudaDeviceSynchronize();

cudaMemcpy(res,
       d_res,
       data_width * data_height * sizeof(JSAMPLE),
       cudaMemcpyDeviceToHost);

because the first argument of cudaMemcpy is the destination pointer. As written, the call copies the calloc()-zeroed res into d_res, so res never receives the result and stays all zeros.

cudaError_t cudaMemcpy  ( void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
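
Since cuda-gdb is not available to you, one way to check the GPU output after fixing the copy is to compare it against a CPU reference. The following is only a minimal sketch in plain C++ that mirrors the indexing and edge clamping of your kernel. The names Sample and convolve_reference are hypothetical, and it assumes JSAMPLE is an 8-bit unsigned type (as in a default libjpeg build) and that every convolution sum fits in that range.

```cpp
#include <cstddef>
#include <vector>

// Assumption: stands in for JSAMPLE, taken here to be an 8-bit unsigned sample.
using Sample = unsigned char;

// CPU reference for the edge-clamped convolution performed by the kernel.
// data: row-major image, pixel (i, j) stored at data[width * j + i].
// kern: row-major filter of width 2 * kw + 1 and height 2 * kh + 1.
std::vector<Sample> convolve_reference(const std::vector<Sample>& data,
                                       long width, long height,
                                       const std::vector<float>& kern,
                                       long kw, long kh)
{
    std::vector<Sample> res(data.size());
    for (long j = 0; j < height; ++j) {
        for (long i = 0; i < width; ++i) {
            float value = 0.0f;
            for (long m = 0; m < 2 * kw + 1; ++m) {
                for (long n = 0; n < 2 * kh + 1; ++n) {
                    long x = i + m - kw;  // column of this contribution
                    long y = j + n - kh;  // row of this contribution
                    // Clamp to the image border, as the kernel does.
                    x = x < 0 ? 0 : (x >= width ? width - 1 : x);
                    y = y < 0 ? 0 : (y >= height ? height - 1 : y);
                    value += static_cast<float>(data[width * y + x])
                             * kern[(2 * kw + 1) * n + m];
                }
            }
            // Assumes value lies in the representable range of Sample.
            res[width * j + i] = static_cast<Sample>(value);
        }
    }
    return res;
}
```

Running this on the same data and kern and comparing the returned vector element by element with res from the CUDA path should show whether the kernel itself is correct or whether only the copy back to the host was at fault.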
