
Can't understand the behaviour of CUDA kernel launch

#include "utils.h"

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
  for (size_t r = 0; r < numRows; ++r) {
    for (size_t c = 0; c < numCols; ++c) {
      uchar4 rgba = rgbaImage[r * numCols + c];
      float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
      greyImage[r * numCols + c] = channelSum;
    }
  }
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
  const dim3 blockSize(1, 1, 1);  //TODO
  const dim3 gridSize( 1, 1, 1);  //TODO
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);

  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

}

This is the code used for converting a color image to grayscale. I am working on this assignment for a course and got these results after completing it.

A.
blockSize = (1, 1, 1)
gridSize = (1, 1, 1)
Your code ran in: 34.772705 msecs.

B.
blockSize = (numCols, 1, 1)
gridSize = (numRows, 1, 1)
Your code ran in: 1821.326416 msecs.

C.
blockSize = (numRows, 1, 1)
gridSize = (numCols, 1, 1)
Your code ran in: 1695.917480 msecs.

D.
blockSize = (1024, 1, 1)
gridSize = (170, 1, 1) [the image size is r = 313, c = 557, so blockSize * gridSize ≈ r * c]
Your code ran in: 1709.109863 msecs.

I have tried a few more combinations, but none performed better than A. I got close, with only a tiny difference, when increasing blockSize and gridSize by small values. For example:

blockSize = (10, 1, 1)
gridSize = (10, 1, 1)
Your code ran in: 34.835167 msecs.

I don't understand why higher numbers don't give better performance and instead lead to worse performance. Also, it seems that increasing blockSize is better than increasing gridSize.

You calculate all the pixels in every thread you launch, i.e. the kernel is completely serial. Using more blocks or larger blocks just repeats the same calculations, which is why launching more threads only makes things slower. Instead, why not move the for loops out of the kernel and have each thread calculate one pixel?
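A minimal sketch of that one-thread-per-pixel version, reusing the assignment's kernel signature. The 2D launch geometry and the 16x16 block shape are assumptions, not part of the original code; the launch lines would replace the TODOs inside your_rgba_to_greyscale:

```cuda
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
  // Map this thread to exactly one pixel; the serial loops are gone.
  const int c = blockIdx.x * blockDim.x + threadIdx.x;
  const int r = blockIdx.y * blockDim.y + threadIdx.y;
  if (r >= numRows || c >= numCols) return;  // guard threads past the image edge

  const uchar4 rgba = rgbaImage[r * numCols + c];
  const float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
  greyImage[r * numCols + c] = static_cast<unsigned char>(channelSum);
}

// Inside your_rgba_to_greyscale: 16x16 = 256 threads per block, and
// ceiling division so the grid covers every pixel even when the image
// dimensions are not multiples of the block size.
const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                    (numRows + blockSize.y - 1) / blockSize.y, 1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
```

With this structure, adding threads actually divides the work instead of duplicating it, so larger launches should now run faster rather than slower.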
