Question on the dimension of CUDA block indexing

Question

In the following CUDA code, taken from the book "Accelerating MATLAB with GPU Computing: A Primer with Examples", I think

int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 1)
    return;

int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 1)
    return;

should actually be

int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 0 || row > numRows - 1)
    return;

int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 0 || col > numCols - 1)
    return;

Am I right? The following is the whole code, which performs image convolution in a CUDA kernel called from MATLAB.

#include "conv2Mex.h"

__global__ void conv2MexCuda(float* src,
                             float* dst,
                             int numRows,
                             int numCols,
                             float* mask)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < 1 || row > numRows - 1)
        return;

    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < 1 || col > numCols - 1)
        return;

    int dstIndex = col * numRows + row;
    dst[dstIndex] = 0;
    int mskIndex = 3 * 3 - 1;
    for (int kc = -1; kc < 2; kc++)
    {
        int srcIndex = (col + kc) * numRows + row;
        for (int kr = -1; kr < 2; kr++)
        {
            dst[dstIndex] += mask[mskIndex--] * src[srcIndex + kr];
        }
    }
}

void conv2Mex(float* src, float* dst, int numRows, int numCols, float* msk)
{
    ...
    conv2MexCuda<<<gridSize, blockSize>>>...
    ...
}

Answer 1

Am I right?

I don't think you are right.

The construction of the row and col indices in the kernel code is such that they will vary (across threads in the grid) from 0 to numRows-1 and 0 to numCols-1 (and perhaps larger, depending on actual grid sizing, which you haven't shown).
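To see how the indices can run past numRows-1, here is a small host-side C++ sketch. It assumes the grid is sized by ceiling division, which is a common choice but is not shown in the question; the helper names blocksFor and maxGlobalIndex are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements when
// each block holds blockDim threads (ceiling division). The question does
// not show the real grid sizing, so this is only an assumption.
int blocksFor(int n, int blockDim)
{
    assert(n > 0 && blockDim > 0);
    return (n + blockDim - 1) / blockDim;
}

// Largest value that blockIdx * blockDim + threadIdx can take in that grid.
int maxGlobalIndex(int n, int blockDim)
{
    return blocksFor(n, blockDim) * blockDim - 1;
}
```

With numRows = 250 and 16 threads per block in that dimension, maxGlobalIndex(250, 16) is 255, so threads with row values 250 through 255 exist and have to be filtered out by the bounds check.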

Based on the code you have shown, the mask is evidently a 3x3 mask, which means that it acts as a stencil over the current (row, col) position, and extends plus and minus one row, and plus and minus one column. Let's take a careful look at the indexing here for the case where (row, col) = (0,0); this is one of the positions you have allowed to execute based on your proposed change:

for (int kc = -1; kc < 2; kc++)
{
    int srcIndex = (col + kc) * numRows + row;
    for (int kr = -1; kr < 2; kr++)
    {
        dst[dstIndex] += mask[mskIndex--] * src[srcIndex + kr];
    }
}

At the first iteration of the outer for loop, kc is -1, so srcIndex is (0-1)*numRows+0. Let's assume numRows is reasonably large, like 256. Then srcIndex is -1*256 = -256. At the first iteration of the inner for loop, kr is -1, so the computed index for the access to src is -256-1 = -257. That is almost never sensible: a negative index refers to memory before the start of src.
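The same arithmetic can be checked on the host. This is a minimal C++ sketch, not part of the original code; srcIndexAt is a hypothetical helper that reproduces the kernel's srcIndex + kr expression for the column-major layout.

```cpp
#include <cassert>

// Same arithmetic as the kernel's inner loop: the flat index into src that
// is touched for stencil offsets (kr, kc) around position (row, col).
// srcIndexAt is a hypothetical name used only for this illustration.
int srcIndexAt(int row, int col, int kr, int kc, int numRows)
{
    assert(numRows > 0);
    int srcIndex = (col + kc) * numRows + row;
    return srcIndex + kr;
}
```

For (row, col) = (0,0) and numRows = 256, srcIndexAt(0, 0, -1, -1, 256) gives -257, matching the value worked out above. For (row, col) = (1,1) the same offsets give 0, the first valid element.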

If anything, the upper bounds look incorrect to me. If we assume that the valid image index ranges are 0..numRows-1 and 0..numCols-1, then I think the restrictions should be as follows:

int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 2)
    return;

int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 2)
    return;

The upper bound in the original code appears to be the classic computer science off-by-one error.
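The difference between the two guards can be tested on the host. This is a minimal C++ sketch under the assumption that valid image indices are 0..numRows-1 and 0..numCols-1; the function names are hypothetical. It checks each dimension separately, which is stricter than checking the flat index: with column-major storage, a row overflow usually lands in the neighbouring column rather than outside the buffer, but the result is still wrong.

```cpp
#include <cassert>

// True if every access of the 3x3 stencil centred on (row, col) stays
// inside the numRows x numCols image, checked per dimension.
bool stencilInBounds(int row, int col, int numRows, int numCols)
{
    assert(numRows > 0 && numCols > 0);
    for (int kc = -1; kc < 2; kc++)
    {
        for (int kr = -1; kr < 2; kr++)
        {
            int r = row + kr;
            int c = col + kc;
            if (r < 0 || r >= numRows || c < 0 || c >= numCols)
                return false;
        }
    }
    return true;
}

// Guard as written in the question (upper bounds numRows - 1, numCols - 1).
bool passesOriginalGuard(int row, int col, int numRows, int numCols)
{
    return !(row < 1 || row > numRows - 1 || col < 1 || col > numCols - 1);
}

// Guard with the tightened upper bounds (numRows - 2, numCols - 2).
bool passesTightenedGuard(int row, int col, int numRows, int numCols)
{
    return !(row < 1 || row > numRows - 2 || col < 1 || col > numCols - 2);
}
```

For a 256x256 image, position (255, 255) passes the original guard even though its stencil reaches row 256 and column 256; the tightened guard rejects it and still accepts (254, 254), the last position whose full 3x3 neighbourhood exists.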
