2 variables into CUDA kernel

Question

I want to solve a matrix using a CUDA kernel. The matrix is stored in a flat array and indexed with i and j like this:

M[i*N+j]

Assuming that I want to copy elements from M to another array such as M_temp, I would do this:

M_temp[i*N+j] = M[i*N+j];

I have the following declaration for the blocks and threads:

dim3 grid = dim3(2, 1, 1);
dim3 block = dim3(10, 10, 1);

I don't know if I am wrong, but according to this declaration I would have 100 threads per block, 200 threads in total.

Inside the kernel I want to use the indexes.

__global__ void kernel(double *M)
{
    int i = ???;
    int j = ???;

}

I would like to use at least 100 threads per block, so that the maximum matrix size would be:

M[100x100]

But I want to use

1 block for variable i

and

1 different block for variable j.

I have been thinking about using,

__global__ void kernel(double *M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.x * blockIdx.x;

    __syncthreads();
    M_temp[i*N+j] = M[i*N+j];        
}

But this way uses all the blocks in x. I don't know; I'm confused. Please help me.

By the way, my GPU is a GeForce 610M.

Thank you

Answer 1

If you want to perform some operation on your 100x100 matrix, and you want each thread to handle one entry, you need 10000 threads. Because there is a limit on the number of threads in one block (typically 1024, for example 32x32), you need to increase the grid size:

 dim3 grid = dim3(10, 10, 1);
 dim3 block = dim3(10, 10, 1);

Now inside your kernel you simply compute i and j:

 int i = blockIdx.x * blockDim.x + threadIdx.x;
 int j = blockIdx.y * blockDim.y + threadIdx.y;
 M[i*N+j] = ...

With these grid and block sizes, blockDim.x = blockDim.y = 10, and blockIdx.x, blockIdx.y, threadIdx.x and threadIdx.y each range from 0 to 9, so i and j range from 0 to 99.
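To make the indexing above concrete, here is a minimal sketch of a complete copy program with one thread per element. The names copyKernel and runCopy are hypothetical, as is the choice to pass M_temp and N as kernel arguments; the bounds check is an added assumption so the same kernel also works when N is not a multiple of the block size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per matrix element. M_temp and N are passed as arguments
// (an assumption; the question does not show where they are declared).
__global__ void copyKernel(const double *M, double *M_temp, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column index
    if (i < N && j < N)                             // skip threads outside the matrix
        M_temp[i * N + j] = M[i * N + j];
}

// Hypothetical host helper: copies an N x N matrix on the device and
// returns true if the result matches the input.
bool runCopy(int N)
{
    std::vector<double> h_in(N * N), h_out(N * N, -1.0);
    for (int k = 0; k < N * N; ++k) h_in[k] = k;

    size_t bytes = sizeof(double) * N * N;
    double *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(10, 10, 1);                          // 100 threads per block
    dim3 grid((N + block.x - 1) / block.x,          // round up so the grid covers N
              (N + block.y - 1) / block.y, 1);
    copyKernel<<<grid, block>>>(d_in, d_out, N);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess && h_in == h_out;
}

int main()
{
    printf("copy %s\n", runCopy(100) ? "ok" : "failed");
    return 0;
}
```

For N = 100 the rounded-up grid works out to 10x10 blocks, the same as the declaration above.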

In general I follow the pattern used in the CUDA samples. For an arbitrary grid size and block size, your kernel should look like this:

const int numThreadsx = blockDim.x * gridDim.x;
const int threadIDx = blockIdx.x * blockDim.x + threadIdx.x;

const int numThreadsy = blockDim.y * gridDim.y;
const int threadIDy = blockIdx.y * blockDim.y + threadIdx.y;


for (int i = threadIDx; i < N; i += numThreadsx)
    for (int j = threadIDy; j < N; j += numThreadsy)
        M[i*N+j]=...

Note that you need to pass the variable N into your kernel.
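As a sketch of how the strided loops fit into a full program, the example below copies a 100x100 matrix using the launch configuration from the question (2 blocks of 10x10 threads, 200 threads in total). The names copyStrided and runStrided are hypothetical, and passing M_temp alongside N is an assumption.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread starts at its own (i, j) and strides by the total number of
// threads in each dimension, so any grid/block size covers the whole matrix.
__global__ void copyStrided(const double *M, double *M_temp, int N)
{
    const int numThreadsx = blockDim.x * gridDim.x;
    const int threadIDx = blockIdx.x * blockDim.x + threadIdx.x;

    const int numThreadsy = blockDim.y * gridDim.y;
    const int threadIDy = blockIdx.y * blockDim.y + threadIdx.y;

    for (int i = threadIDx; i < N; i += numThreadsx)
        for (int j = threadIDy; j < N; j += numThreadsy)
            M_temp[i * N + j] = M[i * N + j];
}

// Hypothetical host helper: returns true if the device copy matches the input.
bool runStrided(int N)
{
    std::vector<double> h_in(N * N), h_out(N * N, -1.0);
    for (int k = 0; k < N * N; ++k) h_in[k] = k;

    size_t bytes = sizeof(double) * N * N;
    double *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 grid(2, 1, 1);     // launch configuration from the question
    dim3 block(10, 10, 1);  // 20 threads along x, 10 along y in total
    copyStrided<<<grid, block>>>(d_in, d_out, N);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess && h_in == h_out;
}

int main()
{
    printf("strided copy %s\n", runStrided(100) ? "ok" : "failed");
    return 0;
}
```

With this configuration each thread handles several entries instead of one, so the matrix size is no longer tied to the number of threads launched.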

solution1
2 ACCEPTED 2016-04-25 16:13:29