2 variables into CUDA kernel

I want to process a matrix using a CUDA kernel. The matrix is indexed with i and j like this:

M[i*N+j]

Assuming that I want to copy elements from M to another array such as M_temp, I would do this:

M_temp[i*N+j] = M[i*N+j];

I have the following declaration for the blocks and threads:

dim3 grid = dim3(2, 1, 1);
dim3 block = dim3(10, 10, 1);

I may be wrong, but according to this declaration I would have 100 threads per block, so 200 threads in total.

Inside the kernel I want to use the indexes.

__global__ void kernel(double *M)
{
    int i = ???;
    int j = ???;

}

I would like to use at least 100 threads per block, such that the maximum matrix size would be:

M[100x100]

But I want to use

1 block for variable i

and

1 different block for variable j.

I have been thinking about using,

__global__ void kernel(double *M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.x * blockIdx.x;

    __syncthreads();
    M_temp[i*N+j] = M[i*N+j];        
}

But this way both i and j are computed from the blocks in x. I'm confused; please help.

By the way, my GPU is a GeForce 610M.

Thank you

If you want to perform some operation on your 100x100 matrix, and you want one thread per entry, you need 10000 threads. Because there is a limit on the number of threads in one block (typically 1024, for example 32x32), you need to increase the grid size:

 dim3 grid = dim3(10, 10, 1);
 dim3 block = dim3(10, 10, 1);
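The per-block limit mentioned above depends on the device, so it can be worth querying it rather than assuming 1024. The following is a minimal sketch, assuming device 0 is the GPU of interest and N = 100 as in the question; it also shows one common way to derive the grid size from N and the block size by rounding up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Assumption: device 0 is the GPU you want to inspect.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max block dims: %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("max grid dims:  %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // Assumption: N = 100, as in the question.
    const int N = 100;
    const dim3 block(10, 10, 1);  // 100 threads per block
    // Round up so the grid covers all N entries in each dimension.
    const dim3 grid((N + block.x - 1) / block.x,
                    (N + block.y - 1) / block.y,
                    1);
    std::printf("grid: %u x %u blocks of %u x %u threads\n",
                grid.x, grid.y, block.x, block.y);
    return 0;
}
```

For N = 100 and a 10x10 block the rounding-up formula gives the same 10x10 grid as above; it only makes a difference when N is not a multiple of the block size.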

Now inside your kernel you simply create i and j :

 int i = blockIdx.x * blockDim.x + threadIdx.x;
 int j = blockIdx.y * blockDim.y + threadIdx.y;
 M[i*N+j] = ...

With our grid and block sizes, blockDim.x = blockDim.y = 10, and blockIdx.x, blockIdx.y, threadIdx.x and threadIdx.y each range from 0 to 9, so that i and j range from 0 to 99.
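Putting the pieces together, a complete program for the copy in the question might look like the sketch below. The kernel name copyKernel, the extra M_temp and N parameters, the bounds check and the host-side verification are assumptions added for illustration; they are not part of the original snippets.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread copies one matrix entry.
__global__ void copyKernel(const double *M, double *M_temp, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column index

    // Guard against threads that fall outside the matrix
    // (needed whenever N is not a multiple of the block size).
    if (i < N && j < N)
        M_temp[i * N + j] = M[i * N + j];
}

int main()
{
    const int N = 100;
    const size_t bytes = size_t(N) * N * sizeof(double);

    // Fill the host matrix with nonzero values so a failed copy is detectable.
    std::vector<double> h_M(N * N), h_out(N * N);
    for (int k = 0; k < N * N; ++k) h_M[k] = k + 1;

    double *d_M = 0, *d_temp = 0;
    cudaMalloc(&d_M, bytes);
    cudaMalloc(&d_temp, bytes);
    cudaMemcpy(d_M, h_M.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_temp, 0, bytes);

    dim3 grid(10, 10, 1);   // 100 blocks
    dim3 block(10, 10, 1);  // 100 threads per block, 10000 threads in total
    copyKernel<<<grid, block>>>(d_M, d_temp, N);

    // Check both the launch and the execution for errors.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out.data(), d_temp, bytes, cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int k = 0; k < N * N; ++k)
        if (h_out[k] != h_M[k]) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_M);
    cudaFree(d_temp);
    return mismatches == 0 ? 0 : 1;
}
```

Note that no __syncthreads() is needed here: each thread reads and writes only its own entry, so there is nothing to synchronize within the block.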

In general I follow the pattern used in the CUDA samples. For an arbitrary grid size and block size, your kernel should look like this:

const int numThreadsx = blockDim.x * gridDim.x;
const int threadIDx = blockIdx.x * blockDim.x + threadIdx.x;

const int numThreadsy = blockDim.y * gridDim.y;
const int threadIDy = blockIdx.y * blockDim.y + threadIdx.y;


for (int i = threadIDx; i < N; i += numThreadsx)
    for (int j = threadIDy; j < N; j += numThreadsy)
        M[i*N+j]=...

Note that you need to pass the variable N into your kernel.
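As an illustration of why the strided loops help, the sketch below reuses the launch configuration from the question (2 blocks of 10x10 threads, 200 threads in total) and still covers the whole 100x100 matrix, because each thread handles several entries. The kernel name copyStrided and the M_temp and N parameters are assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread strides over the matrix, so any
// grid/block configuration covers all N x N entries.
__global__ void copyStrided(const double *M, double *M_temp, int N)
{
    const int numThreadsx = blockDim.x * gridDim.x;
    const int threadIDx   = blockIdx.x * blockDim.x + threadIdx.x;

    const int numThreadsy = blockDim.y * gridDim.y;
    const int threadIDy   = blockIdx.y * blockDim.y + threadIdx.y;

    for (int i = threadIDx; i < N; i += numThreadsx)
        for (int j = threadIDy; j < N; j += numThreadsy)
            M_temp[i * N + j] = M[i * N + j];
}

int main()
{
    const int N = 100;
    const size_t bytes = size_t(N) * N * sizeof(double);

    std::vector<double> h_M(N * N), h_out(N * N);
    for (int k = 0; k < N * N; ++k) h_M[k] = k + 1;  // nonzero test data

    double *d_M = 0, *d_temp = 0;
    cudaMalloc(&d_M, bytes);
    cudaMalloc(&d_temp, bytes);
    cudaMemcpy(d_M, h_M.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_temp, 0, bytes);

    // The configuration from the question: far fewer threads than entries.
    dim3 grid(2, 1, 1);
    dim3 block(10, 10, 1);
    copyStrided<<<grid, block>>>(d_M, d_temp, N);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out.data(), d_temp, bytes, cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int k = 0; k < N * N; ++k)
        if (h_out[k] != h_M[k]) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_M);
    cudaFree(d_temp);
    return mismatches == 0 ? 0 : 1;
}
```

With this configuration the stride is 20 in i and 10 in j, so each thread copies 5 x 10 = 50 entries. The loop bounds also make a separate bounds check unnecessary.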
