
2 variables into CUDA kernel

I want to solve a matrix using a CUDA kernel. The matrix uses i and j indexes like this:

M[i*N+j]

Assuming that I want to copy elements from M to another variable such as M_temp, I should do this:

M_temp[i*N+j] = M[i*N+j];

Well, I have the following declaration for using blocks and threads:

dim3 grid = dim3(2, 1, 1);
dim3 block = dim3(10, 10, 1);

I don't know if I am wrong, but according to the prior declaration, I could have 100 threads per block, and 200 threads in total.
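To double-check that arithmetic, here is a minimal host-side sketch (assuming only the grid and block declaration above); the dim3 components give the counts directly:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    dim3 grid  = dim3(2, 1, 1);    // 2 blocks along x
    dim3 block = dim3(10, 10, 1);  // 10 x 10 x 1 = 100 threads per block

    int threadsPerBlock = block.x * block.y * block.z;                // 100
    int totalThreads    = grid.x * grid.y * grid.z * threadsPerBlock; // 2 * 100 = 200

    printf("threads per block: %d, total threads: %d\n", threadsPerBlock, totalThreads);
    return 0;
}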

Inside the kernel I want to use the indexes:

__global__ void kernel(double *M)
{
    int i = ???;
    int j = ???;

}

I would like to use at least 100 threads per block, so that the maximum matrix size would be:

M[100x100]

But I want to use

1 block for variable i

and

1 different block for variable j.

I have been thinking about using:

__global__ void kernel(double *M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.x * blockIdx.x;

    __syncthreads();
    M_temp[i*N+j] = M[i*N+j];        
}

But this way uses all the blocks in x. I don't know, I'm confused. Please help me.

By the way, my GPU is a GeForce 610M.

Thank you.

If you want to perform some operations on your 100x100 matrix, and you want each thread to deal with one entry, you need 10000 threads. Since there is a limit on the number of threads in one block (typically 1024, i.e. 32x32), you need to increase the grid size:

 dim3 grid = dim3(10, 10, 1);
 dim3 block = dim3(10, 10, 1);

Now inside your kernel you simply create i and j:

 i=blockIdx.x * blockDim.x + threadIdx.x;
 j=blockIdx.y * blockDim.y + threadIdx.y;
 M[i*N+j]=...

With these grid and block sizes, blockDim.x = blockDim.y = 10 and blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y each range from 0 to 9, so your i and j range from 0 to 99.
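Putting that together, a minimal sketch of a copy kernel and its launch could look like this (the kernel name copyKernel, the M_temp parameter, the fixed N = 100, and the device pointers d_M / d_M_temp are assumptions for illustration, not code from the answer):

#define N 100

// Each thread copies exactly one entry; block (blockIdx.x, blockIdx.y) covers a 10x10 tile.
__global__ void copyKernel(const double *M, double *M_temp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < N && j < N)                     // guard in case the grid overshoots the matrix
        M_temp[i * N + j] = M[i * N + j];
}

// Host-side launch with the sizes discussed above (d_M and d_M_temp are
// device buffers of N*N doubles allocated with cudaMalloc beforehand):
//     dim3 grid  = dim3(10, 10, 1);
//     dim3 block = dim3(10, 10, 1);
//     copyKernel<<<grid, block>>>(d_M, d_M_temp);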

In general I usually follow the CUDA samples. For arbitrary grid size and block size, your kernel should look like this:

const int numThreadsx = blockDim.x * gridDim.x;
const int threadIDx = blockIdx.x * blockDim.x + threadIdx.x;

const int numThreadsy = blockDim.y * gridDim.y;
const int threadIDy = blockIdx.y * blockDim.y + threadIdx.y;


for (int i = threadIDx; i < N; i += numThreadsx)
    for (int j = threadIDy; j < N; j += numThreadsy)
        M[i*N+j]=...

Note that you need to pass the variable N into your kernel.
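As a complete sketch of that pattern (the kernel name gridStrideCopy and the M_temp parameter are assumptions; N is now a kernel argument):

__global__ void gridStrideCopy(const double *M, double *M_temp, int N)
{
    // Total number of threads launched along each dimension of the whole grid.
    const int numThreadsx = blockDim.x * gridDim.x;
    const int threadIDx   = blockIdx.x * blockDim.x + threadIdx.x;

    const int numThreadsy = blockDim.y * gridDim.y;
    const int threadIDy   = blockIdx.y * blockDim.y + threadIdx.y;

    // Each thread strides through the matrix, so any grid/block size covers all N*N entries.
    for (int i = threadIDx; i < N; i += numThreadsx)
        for (int j = threadIDy; j < N; j += numThreadsy)
            M_temp[i * N + j] = M[i * N + j];
}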
