
How is a CUDA kernel launched?

I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel will be launched by all the threads and what the flow inside CUDA will be. I mean, in what fashion will every thread execute each element of the matrices?

I know this is a very basic concept, but I don't understand it. I am confused about the flow.

You launch a grid of blocks.

Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor determines the amount of available shared memory).

Blocks are further split into warps. On a Fermi GPU a warp is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting from a loop earlier than neighbors within the same warp, or by not taking an if that the others did). On a Fermi GPU, at most two warps run on one multiprocessor at a time.

Whenever there is latency (that is, execution stalls waiting for a memory access or a data dependency to complete), another warp is run. The number of warps that fit onto one multiprocessor - from the same or different blocks - is determined by the number of registers used by each thread and the amount of shared memory used by the block(s).

This scheduling happens transparently. That is, you do not have to think about it too much. However, you might want to use the predefined integer vectors threadIdx (where is my thread within the block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up work (read: input and output) among the threads. You might also want to read up on how to effectively access the different types of memory (so multiple threads can be serviced within a single transaction) - but that's leading off topic.
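As a minimal sketch (not from the original answer), this is how those four built-in vectors are typically combined into one global index; the grid-stride loop is just one common pattern so the kernel covers any n:

__global__ void scale(float *data, float factor, int n)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;   // where am I in the whole grid?
    int stride = gridDim.x  * blockDim.x;                 // how many threads were launched in total?

    for (int i = idx; i < n; i += stride)                 // each thread handles every stride-th element
        data[i] *= factor;
}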

NSight provides a graphical debugger that gives you a good idea of what's happening on the device once you get through the jargon jungle. The same goes for its profiler regarding those things you won't see in the debugger (e.g. stall reasons or memory pressure).

You can synchronize all threads within the grid (all there are) by another kernel launch. For non-overlapping, sequential kernel execution no further synchronization is needed.
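For example (a sketch with hypothetical kernels step1 and step2): two launches issued back to back on the same default stream run one after the other, so the second kernel sees everything the first one wrote to global memory.

step1<<<grid, block>>>(d_data);     // hypothetical first kernel
step2<<<grid, block>>>(d_data);     // starts only after every thread of step1 has finished
cudaDeviceSynchronize();            // make the host wait for both launches to complete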

The threads within one grid (or one kernel run - however you want to call it) can communicate via global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
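A small sketch of that (the 256-bin histogram layout is made up for the illustration): threads from any block accumulate into the same global array with atomicAdd, so concurrent updates to one bin cannot be lost.

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&bins[data[idx]], 1u);   // safe even when many threads hit the same bin
}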

You can synchronize all threads within one block with the intrinsic instruction __syncthreads() (all threads will be active afterwards - although, as always, at most two warps can run at a time on a Fermi GPU). The threads within one block can communicate via shared or global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
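A sketch of block-level cooperation (assuming a power-of-two block size and one output slot per block): the threads of a block reduce their values in shared memory, with __syncthreads() as the barrier between steps.

__global__ void block_sum(const float *in, float *partial, int n)
{
    extern __shared__ float cache[];                 // one float per thread, sized at launch
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                                 // all writes to shared memory are now visible

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();                             // finish this step before starting the next
    }

    if (tid == 0)
        partial[blockIdx.x] = cache[0];              // one result per block
}

It would be launched with the shared-memory size as the third configuration parameter, e.g. block_sum<<<grid, block, block.x * sizeof(float)>>>(in, partial, n).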

As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or via "lane swapping" on upcoming hardware with compute capability 3). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (load or store access happening sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never rely on registers, whose state cannot be seen by other threads.
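The classic Fermi-era use of that is a warp-synchronous reduction like the sketch below: the last 32 threads keep adding without __syncthreads(), relying on the volatile qualifier for visibility. (On current GPUs, __syncwarp() or the warp shuffle intrinsics are the recommended way to do this instead.)

__device__ void warp_reduce(volatile float *sdata, int tid)   // tid < 32, sdata has at least 64 entries
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}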

Further, there are warp-wide vote functions that can help you make branch decisions or compute integer (prefix) sums.
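For example (a sketch, assuming the block size is a multiple of 32 so every warp is full; in current toolkits the vote intrinsics carry a _sync suffix and a lane mask, while the Fermi-era versions this answer refers to were plain __ballot()/__any()/__all()):

__global__ void vote_example(const float *x, int *warp_counts, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;

    int pred = (tid < n) && (x[tid] > 0.0f);            // per-thread predicate
    unsigned mask = __ballot_sync(0xffffffff, pred);    // one bit per lane of the warp

    if (lane == 0)
        warp_counts[tid / 32] = __popc(mask);           // how many lanes voted "yes"
}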

OK, that's basically it. Hope that helps. Had a good flow writing :-).

Let's take the example of adding two 4*4 matrices: you have two matrices A and B, each of dimension 4*4.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N);   // kernel is defined below main()

int main()
{
 float *a, *b, *c;           //To store your matrices A & B in RAM. Result will be stored in matrix C
 float *ad, *bd, *cd;        // To store the matrices in the GPU's RAM.
 int N = 4;                  //No of rows and columns.
 int i, j;

 size_t size = sizeof(float) * N * N;

 a = (float*)malloc(size);   //Allocate space in RAM for matrix A
 b = (float*)malloc(size);   //Allocate space in RAM for matrix B
 c = (float*)malloc(size);   //Allocate space in RAM for the result matrix C

//allocate memory on device
  cudaMalloc(&ad,size);
  cudaMalloc(&bd,size);
  cudaMalloc(&cd,size);

//initialize host memory with its own indices
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
        {
            a[i * N + j] = (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }

//copy data from host memory to device memory
     cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
     cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

//calculate execution configuration 
   dim3 grid (1, 1, 1); 
   dim3 block (16, 1, 1);

//each block contains N * N threads, each thread calculates 1 data element

    add_matrices<<<grid, block>>>(ad, bd, cd, N);

   cudaMemcpy(c,cd,size,cudaMemcpyDeviceToHost);  
   printf("Matrix A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",a[i*N+j]);
        printf("\n");
    }

   printf("\nMatrix B was---\n");
   for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",b[i*N+j]);
        printf("\n");
    }

    printf("\nAddition of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",c[i*N+j]);   //if correctly evaluated, all values will be 0
        printf("\n");
    }



    //deallocate host and device memories
    cudaFree(ad); 
    cudaFree(bd); 
    cudaFree (cd);

    free(a);
    free(b);
    free(c);

    return 0;
}

/////Kernel Part

__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
  // one thread per element: linear index into the flattened N*N array
  int index = blockIdx.x * blockDim.x + threadIdx.x;

  cd[index] = ad[index] + bd[index];
}

Let's walk through that example of adding two 4*4 matrices A and B step by step.

First of all you have to decide on your thread configuration. You are supposed to launch a kernel function, which will perform the parallel computation of your matrix addition and which will get executed on your GPU device.

Now, one kernel launch starts one grid. A grid can have at most 65,535 blocks per dimension, and those blocks can be arranged in up to three dimensions (65535 * 65535 * 65535).

Every block in the grid can have at most 1024 threads. Those threads can also be arranged in up to three dimensions, with per-dimension limits of (1024 * 1024 * 64).
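The exact limits depend on the card, and they can be queried at runtime instead of being hard-coded; a small sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of device 0

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}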

Now our problem is the addition of two 4 * 4 matrices:

A | 1  2  3  4 |        B | 1  2  3  4 |      C| 1  2  3  4 |
  | 5  6  7  8 |   +      | 5  6  7  8 |   =   | 5  6  7  8 | 
  | 9 10 11 12 |          | 9 10 11 12 |       | 9 10 11 12 |  
  | 13 14 15 16|          | 13 14 15 16|       | 13 14 15 16|

We need 16 threads to perform the computation.

i.e. A(1,1) + B (1,1) = C(1,1)
     A(1,2) + B (1,2) = C(1,2) 
     .        .          .
     .        .          . 
     A(4,4) + B (4,4) = C(4,4) 

All these threads will get executed simultaneously, so we need a block with 16 threads. For our convenience we will arrange the threads in a (16 * 1 * 1) layout within one block. Since the number of threads is 16, we need only one block to hold those 16 threads.

So the grid configuration will be dim3 grid(1,1,1), i.e. the grid will have only one block, and the block configuration will be dim3 block(16,1,1), i.e. the block will have 16 threads arranged along one dimension.

The program above should give you a clear idea about its execution. Understanding the indexing part (i.e. threadIdx, blockDim, blockIdx) is the important part. You need to go through the CUDA literature. Once you have a clear idea about indexing, you will have won half the battle! So spend some time with CUDA books, different algorithms and, of course, paper and pencil!
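Once the matrix no longer fits into a single block, the same indexing idea extends to two dimensions; here is a sketch (the 16x16 block shape is just an assumption):

__global__ void add_matrices_2d(const float *ad, const float *bd, float *cd, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < N && col < N)                          // guard: the grid may overshoot N
        cd[row * N + col] = ad[row * N + col] + bd[row * N + col];
}

// host-side execution configuration, e.g. for N = 1000:
//   dim3 block(16, 16);                             // 256 threads per block
//   dim3 grid((N + 15) / 16, (N + 15) / 16);        // enough blocks to cover the whole matrix
//   add_matrices_2d<<<grid, block>>>(ad, bd, cd, N);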

Try 'cuda-gdb', the CUDA debugger.
