
CUDA threads and blocks

I posted this on the NVIDIA forums; I thought I would get a few more eyes to help.

I'm having trouble trying to expand my code to handle multiple cases. I have been developing with the most common case in mind; now it's time for testing, and I need to ensure that it all works for the different cases. Currently my kernel is executed within a loop (there are reasons why we aren't doing one kernel call to do the whole thing) to calculate a value across the row of a matrix. The most common case is 512 columns by 512 rows. I need to consider matrices of size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a rather simple kernel call:

launchKernel<<<1,512>>>(................)

This kernel works fine for the common 512 x 512 and 512 x 1024 (columns, rows respectively) cases, but not for the 1024 x 512 case. That case requires 1024 threads to execute. In my naivety I have been trying different versions of the simple kernel call to launch 1024 threads:

launchKernel<<<2,512>>>(................)  // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???

I believe my problem has something to do with my lack of understanding of threads and blocks.

Here is the output of deviceQuery; as you can see, I can have a max of 1024 threads:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla C2050"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2688 MBytes (2818572288 bytes)
  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores
  GPU Clock Speed:                               1.15 GHz
  Memory Clock rate:                             1500.00 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                Yes
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           40 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro 600"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 2) Multiprocessors x (48) CUDA Cores/MP:     96 CUDA Cores
  GPU Clock Speed:                               1.28 GHz
  Memory Clock rate:                             800.00 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 131072 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           15 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600

I am using only the Tesla C2050 device. Here is a stripped-out version of my kernel, so you have an idea of what it is doing:

#define twoPi               6.283185307179586
#define speed_of_light      3.0E8
#define MaxSize             999

__global__ void calcRx4CPP4
(  
        const float *array1,  
        const double *array2,  
        const float scalar1,  
        const float scalar2,  
        const float scalar3,  
        const float scalar4,  
        const float scalar5,  
        const float scalar6,  
        const int scalar7,  
        const int scalar8,    
        float *outputArray1,
        float *outputArray2)  
{  

    float scalar9;  
    int idx;  
    double scalar10;
    double scalar11;  
    float sumReal, sumImag;  
    float real, imag;  

    float coeff1, coeff2, coeff3, coeff4;  

    sumReal = 0.0;  
    sumImag = 0.0;  

    // kk loop 1 .. 512 (scalar7)  
    idx = (blockIdx.x * blockDim.x) + threadIdx.x;  

    /* Declare the shared memory parameters */
    __shared__ float SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    /* populate the arrays on shared memory */
    SharedArray1[idx] = array1[idx];  // first 512 elements
    SharedArray2[idx] = array2[idx];
    if (idx+blockDim.x < MaxSize){
        SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
        SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
    }            
    __syncthreads();

    // input scalars used here.
    scalar10 = ...;
    scalar11 = ...;

    for (int kk = 0; kk < scalar8; kk++)
    {  
        /* some calculations */
        // SharedArray1, SharedArray2 and scalar9 used here
        sumReal = ...;
        sumImag = ...;
    }  


    /* calculation of the exponential of a complex number */
    real = ...;
    imag = ...;
    coeff1 = (sumReal * real);  
    coeff2 = (sumReal * imag);  
    coeff3 = (sumImag * real);  
    coeff4 = (sumImag * imag);  

    outputArray1[idx] = (coeff1 - coeff4);  
    outputArray2[idx] = (coeff2 + coeff3);  


}  

Because my max threads per block is 1024, I thought I would be able to continue to use the simple kernel launch. Am I wrong?

How do I successfully launch each kernel with 1024 threads?

You don't want to vary the number of threads per block. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator. After you have that number, you simply launch the number of blocks required to get the total number of threads that you need. If the number of threads you need for a given case is not always a multiple of the threads per block, you add code at the top of your kernel to abort the unneeded threads (if (...) return;). Then you pass in the dimensions of your matrix, either as extra parameters to the kernel or by using the x and y grid dimensions, depending on which information is required in your kernel (I haven't studied it).
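A minimal sketch of that pattern, assuming a 1-D launch; the kernel body, d_out, and totalThreads here are placeholders, not the poster's actual code:

```cuda
#include <cstdio>

// Placeholder kernel: the guard at the top aborts the threads past the
// end of the data when totalThreads is not a multiple of blockDim.x.
__global__ void launchKernel(float *out, int totalThreads)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= totalThreads) return;   // abort unneeded threads
    out[idx] = (float)idx;             // stand-in for the real work
}

int main()
{
    const int totalThreads    = 1024;  // e.g. the 1024-column case
    const int threadsPerBlock = 512;   // number from the Occupancy Calculator
    // Round up so a partial final block is still launched.
    const int blocks = (totalThreads + threadsPerBlock - 1) / threadsPerBlock;

    float *d_out;
    cudaMalloc(&d_out, totalThreads * sizeof(float));
    launchKernel<<<blocks, threadsPerBlock>>>(d_out, totalThreads);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

With this structure the launch configuration scales with the matrix width (2 x 512 for 1024 columns) without touching the kernel itself.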

My guess is that the reason you're having trouble with 1024 threads is that, even though your GPU supports that many threads in a block, there is another limiting factor on the number of threads you can have in each block, based on resource usage in your kernel. The limiting factor can be shared memory or register usage. The Occupancy Calculator will tell you which, though that information is only important if you want to optimize your kernel.
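One way to see whether the device actually accepted a configuration is to check the launch error code right after the launch; a sketch (the kernel is a trivial placeholder):

```cuda
#include <cstdio>

// Trivial placeholder kernel.
__global__ void launchKernel(float *out) { out[threadIdx.x] = 0.0f; }

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 1024 * sizeof(float));

    // If 1024 threads per block exceeds a per-block resource limit
    // (registers, shared memory), the launch fails silently unless
    // you ask for the error code.
    launchKernel<<<1, 1024>>>(d_out);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```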

If you use one block with 1024 threads you will have problems, since MaxSize is only 999, resulting in wrong data.

Let's simulate it for the last thread, #1023:

__shared__ float SharedArray1[999];     
__shared__ double SharedArray2[999];

/* populate the arrays on shared memory */     
SharedArray1[1023] = array1[1023]; 
SharedArray2[1023] = array2[1023];     

if (2047 < MaxSize)
{         
    SharedArray1[2047] = array1[2047];         
    SharedArray2[2047] = array2[2047];     
}                 
__syncthreads(); 

If you now use all of those elements in your calculation, this will not work. (Your calculation code is not shown, so this is an assumption.)
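One hedged way to avoid the overflow (a sketch, not the poster's actual kernel): index the shared arrays with threadIdx.x, which is per-block, and size them for the largest block you will launch:

```cuda
#define MAX_BLOCK 1024   // assumed upper bound on blockDim.x

__global__ void stageKernel(const float *array1, float *out, int n)
{
    // Each block stages its own slice; threadIdx.x never exceeds
    // MAX_BLOCK - 1, regardless of which block this is.
    __shared__ float SharedArray1[MAX_BLOCK];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    if (idx < n)
        SharedArray1[tid] = array1[idx];
    __syncthreads();     // every thread reaches this, guarded or not

    if (idx < n)
        out[idx] = SharedArray1[tid];   // stand-in for the real calculation
}
```

Note the guard wraps only the memory accesses, not an early return, so every thread still reaches the __syncthreads() barrier.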
