
CUDA - no blocks, just threads for undefined dimensions

I have some matrices with unknown sizes varying from 10 to 20,000 in both dimensions.

I designed a CUDA kernel with (x; y) blocks and (x; y) threads.

Since the matrices' width/height aren't multiples of my block dimensions, it was a terrible pain to get things to work, and the code is becoming more and more complicated in order to get coalesced memory reads.

Besides all of that, the kernel is growing in size, using more and more registers to check for correctness... so I think this is not the approach I should adopt.

My question is: what if I totally eliminate blocks and just create a grid of x; y threads? Will an SM unit have problems without many blocks?

Can I eliminate blocks and use a large number of threads, or is the block subdivision necessary?

You can't really just make a "grid of threads", since you have to organize threads into blocks, and you can have a maximum of 512 threads per block. However, you could effectively do this by using 1 thread per block, which will result in an X by Y grid of 1x1 blocks. Unfortunately, this will result in pretty terrible performance due to several factors:

  1. According to the CUDA Programming Guide, an SM can handle a maximum of 8 blocks at any time. This will limit you to 8 threads per SM, which isn't enough to fill even a single warp. If you have, say, 48 SMs, you will only be able to handle 384 threads at any given time.

  2. With only 8 threads available on an SM, there will be too few warps to hide memory latencies. The GPU will spend most of its time waiting for memory accesses to complete rather than doing any computation.

  3. You will be unable to coalesce memory reads and writes, resulting in poor memory bandwidth usage.

  4. You will be effectively unable to leverage shared memory, as this is a shared resource between threads in a block.

While having to ensure correctness for threads in a block is annoying, your performance will be vastly better than with your "grid of threads" idea.

Here's the code I use to divide a given task requiring num_threads into blocks and a grid. Yes, you might end up launching too many blocks (though only very few extra), and you will probably end up with more actual threads than required, but it's easy and efficient this way. See the second code example below for my simple in-kernel boundary check.

PS: I always use block_size == 128 because it has been a good tradeoff between multiprocessor occupancy, register usage, shared memory requirements and coalesced access for all of my kernels.

Code to calculate a good grid size (host):

#include <assert.h>

#define GRID_SIZE 65535

//ceil division (rounding up), defined below
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator);

//calculate grid size (store result in grid/block)
void kernelUtilCalcGridSize(unsigned int num_threads, unsigned int block_size, dim3* grid, dim3* block) {


    //block
    block->x = block_size;
    block->y = 1;
    block->z = 1;


    //number of blocks
    unsigned int num_blocks = kernelUtilCeilDiv(num_threads, block_size);
    unsigned int total_threads = num_blocks * block_size;
    assert(total_threads >= num_threads);

    //calculate grid size
    unsigned int gy = kernelUtilCeilDiv(num_blocks, GRID_SIZE);
    unsigned int gx = kernelUtilCeilDiv(num_blocks, gy);
    unsigned int total_blocks = gx * gy;
    assert(total_blocks >= num_blocks);

    //grid
    grid->x = gx;
    grid->y = gy;
    grid->z = 1;
}

//ceil division (rounding up)
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator) {
    return (numerator + denominator - 1) / denominator;
}

Code to calculate the unique thread id and check boundaries (device):

//some kernel
__global__ void kernelFoo(unsigned int num_threads, ...) {


    //calculate unique id
    const unsigned int thread_id = threadIdx.x;
    const unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
    const unsigned int unique_id = thread_id + block_id * blockDim.x;


    //check range
    if (unique_id >= num_threads) return;

    //do the actual work
    ...
}

I don't think that's a lot of effort/registers/lines-of-code to check for correctness.
