What is the best way to launch a GPU kernel to do calculation on a 3D data set?

I am using CUDA to do calculations on a potentially large 3D data set. I think it is best to see a short code snippet first:

void launch_kernel(/*arguments . . . */){
    int bx = xend-xstart, by = yend-ystart, bz = zend-zstart;

    dim3 blocks(/*dimensions*/);
    dim3 threads(/*dimensions*/);
    kernel<<<blocks, threads>>>();
}

I have a 3D set of cells and I need to launch a kernel to compute each one. The problem is that the input size may exceed the capabilities of the GPU, specifically the threads. So code like this:

void launch_kernel(/*arguments . . . */){
    int bx = xend-xstart, by = yend-ystart, bz = zend-zstart;

    dim3 blocks(bx,by,1);
    dim3 threads(bz);
    kernel<<<blocks, threads>>>();
}

... doesn't work well. Because what if the dimensions are 1000x1000x1000? I can't launch 1000 threads per block. Or even better, what if the dimensions are 5x5x1000? Now I am barely launching any blocks, but the kernel would need to be launched as 5x5x512 because of the hardware, and each thread would do 2 calculations. I also can't just mash up all my dimensions, putting some of the z's in the blocks and some in the threads, because I need to be able to identify the dimensions in the kernel. Currently:

__global__ void kernel(/*arguments*/){
    int x = xstart + blockIdx.x;
    int y = ystart + blockIdx.y;
    int z = zstart + threadIdx.x;
    if(x < xend && y < yend && z < zend){
        //calculate
    }
}

I need a solid, efficient way to figure out these variables:

the block x dimension, the block y dimension, the thread x (and y? and z?) dimensions, the x,y,z coordinates once I am in the kernel (through blockIdx and threadIdx), and, if the input exceeds the hardware, the size of the "step" I take for each dimension in a for loop inside my kernel calculation.

If you have any questions, please ask. This is a difficult question, and it has been troubling me (especially since the number of blocks/threads I launch is a major component of performance). This code needs to make these decisions automatically for different data sets, and I am not sure how to do that efficiently. Thank you in advance.

I think you are vastly overcomplicating things here. The basic problem seems to be that you need to run a kernel on a 1000 x 1000 x 1000 computational domain, so you require 1000000000 threads, which is well within the capabilities of all CUDA compatible hardware. So just use a standard 2D CUDA execution grid with at least the number of threads needed to do the computation (if you don't understand how to do that, leave a comment and I will add it to the answer) and then inside your kernel call a little setup function, something like this:

__device__ dim3 thread3d(const int dimx, const int dimxy)
{
    // The dimensions of the logical computational domain are (dimx,dimy,dimz)
    // and dimxy = dimx * dimy
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    int tidy = threadIdx.y + blockIdx.y * blockDim.y;
    // Flatten the 2D grid position into one linear index
    // (the grid is gridDim.x * blockDim.x threads wide)
    int tidxy = tidx + gridDim.x * blockDim.x * tidy;

    dim3 id3d;
    id3d.z = tidxy / dimxy;           // slice
    id3d.y = (tidxy % dimxy) / dimx;  // row within the slice
    id3d.x = tidxy % dimx;            // column within the row

    return id3d;
}

[disclaimer: written in browser, never compiled, never run, never tested. Use at own risk].
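One way to sanity-check the decomposition arithmetic is on the host, with a plain C++ mirror of the device function (the `Id3`/`decompose` names here are illustrative, not part of any API):

```cpp
#include <cassert>

// Host-side mirror of the thread3d-style decomposition: recover (x, y, z)
// from a linear index over a (dimx, dimy, dimz) domain, dimxy = dimx * dimy.
struct Id3 { int x, y, z; };

Id3 decompose(int tidxy, int dimx, int dimxy)
{
    Id3 id;
    id.z = tidxy / dimxy;            // slice
    id.y = (tidxy % dimxy) / dimx;   // row within the slice
    id.x = tidxy % dimx;             // column within the row
    return id;
}
```

Round-tripping every index of a small domain (e.g. 5 x 7 x 4) through `x + y*dimx + z*dimxy` and back confirms the formulas before they go anywhere near a GPU.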

This function will return "logical" thread coordinates in the 3D domain (dimx,dimy,dimz) from a CUDA 2D execution grid. Call it at the beginning of the kernel something like this:

__global__ void kernel(arglist, const int dimx, const int dimxy)
{
    dim3 tid = thread3d(dimx, dimxy);

    // tid.{xyz} now contain unique 3D coordinates on the (dimx,dimy,dimz) domain
    .....
}
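The host-side sizing of that 2D grid is just integer arithmetic, so it can be sketched (and checked) in plain C++. This is a minimal sketch, assuming a 2D grid of rectangular blocks and a cap on the grid x dimension (65535 on older hardware); `Grid2D`/`grid_for` are illustrative names:

```cpp
#include <cassert>
#include <algorithm>

// Choose a 2D grid of (bx x by)-thread blocks that covers at least
// `total` logical threads, capping the grid x dimension at maxGridX.
struct Grid2D { unsigned gx, gy; };

Grid2D grid_for(long long total, unsigned bx, unsigned by, unsigned maxGridX)
{
    long long perBlock = (long long)bx * by;
    long long blocksNeeded = (total + perBlock - 1) / perBlock;  // ceil divide
    unsigned gx = (unsigned)std::min<long long>(blocksNeeded, (long long)maxGridX);
    unsigned gy = (unsigned)((blocksNeeded + gx - 1) / gx);
    return {gx, gy};
}
```

For a 1000 x 1000 x 1000 domain with 16 x 16 blocks this launches slightly more than 10^9 threads; a bounds check against dimx*dimy*dimz inside the kernel should discard the excess.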

Note that there is a lot of integer computational overhead in getting that grid set up, so you might want to think about why you really need a 3D grid. You would be surprised at the number of times it isn't actually necessary, and much of that setup overhead can be avoided.

I would first use cudaGetDeviceProperties() to find the compute capability of your GPU so you know exactly how many threads per block are allowed for your GPU (if your program needs to be generalized such that it can run on any CUDA capable device).

Then, using that number, I would make a big nested if statement testing the dimensions of your input. If all of the dimensions are sufficiently small, you can have one block of (bx,by,bz) threads (unlikely). If that doesn't work, then find the largest dimension (or two dimensions) that can fit into one block and partition according to that. If that doesn't work, then you'll have to partition the smallest dimension such that some chunk of it fits into one block - such as (MAX_NUMBER_THREADS_PER_BLOCK,1,1) threads and (bx/MAX_NUMBER_THREADS_PER_BLOCK,by,bz) blocks, assuming bx < by < bz and bx > MAX_NUMBER_THREADS_PER_BLOCK.
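That nested-if strategy can be sketched in host C++ roughly as follows. This is illustrative only, not exhaustive: `MAX_THREADS_PER_BLOCK` would come from cudaGetDeviceProperties() in practice, the `Dim3`/`choose_launch` names are made up, and real code would also have to respect per-dimension block and grid limits:

```cpp
#include <cassert>

struct Dim3 { int x, y, z; };
const int MAX_THREADS_PER_BLOCK = 1024;  // query the device in real code

// Pick thread and block dimensions for a (bx, by, bz) domain.
void choose_launch(int bx, int by, int bz, Dim3& threads, Dim3& blocks)
{
    if ((long long)bx * by * bz <= MAX_THREADS_PER_BLOCK) {
        threads = {bx, by, bz};   // the whole domain fits in one block
        blocks  = {1, 1, 1};
    } else if (bx <= MAX_THREADS_PER_BLOCK) {
        threads = {bx, 1, 1};     // one x-row per block
        blocks  = {1, by, bz};
    } else {
        // chunk the x dimension across several blocks
        int chunks = (bx + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
        threads = {MAX_THREADS_PER_BLOCK, 1, 1};
        blocks  = {chunks, by, bz};
    }
}
```

For example, a 2048 x 10 x 10 domain falls into the last branch and gets (1024,1,1) threads with (2,10,10) blocks.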

You'll need different kernels for each case, which is a bit of a pain, but at the end of the day it's a doable job.
