

Does this CUDA scan kernel only work within a single block, or across multiple blocks?

I am doing a homework assignment and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell, this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?

//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    //we are creating an extra space for every element, so the size of the array needs to be 2*numElements
    //CUDA does not allow dynamically sized arrays in shared memory (without the extern qualifier),
    //so it is necessary to explicitly state the size of this memory allocation
    __shared__ unsigned int temp[1024 * 2];

    //instantiate variables
    int id = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[id] = (id > 0) ? in_data[id - 1] : 0;
    __syncthreads();


    //for each thread, loop through each of the steps
    //each step, move the next resultant addition to the thread's
    //corresponding space to be manipulated in the next iteration
    for (int offset = 1; offset < numElements; offset <<= 1)
    {
        //these switch so that data can move back and forth between the two halves of the buffer
        pout = 1 - pout;
        pin = 1 - pout;

        //IF: the number needs to be added to something, make sure to add those contents with the contents of 
        //the element offset number of elements away, then move it to its corresponding space
        //ELSE: the number only needs to be dropped down, simply move those contents to its corresponding space
        if (id >= offset)
        {
            //this element needs to be added to something; do that and copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
        }
        else
        {
            //this element just drops down, so copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id];
        }
        __syncthreads();
    }

    // write output
    out_data[id] = temp[pout * numElements + id];
}
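For reference, the kernel above assumes a single-block launch. A minimal host-side launch sketch (assuming device pointers d_in and d_out have already been allocated and filled) might look like:

```
// Hypothetical launch, matching the kernel's single-block assumption:
// one block of numElements threads. numElements must not exceed 1024,
// the hard-coded shared-memory capacity above.
scanKernel<<<1, numElements>>>(d_in, d_out, numElements);
cudaDeviceSynchronize();
```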

I would like to modify this kernel to work across multiple blocks. I want it to be as simple as changing the int id... to int id = threadIdx.x + blockDim.x * blockIdx.x. But the shared memory is only visible within a block, meaning the scan kernels in different blocks cannot share the proper information.

From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?

Not exactly. This kernel will work regardless of how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how id is calculated:

int id = threadIdx.x;

This id is independent of blockIdx, and is therefore identical across blocks, no matter how many there are.


If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxiliary array to store the per-block sums. Then, I would run a similar scan on that array, calculating per-block increments. Finally, I would run a last kernel to add those per-block increments to the block elements. If memory serves, there is a similar kernel in the CUDA SDK samples.
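That three-phase approach could be sketched as follows. This is an illustrative adaptation of the kernel above, not the SDK sample itself; it assumes numElements is an exact multiple of the block size, and the names scanBlocks, addIncrements, and blockSums are made up for this sketch:

```
// Phase 1: exclusive Hillis & Steele scan of one tile per block.
// Each block also records its tile total in blockSums[blockIdx.x].
__global__ void scanBlocks(const unsigned int *in, unsigned int *out,
                           unsigned int *blockSums)
{
    __shared__ unsigned int temp[1024 * 2];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    int pout = 0, pin = 1;

    // shift right within the tile for an exclusive scan
    temp[tid] = (tid > 0) ? in[gid - 1] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1)
    {
        pout = 1 - pout;
        pin = 1 - pout;
        if (tid >= offset)
            temp[pout * blockDim.x + tid] =
                temp[pin * blockDim.x + tid] + temp[pin * blockDim.x + tid - offset];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }

    out[gid] = temp[pout * blockDim.x + tid];

    // tile total = last exclusive-scan result + last input element
    if (tid == blockDim.x - 1)
        blockSums[blockIdx.x] = temp[pout * blockDim.x + tid] + in[gid];
}

// Phase 3: after blockSums has itself been exclusively scanned
// (phase 2, e.g. by running scanBlocks on it with a single block),
// add each block's increment to every element of its tile.
__global__ void addIncrements(unsigned int *out, const unsigned int *blockSums)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] += blockSums[blockIdx.x];
}
```

Phase 2 can reuse scanBlocks with a single block as long as the number of blocks does not itself exceed the block size; larger inputs would need to apply the scheme recursively.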

Since Kepler, the above code could be rewritten much more efficiently, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads and may improve performance. A combination of both of these improvements would allow you to get rid of shared memory entirely and work only with registers for maximal performance.
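A warp-level version of the inner loop might look like this sketch, using __shfl_up_sync (the sync-qualified successor of the original Kepler-era __shfl_up). Each warp scans 32 elements entirely in registers, with no shared memory and no __syncthreads:

```
// Inclusive scan of one value per lane across a full warp.
// Assumes all 32 lanes of the warp are active (mask 0xffffffff).
__device__ unsigned int warpInclusiveScan(unsigned int val)
{
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        // fetch the partial sum from the lane `offset` positions below
        unsigned int n = __shfl_up_sync(0xffffffff, val, offset);
        if (lane >= offset)
            val += n;
    }
    return val;
}
```

Combining warps into a full block-level (and then multi-block) scan still requires aggregating per-warp totals, along the same lines as the per-block sums described above.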
