
Does this CUDA scan kernel only work within a single block, or across multiple blocks?

I am doing a homework assignment and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell, this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?

//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    //we are creating an extra space for every element, so the size of the array needs to be 2*numElements
    //CUDA does not like dynamic arrays in shared memory, so it might be necessary to explicitly state
    //the size of this memory allocation
    __shared__ int temp[1024 * 2];

    //instantiate variables
    int id = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[id] = (id > 0) ? in_data[id - 1] : 0;
    __syncthreads();


    //for each thread, loop through each of the steps
    //at each step, move the resulting sum into the thread's
    //corresponding space to be manipulated in the next iteration
    for (int offset = 1; offset < numElements; offset <<= 1)
    {
        //these switch so that data can move back and forth between the extra spaces
        pout = 1 - pout;
        pin = 1 - pout;

        //IF: the element needs to be added to something, so add its contents to the contents of
        //the element offset positions away, then move the result to its corresponding space
        //ELSE: the element only needs to be dropped down, so simply move its contents to its corresponding space
        if (id >= offset)
        {
            //this element needs to be added to something; do that and copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
        }
        else
        {
            //this element just drops down, so copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id];
        }
        __syncthreads();
    }

    // write output
    out_data[id] = temp[pout * numElements + id];
}

I would like to modify this kernel to work across multiple blocks. I want it to be as simple as changing int id = threadIdx.x to int id = threadIdx.x + blockDim.x * blockIdx.x. But shared memory exists only within a block, which means the scan kernels in different blocks cannot share the information they need.

From what I can tell, this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?

Not exactly. This kernel will work regardless of how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how id is calculated:

int id = threadIdx.x;

This id is independent of blockIdx, and is therefore identical across blocks, no matter how many are launched.


If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxiliary array to store the per-block sums. Then, run a similar scan on that array to calculate the per-block increments. Finally, run a last kernel to add those per-block increments to each block's elements. If memory serves, there is a similar kernel in the CUDA SDK samples.
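
To make that concrete, here is a rough, untested sketch of the three phases. The kernel names (scanBlockKernel, addBlockSumsKernel), the device pointers in the host-side comments (d_in, d_out, d_block_sums, d_block_incr), and the assumptions that blockDim.x <= 1024 and numBlocks <= 1024 are all mine, not taken from the assignment or the SDK sample:

//Phase 1: each block runs the same Hillis & Steele scan on its own chunk
//and additionally writes the total of that chunk to block_sums[blockIdx.x]
__global__ void scanBlockKernel(const unsigned int *in_data, unsigned int *out_data,
                                unsigned int *block_sums, size_t numElements)
{
    //assumes blockDim.x <= 1024, as in the original kernel
    __shared__ unsigned int temp[1024 * 2];

    unsigned int tid = threadIdx.x;
    size_t gid = (size_t)blockIdx.x * blockDim.x + tid;
    int pout = 0, pin = 1;

    //exclusive scan within the block: shift right by one, first element is 0
    temp[tid] = (tid > 0 && gid <= numElements) ? in_data[gid - 1] : 0;
    __syncthreads();

    for (unsigned int offset = 1; offset < blockDim.x; offset <<= 1)
    {
        pout = 1 - pout;
        pin = 1 - pout;
        if (tid >= offset)
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid] + temp[pin * blockDim.x + tid - offset];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }

    if (gid < numElements)
        out_data[gid] = temp[pout * blockDim.x + tid];

    //last thread records the block total: exclusive prefix of the last element plus the element itself
    if (tid == blockDim.x - 1)
        block_sums[blockIdx.x] = temp[pout * blockDim.x + tid] + (gid < numElements ? in_data[gid] : 0);
}

//Phase 3: add the scanned block totals (per-block increments) back onto each block's elements
__global__ void addBlockSumsKernel(unsigned int *data, const unsigned int *block_increments,
                                   size_t numElements)
{
    size_t gid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < numElements)
        data[gid] += block_increments[blockIdx.x];
}

//Host-side sketch (error checking omitted), assuming numBlocks <= 1024 so the
//block sums fit into a single block for phase 2:
//    scanBlockKernel<<<numBlocks, blockSize>>>(d_in, d_out, d_block_sums, numElements);
//    scanKernel<<<1, numBlocks>>>(d_block_sums, d_block_incr, numBlocks);   //the original kernel
//    addBlockSumsKernel<<<numBlocks, blockSize>>>(d_out, d_block_incr, numElements);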

Since Kepler, the above code could be rewritten much more efficiently, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads calls and may improve performance. A combination of both improvements would allow you to get rid of shared memory entirely and work only with registers, for maximum performance.
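
For illustration, a minimal warp-level inclusive scan in registers might look like the sketch below. It uses __shfl_up_sync (the mask-taking variant that replaced __shfl_up in CUDA 9); the helper names warpInclusiveScan and warpScanKernel are mine, and the sketch assumes blockDim.x is a multiple of 32 so every warp is full. It scans only 32 elements per warp, so combining warp results into a block- or grid-wide scan still needs an aggregation step like the block sums above:

//Inclusive scan of one value per lane, entirely in registers.
//Assumes all 32 lanes of the warp are active (hence the full 0xffffffff mask).
__device__ unsigned int warpInclusiveScan(unsigned int value)
{
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        //fetch the running sum from the lane `offset` positions below
        unsigned int n = __shfl_up_sync(0xffffffffu, value, offset);
        if (lane >= offset)
            value += n;
    }
    return value;
}

//Minimal usage: each warp scans its own 32 consecutive elements.
//Out-of-range threads contribute 0 but still participate in the shuffle.
__global__ void warpScanKernel(const unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    size_t gid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int v = (gid < numElements) ? in_data[gid] : 0;
    v = warpInclusiveScan(v);
    if (gid < numElements)
        out_data[gid] = v;   //inclusive prefix within this warp's 32 elements only
}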
