Does this CUDA scan kernel only work within a single block, or across multiple blocks?
I am doing homework and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell, this kernel will only do a scan of the data if a single block is used (because of `int id = threadIdx.x`). Is this true?
//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    //we are creating an extra space for every numElement, so the size of the array needs to be 2*numElements
    //CUDA does not like dynamic arrays in shared memory, so it might be necessary to explicitly state
    //the size of this memory allocation
    __shared__ unsigned int temp[1024 * 2];

    //instantiate variables
    int id = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[id] = (id > 0) ? in_data[id - 1] : 0;
    __syncthreads();

    //for each thread, loop through each of the steps
    //each step, move the next resultant addition to the thread's
    //corresponding space to be manipulated in the next iteration
    for (int offset = 1; offset < numElements; offset <<= 1)
    {
        //these switch so that data can move back and forth between the extra spaces
        pout = 1 - pout;
        pin = 1 - pout;
        //IF: the number needs to be added to something, so add its contents to the contents of
        //the element `offset` elements away, then move the result to its corresponding space
        //ELSE: the number only needs to be dropped down; simply move its contents to its corresponding space
        if (id >= offset)
        {
            //this element needs to be added to something; do that and copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
        }
        else
        {
            //this element just drops down, so copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id];
        }
        __syncthreads();
    }
    // write output
    out_data[id] = temp[pout * numElements + id];
}
I would like to modify this kernel to work across multiple blocks. I would like it to be as simple as changing `int id...` to `int id = threadIdx.x + blockDim.x * blockIdx.x`. But shared memory exists only within a block, which means the scan kernel cannot share the proper information across blocks.
From what I can tell, this kernel will only do a scan of the data if a single block is used (because of `int id = threadIdx.x`). Is this true?
Not exactly. This kernel will work no matter how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how `id` is computed:

int id = threadIdx.x;

This `id` is independent of `blockIdx`, and is therefore identical across blocks, no matter how many there are.
If you want a multi-block version of this scan without changing too much code, introduce an auxiliary array to store the sum of each block. Then run a similar scan on that array to compute each block's increment. Finally, run a last kernel to add each block's increment to the block's elements. If memory serves, a similar kernel can be found in the CUDA SDK samples.
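A minimal sketch of that three-kernel structure follows. The kernel names (`scanBlocks`, `addBlockIncrements`) and the assumption that `numElements` is a multiple of the block size are mine, not from the SDK; phase 1 is essentially the kernel from the question, with the shared-memory stride changed from `numElements` to `blockDim.x` so each block scans only its own slice:

```cuda
#define BLOCK_SIZE 1024

// Phase 1: each block performs an exclusive scan of its own slice of the
// input and records its block total in blockSums[blockIdx.x].
// Assumes n is a multiple of blockDim.x (no partial-block guards).
__global__ void scanBlocks(const unsigned int *in, unsigned int *out,
                           unsigned int *blockSums, size_t n)
{
    __shared__ unsigned int temp[2 * BLOCK_SIZE];
    int tid = threadIdx.x;
    size_t gid = (size_t)blockIdx.x * blockDim.x + tid;
    int pout = 0, pin = 1;

    // Exclusive scan within the block: shift right by one, first element is 0.
    temp[tid] = (tid > 0) ? in[gid - 1] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1)
    {
        pout = 1 - pout;
        pin = 1 - pout;
        if (tid >= offset)
            temp[pout * blockDim.x + tid] =
                temp[pin * blockDim.x + tid] + temp[pin * blockDim.x + tid - offset];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }

    out[gid] = temp[pout * blockDim.x + tid];

    // Last thread: block total = exclusive prefix of last element + last element.
    if (tid == blockDim.x - 1)
        blockSums[blockIdx.x] = temp[pout * blockDim.x + tid] + in[gid];
}

// Phase 2 (not shown): scan blockSums itself, e.g. with the same kernel,
// provided the number of blocks fits into a single block. This turns
// per-block totals into per-block increments.

// Phase 3: add each block's increment to every element of that block.
__global__ void addBlockIncrements(unsigned int *out,
                                   const unsigned int *blockIncrements, size_t n)
{
    size_t gid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] += blockIncrements[blockIdx.x];
}
```

On the host, the three phases would be launched in sequence: `scanBlocks` over the whole input, a scan of `blockSums`, then `addBlockIncrements`, with a `cudaDeviceSynchronize()` or stream ordering guaranteeing each phase sees the previous phase's results.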
Since Kepler, the code above could be rewritten more efficiently, notably through the use of `__shfl`. Additionally, changing the algorithm to work per warp rather than per block would get rid of the `__syncthreads` and may improve performance. A combination of both of these improvements would allow you to get rid of shared memory and work only with registers for maximal performance.
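As a sketch of that idea, a warp-level inclusive scan with shuffle intrinsics might look like the following (using `__shfl_up_sync`, the CUDA 9+ spelling of `__shfl_up`; the function name is illustrative, and combining per-warp results into a block- or grid-wide scan still needs an auxiliary-array step like the one described above):

```cuda
// Warp-level inclusive scan: lane i ends up holding the sum of the values
// of lanes 0..i. Works entirely in registers; no shared memory and no
// __syncthreads needed, since a warp executes in lockstep (with respect to
// the full 0xffffffff mask used here).
__device__ unsigned int warpInclusiveScan(unsigned int val)
{
    int lane = threadIdx.x & (warpSize - 1);
    for (int offset = 1; offset < warpSize; offset <<= 1)
    {
        // Pull the partial sum from the lane `offset` positions below.
        unsigned int n = __shfl_up_sync(0xffffffff, val, offset);
        if (lane >= offset)
            val += n;
    }
    return val;
}
```

Each of the log2(warpSize) steps doubles the span of the partial sums, mirroring the Hillis & Steele structure of the original kernel but exchanging values through registers instead of shared memory.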