Does this CUDA scan kernel only work within a single block, or across multiple blocks?
I am doing homework and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell, this kernel will only scan the data if a single block is used (because of int id = threadIdx.x). Is this true?
//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    //we are creating an extra space for every numElement, so the size of the array needs to be 2*numElements
    //CUDA does not like dynamic arrays in shared memory, so it might be necessary to explicitly state
    //the size of this memory allocation
    __shared__ unsigned int temp[1024 * 2];

    //instantiate variables
    int id = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[id] = (id > 0) ? in_data[id - 1] : 0;
    __syncthreads();

    //for each thread, loop through each of the steps
    //each step, move the next resultant addition to the thread's
    //corresponding space to be manipulated in the next iteration
    for (int offset = 1; offset < numElements; offset <<= 1)
    {
        //these switch so that data can move back and forth between the extra spaces
        pout = 1 - pout;
        pin = 1 - pout;

        //IF: the number needs to be added to something, so add it to the contents of
        //the element `offset` positions away, then move it to its corresponding space
        //ELSE: the number only needs to be dropped down; simply move its contents to the corresponding space
        if (id >= offset)
        {
            //this element needs to be added to something; do that and copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
        }
        else
        {
            //this element just drops down, so copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id];
        }
        __syncthreads();
    }

    // write output
    out_data[id] = temp[pout * numElements + id];
}
I would like to modify this kernel to work across multiple blocks. I was hoping it would be as simple as changing int id... to int id = threadIdx.x + blockDim.x * blockIdx.x. But shared memory exists only within a block, which means a scan kernel operating across blocks cannot share the appropriate information.
From what I can tell, this kernel will only scan the data if a single block is used (because of int id = threadIdx.x). Is this true?
Not exactly. This kernel will work no matter how many blocks you launch, but because of the way id is computed, all blocks will fetch the same input and compute the same output:
int id = threadIdx.x;
This id is independent of blockIdx, and is therefore identical across blocks, no matter how many of them there are.
If you want to make a multi-block version of this scan without changing too much code, introduce an auxiliary array that stores the sum of each block. Then run a similar scan over that array to compute a per-block increment. Finally, run a last kernel that adds each block's increment to the block's elements. If memory serves, a similar kernel is among the CUDA SDK samples.
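A minimal CPU sketch of that three-phase structure, assuming a hypothetical multiBlockScan helper in which each loop stands in for one CUDA kernel launch (per-block exclusive scan, scan of the block sums, then adding the increments back):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Exclusive scan split into the three phases the multi-block CUDA version
// would use; each phase corresponds to one kernel launch.
std::vector<unsigned int> multiBlockScan(const std::vector<unsigned int> &in,
                                         std::size_t blockSize)
{
    std::size_t n = in.size();
    std::size_t numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<unsigned int> out(n);
    std::vector<unsigned int> blockSums(numBlocks);

    // Phase 1: exclusive scan of each block independently (what scanKernel does
    // per block), recording each block's total in the auxiliary array.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        unsigned int running = 0;
        for (std::size_t i = b * blockSize; i < std::min(n, (b + 1) * blockSize); ++i) {
            out[i] = running;
            running += in[i];
        }
        blockSums[b] = running;
    }

    // Phase 2: exclusive scan of the block sums yields each block's increment.
    unsigned int running = 0;
    for (std::size_t b = 0; b < numBlocks; ++b) {
        unsigned int sum = blockSums[b];
        blockSums[b] = running;
        running += sum;
    }

    // Phase 3: add each block's increment to every element of that block.
    for (std::size_t b = 0; b < numBlocks; ++b)
        for (std::size_t i = b * blockSize; i < std::min(n, (b + 1) * blockSize); ++i)
            out[i] += blockSums[b];

    return out;
}
```

On the GPU, phase 2 can itself be a single-block launch of the same scan kernel, as long as the number of blocks does not exceed one block's capacity.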
Since Kepler, the code above could be rewritten more efficiently, notably through the use of __shfl. Also, changing the algorithm to work per warp rather than per block would get rid of the __syncthreads and might improve performance. A combination of both improvements would let you get rid of shared memory and work only with registers for maximal performance.
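To illustrate the warp-level idea, here is a CPU sketch that mimics what a __shfl_up-based Hillis & Steele scan does within one 32-lane warp. The array lane stands in for each thread's register, and each outer-loop iteration corresponds to one shuffle step executed by all lanes in lockstep (the function name is hypothetical):

```cpp
#include <array>
#include <cstddef>

// Simulates an inclusive Hillis & Steele scan across a 32-lane warp, where
// each lane holds one value in a register and __shfl_up would fetch the value
// from the lane `offset` positions below.
std::array<unsigned int, 32> warpInclusiveScan(std::array<unsigned int, 32> lane)
{
    for (std::size_t offset = 1; offset < 32; offset <<= 1) {
        // Snapshot models the shuffle: every lane reads before any lane writes,
        // so no __syncthreads is needed within the warp.
        std::array<unsigned int, 32> shifted = lane;
        for (std::size_t id = 0; id < 32; ++id)
            if (id >= offset)
                lane[id] = shifted[id] + shifted[id - offset];
    }
    return lane;
}
```

With all lanes holding 1, lane i ends up holding i + 1, the inclusive prefix sum.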