
How to ensure no bank conflict with 3D shared data access in CUDA

I'm using CUDA to do some operations on several large, three-dimensional data sets of the same size, each consisting of floats.

Example below:

out[i+j+k]=in_A[i+j+k]*out[i+j+k]-in_B[i+j+k]*(in_C[i+j+k+1]-in_C[i+j+k]);

where numCols and numDepth refer to the y and z dimensions of the 3D sets (e.g. out, in_A, in_C, etc.), and:

int tx=blockIdx.x*blockDim.x + threadIdx.x; int i=tx*numCols*numDepth;

int ty=blockIdx.y*blockDim.y + threadIdx.y; int j=ty*numDepth;

int tz=blockIdx.z*blockDim.z + threadIdx.z; int k=tz;
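
For concreteness, a minimal sketch of this kernel without the shared-memory optimization might look like the following (the kernel name, parameter list, and bounds check are assumptions, not the original code; numRows is an assumed name for the x-dimension size):

__global__ void update(float* out, const float* in_A, const float* in_B,
                       const float* in_C, int numRows, int numCols, int numDepth)
{
    int tx = blockIdx.x*blockDim.x + threadIdx.x;   // row    (slowest-varying)
    int ty = blockIdx.y*blockDim.y + threadIdx.y;   // column
    int tz = blockIdx.z*blockDim.z + threadIdx.z;   // depth  (fastest-varying)

    // Illustrative bounds check; the +1 read below needs tz < numDepth - 1.
    if (tx >= numRows || ty >= numCols || tz >= numDepth - 1) return;

    int i = tx*numCols*numDepth;
    int j = ty*numDepth;
    int k = tz;

    out[i+j+k] = in_A[i+j+k]*out[i+j+k] - in_B[i+j+k]*(in_C[i+j+k+1] - in_C[i+j+k]);
}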

I've set up my kernel to be run on (11,14,4) blocks with (8,8,8) threads in each block. Set up this way, each thread corresponds to an element from each data set. To keep with the way I've set up my kernel, I am using 3D shared memory to reduce redundant global reads for in_C:

(8x8x9 instead of 8x8x8 so that the very edge in_C[i+j+k+1] can be loaded as well)

__shared__ float s_inC[8][8][9];

There are other Stack Exchange posts ( ex link ) and CUDA docs that deal with 2D shared memory and describe what can be done to ensure there are no bank conflicts, such as padding the column dimension by one and accessing the shared array using threadIdx.y then threadIdx.x, but I couldn't find one that describes what happens when one uses the 3D case.

I would imagine that the same rules apply from the 2D case to the 3D case, just by thinking of it as the 2D scheme applied Z times.

So by this thinking, accessing s_inC by:

s_inC[threadIdx.z][threadIdx.y][threadIdx.x]=in_C[i+j+k];

would prevent threads in half warps from accessing the same bank at the same time, and the shared memory should be declared as:

__shared__ float s_inC[8][8+1][9];

(leaving out syncs, boundary checks, inclusion of the very edge case in_C[i+j+k+1], etc.)

Are the previous two assumptions correct, and do they prevent bank conflicts?

I'm using Fermi hardware, so there are 32 32-bit shared memory banks.
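
For reference, on Fermi (compute capability 2.x) successive 32-bit words of shared memory are assigned to successive banks, so for a float array the bank of the n-th element is simply n % 32. A tiny helper expressing this rule (the name is illustrative) would be:

// Fermi (compute 2.x): 32 banks, each 4 bytes wide; successive 32-bit words
// map to successive banks, wrapping around every 32 words.
__host__ __device__ inline unsigned bankOf(unsigned wordIndex)
{
    return wordIndex % 32;
}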

I think that your conclusions about bank conflict prevention are questionable.

Assuming 8x8x8 thread blocks, an access like

__shared__ int shData[8][8][8];
...
shData[threadIdx.z][threadIdx.y][threadIdx.x] = ...

will give no bank conflict.

Opposite to this, with 8x8x8 thread blocks, an access like

__shared__ int shData[8][9][9];
...
shData[threadIdx.z][threadIdx.y][threadIdx.x] = ...

will give bank conflicts.
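
To see why, one can enumerate the banks touched by the first warp of an (8,8,8) block for both layouts. A small host-side check along these lines (an illustrative sketch, not from the original answer, assuming the Fermi rule that a word with linear index n sits in bank n % 32) prints the worst-case number of threads hitting a single bank:

#include <cstdio>

// Count, for the first warp of an (8,8,8) block, how many threads land in each
// 32-bit bank when the shared array has extents [8][dimY][dimX].
static void countConflicts(const char* name, int dimY, int dimX)
{
    int hits[32] = {0};
    // First warp: threadIdx.z == 0, threadIdx.y == 0..3, threadIdx.x == 0..7.
    for (int ty = 0; ty < 4; ++ty)
        for (int tx = 0; tx < 8; ++tx) {
            int linear = 0*dimY*dimX + ty*dimX + tx;   // shData[0][ty][tx]
            ++hits[linear % 32];                       // Fermi bank of that word
        }
    int worst = 0;
    for (int b = 0; b < 32; ++b) if (hits[b] > worst) worst = hits[b];
    printf("%s: worst case %d thread(s) per bank\n", name, worst);
}

int main()
{
    countConflicts("shData[8][8][8]", 8, 8);   // prints 1 -> conflict-free
    countConflicts("shData[8][9][9]", 9, 9);   // prints 2 -> two-way conflicts
    return 0;
}

For the [8][8][8] layout the warp's 32 linear indices are exactly 0..31, one per bank; with the [8][9][9] padding they run from 0 to 34 (skipping 8, 17 and 26), so indices 32, 33 and 34 wrap back onto banks 0, 1 and 2, each of which is then hit by two threads.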

This is illustrated by the figure below, in which the yellow cells indicate threads from the same warp. The figure reports, for each 32-bit bank, the thread accessing it as the tuple (threadIdx.x, threadIdx.y, threadIdx.z). The red cells are the padding cells you are using, which are not accessed by any thread.

[figure: shared memory bank assignment diagram]
