Efficient and correct way to load array with halo from global to shared memory
I am having trouble loading an array with a halo from global memory into shared memory.
Here is the problem: I have a large array (256,64) in global memory that I want to load into shared memory in tiles of size [16][16]. My computation also needs the neighbouring values (the halo).
I have ended up with heavily divergent code that is very slow and, in the end, does not work. Here is my approach; I would appreciate your advice.
real, shared :: s_data(-1:16,-1:16)
d_j = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
d_l = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
tIdx = threadIdx%x -1
tIdy = threadIdx%y -1
bx = blockIdx%x -1 ! zero-based block index in x, tested in the halo branches below
bdimx = 256/(blockDim%x) !16
bdimy = 64/(blockDim%y) !8
d_l1=d_l+1
if(d_l1==d_lmax) d_l1=0
d_l0 = d_l -1
if(d_l==0) d_l0=d_lmax-1
call syncthreads()
!load the main part
s_data(tIdx,tIdy) = g_data(d_j,d_l)
!Filling halos
if(tIdx ==0)then
if(bx == 0) then
s_data(tIdx-1,tIdy) =0
else
s_data(tIdx-1,tIdy) = g_data(d_j-1,d_l)
end if
end if
!Fill (16,tIdy)
if(tIdx == blockDim%x-1)then
if(bx == bdimx-1) then
s_data(tIdx+1,tIdy) = 0
else
s_data(tIdx+1,tIdy) = g_data(d_j+1,d_l)
end if
end if
!Fill (tIdx,-1)
if(tIdy == 0)then
s_data(tIdx,tIdy-1) = g_data(d_j,d_l0)
end if
!Fill (tIdx,N)
if(tIdy == blockDim%y -1)then
s_data(tIdx,tIdy+1) = g_data(d_j,d_l1)
end if
!Fill (-1,-1) and (-1, N)
if(tIdx==0)then
if(bx == 0)then
if(tIdy == 0) then
s_data(tIdx-1,tIdy-1) =0
end if
if(tIdy == blockDim%y-1) then
s_data(tIdx-1,tIdy+1) = 0
end if
else
if(tIdy == 0) then
s_data(tIdx-1,tIdy-1) =g_data(d_j-1,d_l0)
end if
if(tIdy == blockDim%y-1) then
s_data(tIdx-1,tIdy+1) = g_data(d_j-1,d_l1)
end if
end if
end if
!Fill (N, -1) & (N,N)
if(tIdx==blockDim%x-1)then
if(bx == bdimx-1)then
if(tIdy == 0) then
s_data(tIdx+1,tIdy-1) = 0
end if
if(tIdy == blockDim%y-1) then
s_data(tIdx+1,tIdy+1) = 0
end if
else
if(tIdy == 0) then
s_data(tIdx+1,tIdy-1) =g_data(d_j+1,d_l0)
end if
if(tIdy == blockDim%y-1) then
s_data(tIdx+1,tIdy+1) = g_data(d_j+1,d_l1)
end if
end if
end if
call syncthreads() ! wait until the whole tile, halo included, has been written
!do some computation with s_data
Box filters for image processing always involve halo data. The basic idea is that each output element/pixel is processed by one thread, and each thread loads more than one element/pixel into shared memory.
This white paper on image convolution using CUDA could be a good reference:
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
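One way to read "each thread loads more than one element" is a strided loop over the whole haloed tile, which removes the separate edge and corner branches. Below is a minimal host-side sketch in plain C++ (not CUDA Fortran) that only emulates the index arithmetic of one block. The sizes come from the question; the halo width of 1, the boundary rules (zero outside the array in x, periodic wrap in y, as the question's code appears to intend) and all names (`fetch`, `load_tile_thread`, `load_tile`) are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Sizes from the question; HALO = 1 is an assumption (one neighbour per side).
constexpr int NX = 256, NY = 64;   // global array extent, x is the fastest index
constexpr int BX = 16, BY = 16;    // threads per block in x and y
constexpr int HALO = 1;
constexpr int SX = BX + 2 * HALO;  // tile width including halo
constexpr int SY = BY + 2 * HALO;  // tile height including halo

// Value the tile should hold for global coordinates (gx, gy):
// zero outside the array in x, periodic wrap-around in y.
float fetch(const std::vector<float>& g, int gx, int gy) {
    if (gx < 0 || gx >= NX) return 0.0f;
    gy = (gy % NY + NY) % NY;
    return g[gy * NX + gx];
}

// Work done by one thread (tx, ty) of block (bx, by). On the GPU this would be
// the start of the kernel body, with `tile` in shared memory and a barrier
// after the loop.
void load_tile_thread(const std::vector<float>& g, std::vector<float>& tile,
                      int bx, int by, int tx, int ty) {
    const int tid = ty * BX + tx;            // linear thread index in the block
    for (int i = tid; i < SX * SY; i += BX * BY) {
        const int sx = i % SX, sy = i / SX;  // position inside the haloed tile
        const int gx = bx * BX + sx - HALO;  // matching global x
        const int gy = by * BY + sy - HALO;  // matching global y
        tile[sy * SX + sx] = fetch(g, gx, gy);
    }
}

// Host-side stand-in for one block: run every thread of the block in turn.
std::vector<float> load_tile(const std::vector<float>& g, int bx, int by) {
    std::vector<float> tile(SX * SY, -1.0f);  // -1 marks "not yet written"
    for (int ty = 0; ty < BY; ++ty)
        for (int tx = 0; tx < BX; ++tx)
            load_tile_thread(g, tile, bx, by, tx, ty);
    return tile;
}
```

With 256 threads and an 18 x 18 tile, every thread executes the loop body once and the first 68 threads execute it a second time, so the only divergence is in that last pass and in the boundary test. In a real kernel a barrier (`call syncthreads()` in CUDA Fortran) would still be needed after the loop and before any thread reads the tile.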