
How to load data in global memory into shared memory SAFELY in CUDA?

My kernel:

__global__ void myKernel(float * devData, float * devVec, float * devStrFac,
                         int Natom, int vecNo) {
    extern __shared__ float sdata[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float qx = devVec[3*idx];
    float qy = devVec[3*idx+1];
    float qz = devVec[3*idx+2];
    __syncthreads(); //sync_1

    float c = 0.0, s = 0.0;
    for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
        float rtx = devData[3*(iatom + threadIdx.x)];   //tag_0
        float rty = devData[3*(iatom + threadIdx.x)+1];
        float rtz = devData[3*(iatom + threadIdx.x)+2];
        __syncthreads(); //sync_2
        sdata[3*threadIdx.x]     = rtx;                 //tag_1
        sdata[3*threadIdx.x + 1] = rty;
        sdata[3*threadIdx.x + 2] = rtz;
        __syncthreads(); //sync_3

        int end_offset = min(blockDim.x, Natom - iatom);

        for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
            float rx = sdata[3*cur_offset];
            float ry = sdata[3*cur_offset + 1];
            float rz = sdata[3*cur_offset + 2];
            //sync_4
            float theta = rx*qx + ry*qy + rz*qz;

            theta = theta - lrint(theta);
            theta = theta * 2 * 3.1415926; // reduce theta to [-pi, pi]

            float ct, st;
            sincosf(theta, &st, &ct);

            c += ct;
            s += st;
        }
    }

    devStrFac[idx] += c*c + s*s;
}

Why is the `__syncthreads()` labeled sync_2 needed? Without sync_2, sdata[] gets wrong numbers and I get wrong results. The line labeled tag_1 uses the results of the line labeled tag_0, so in my mind sync_2 should not be needed. Where am I wrong? If this is due to out-of-order instruction execution, should I put a `__syncthreads()` at the line labeled sync_4 instead?

Consider one warp of the thread block finishing the first iteration and starting the next one while other warps are still working on the first iteration. If you don't have the `__syncthreads()` at label sync_2, this warp will write to shared memory while the other warps are still reading from it, which is a race condition.
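To see the two barriers' distinct roles in isolation, here is a minimal sketch of the same tiled-load pattern (hypothetical kernel and names, not from the question's code): one barrier protects readers of the *previous* tile from the next tile's writers, the other protects readers of the *current* tile from seeing a half-written tile.

```cuda
// Sketch: tiled shared-memory reduction over `in[0..n)`, one tile per
// blockDim.x elements. `tiledSum` and `tile` are placeholder names.
__global__ void tiledSum(const float *in, float *out, int n) {
    extern __shared__ float tile[];
    float acc = 0.0f;

    for (int base = 0; base < n; base += blockDim.x) {
        __syncthreads();            // like sync_2: every warp must be done
                                    // READING the previous tile before any
                                    // warp overwrites it below
        if (base + threadIdx.x < n)
            tile[threadIdx.x] = in[base + threadIdx.x];
        __syncthreads();            // like sync_3: the whole tile must be
                                    // WRITTEN before any warp reads it

        int len = min(blockDim.x, n - base);
        for (int i = 0; i < len; i++)
            acc += tile[i];         // these reads are what sync_2 protects
                                    // on the next trip around the loop
    }

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = acc;
}
```

The first barrier only matters from the second iteration onward, which is exactly why it is easy to overlook: dropping it never corrupts iteration 0.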

You might move this `__syncthreads()` at label sync_2 to the end of the outer loop for the sake of clarity.
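Concretely, that restructuring might look like the following sketch of just the outer loop (same variables as the kernel above; the inner loop body is unchanged and elided):

```cuda
// Sketch: the "sync_2" barrier moved to the bottom of the outer loop,
// where its purpose -- finish all reads before the next tile's writes --
// is easier to see. Behavior is equivalent to the original placement.
for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
    float rtx = devData[3*(iatom + threadIdx.x)];
    float rty = devData[3*(iatom + threadIdx.x)+1];
    float rtz = devData[3*(iatom + threadIdx.x)+2];
    sdata[3*threadIdx.x]     = rtx;
    sdata[3*threadIdx.x + 1] = rty;
    sdata[3*threadIdx.x + 2] = rtz;
    __syncthreads(); // sync_3: tile fully written before anyone reads it

    int end_offset = min(blockDim.x, Natom - iatom);
    for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
        // ... read sdata[3*cur_offset .. 3*cur_offset+2] as before ...
    }

    __syncthreads(); // was sync_2: all reads of this tile are finished
                     // before the next iteration overwrites sdata[]
}
```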

`cuda-memcheck --tool racecheck` should tell you where the problem is.
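A typical invocation might look like this (`myapp` is a placeholder binary name; compiling with `-lineinfo` lets the tool report source lines):

```shell
# Build with line info so racecheck can point at source lines.
nvcc -lineinfo -o myapp myapp.cu

# Run the shared-memory race detector.
cuda-memcheck --tool racecheck ./myapp

# On newer CUDA toolkits, cuda-memcheck has been superseded by
# compute-sanitizer, which offers the same racecheck tool:
compute-sanitizer --tool racecheck ./myapp
```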
