
Is shared memory persistent from one kernel launch to another?

While testing whether shared memory can be accessed by multiple kernels, I found that the data in shared memory are sometimes still there when accessed by a later kernel, but sometimes not. Moreover, when debugging the program with cuda-gdb, the data written to shared memory by the previous kernel can ALWAYS be read by the next kernel.

The following is a piece of test code, run on two GPUs.

    extern __shared__ double f_ds[];
    __global__ void kernel_writeToSharedMem(double* f_dev, int spd_x)
    {
       int tid_dev_x = (blockDim.x * blockIdx.x + threadIdx.x);
       int tid_dev_y = (blockDim.y * blockIdx.y + threadIdx.y);
       int tid_dev = tid_dev_y* spd_x + tid_dev_x;

       if(tid_dev < blockDim.x * blockDim.y * gridDim.x*gridDim.y)
          f_ds[threadIdx.y*blockDim.x+threadIdx.x] = 0.12345;
       __syncthreads();
    }

  __global__ void kernel_readFromSharedMem(double *f_dev, int dev_no, int spd_x)
    {
       int tid_dev_x = (blockDim.x * blockIdx.x + threadIdx.x);
       int tid_dev_y = (blockDim.y * blockIdx.y + threadIdx.y);
       int tid_dev = tid_dev_y* spd_x + tid_dev_x;

       if(tid_dev < blockDim.x * blockDim.y * gridDim.x*gridDim.y)
         {
           f_dev[tid_dev] = f_ds[threadIdx.y*blockDim.x+threadIdx.x];
           printf("threadID %d in dev [%d] is having number %f\n",
                   tid_dev,dev_no,f_ds[threadIdx.y*blockDim.x+threadIdx.x]);
         }
       __syncthreads();
     }


    int main()
    {
     ...

       dim3 block_size(BLOCK_SIZE,BLOCK_SIZE);
       dim3 grid_size(spd_x/BLOCK_SIZE,spd_y/BLOCK_SIZE);
       for(int i = 0; i < ngpus; i++)
         {
           cudaSetDevice(i);
           kernel_writeToSharedMem<<<grid_size,block_size,sizeof(double)*BLOCK_SIZE*BLOCK_SIZE,stream[i]>>>(f_dev[i],spd_x);
           cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated and redundant here
          }
        for(int i = 0; i < ngpus; i++)
         {
           cudaSetDevice(i);
           kernel_readFromSharedMem<<<grid_size,block_size,sizeof(double)*BLOCK_SIZE*BLOCK_SIZE,stream[i]>>>(f_dev[i], i, spd_x);
           cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated and redundant here
          }
      ...
    }

Four situations occurred when running the program:

1) Dev0 reads 0.12345 but Dev1 reads 0;

2) Dev0 reads 0 but Dev1 reads 0.12345;

3) Dev0 and Dev1 both read 0;

4) Dev0 and Dev1 both read 0.12345.

When running under cuda-gdb, 4) is always the case.

Does this indicate that shared memory persists only for the duration of a single kernel? Is shared memory only OCCASIONALLY cleared or freed after a kernel finishes?

Shared memory is guaranteed to have scope only for the lifetime of the block to which it is assigned. Any attempt to re-use shared memory from block to block, or from kernel launch to kernel launch, is completely undefined behaviour and should never be relied upon in a sane code design.
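The supported way to hand data from one kernel to the next is through global memory: use shared memory as per-block scratch space, then write the result out before the kernel ends. A minimal single-GPU sketch of that pattern (the names `stage`, `consume`, `d_buf`, and `N` are illustrative, not taken from the question's code):

```cuda
#include <cstdio>

// Writer kernel: compute in (dynamic) shared memory, then persist to global memory.
__global__ void stage(double *d_buf, int n)
{
    extern __shared__ double s[];      // valid only while this block is resident
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {
        s[threadIdx.x] = 0.12345;      // per-block scratch work happens here...
        d_buf[tid] = s[threadIdx.x];   // ...and the result is staged in global memory
    }
}

// Reader kernel: global memory contents are well defined across launches.
__global__ void consume(const double *d_buf, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n)
        printf("thread %d read %f\n", tid, d_buf[tid]);
}

int main()
{
    const int N = 64;
    double *d_buf;
    cudaMalloc(&d_buf, N * sizeof(double));
    stage<<<1, N, N * sizeof(double)>>>(d_buf, N);
    consume<<<1, N>>>(d_buf, N);       // reliably sees 0.12345, no debugger required
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}
```

Unlike the shared-memory trick in the question, the value read by `consume` here does not depend on scheduling, on the device, or on whether cuda-gdb is attached.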

