CUDA writes to global memory not seen by other warps

I was trying to explain global memory to someone who is new to CUDA. I came up with the following dummy kernel, which blocks the threads in all other warps until a selected warp sets a global variable to another value:

__global__ void with_sync()
{
    while (threadIdx.x / 32 != 0)
    {
        if (is_done != 0)
        {
            break;
        }
    }

    if (threadIdx.x / 32 == 0)
    {
        is_done = 1;
        printf("I'm done!\n");
    }
}

The variable is_done is declared outside of the function as a __device__ __managed__ int (which, correct me if I'm wrong, means that the variable will reside in the global memory space).

However, when I execute this kernel (1024 threads in a single 1D block) like so:

with_sync<<<1, 1024>>>();
cudaDeviceSynchronize();

"I'm done!" is printed out as expected. However, the CUDA program does not terminate (I placed cudaDeviceSynchronize() in the host code so that it waits for all threads). This leads me to wonder whether the other warps never saw the change to the is_done variable. However, I understand that global memory implies that the value is visible at the device level (i.e. at the very least, to all blocks in a grid).

My question is the following: Is there any caching/optimisation done by CUDA that makes it possible for this inconsistent global memory view to occur? Is there a way to access the "latest" value of a variable that resides in global memory?

Is there any caching/optimisation done by CUDA that makes it possible for this inconsistent global memory view to occur? Is there a way to access the "latest" value of a variable that resides in global memory?

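Yes, there is caching behavior. Without further qualification, the compiler is free to optimize the repeated read of is_done in the loop, for example by keeping the value in a register or serving it from cache, so the spinning warps may never observe the write. You can modify this behavior with the volatile qualifier, which makes every reference to the variable compile to an actual memory read or write.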

Here is a worked example:

$ cat t310.cu
#include <stdio.h>

#ifndef USE_VOLATILE
__device__ __managed__ int is_done = 0;
#else
__device__ volatile __managed__ int is_done = 0;
#endif

__global__ void with_sync()
{
    while (threadIdx.x / 32 != 0)
    {
        if (is_done != 0)
        {
            break;
        }
    }

    if (threadIdx.x / 32 == 0)
    {
        is_done = 1;
        printf("I'm done!\n");
    }
}



int main(){

  with_sync<<<1,1024>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t310 t310.cu
$ ./t310
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
^C
$ nvcc -o t310 t310.cu -DUSE_VOLATILE
$ ./t310
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
I'm done!
$

(In case it's unclear, the first run above hung and was terminated with Ctrl-C.)

Tesla P100 PCIE, CUDA 10.0, CentOS 7
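
As a variation on the worked example above, the following is a minimal sketch that leaves the declaration of is_done unchanged and instead accesses it through a volatile-qualified pointer inside the kernel. The kernel name with_sync_volatile_ptr and the local pointer flag are hypothetical names introduced here for illustration, and the sketch assumes the same launch configuration (one block of 1024 threads) as the original. It has not been verified on any particular GPU or toolkit version, so treat it as an illustration of the idea rather than a tested drop-in.

```cuda
#include <cstdio>

// Same flag as in the worked example, declared WITHOUT volatile.
__device__ __managed__ int is_done = 0;

// Hypothetical variant of with_sync(): the volatile qualifier is applied
// at the point of access instead of on the declaration.
__global__ void with_sync_volatile_ptr()
{
    // Accesses through a volatile pointer are not cached in registers by
    // the compiler, so each loop iteration issues a real load.
    volatile int *flag = &is_done;

    // Warps other than warp 0 spin until the flag changes.
    while (threadIdx.x / 32 != 0)
    {
        if (*flag != 0)
        {
            break;
        }
    }

    // Warp 0 sets the flag and reports.
    if (threadIdx.x / 32 == 0)
    {
        *flag = 1;
        printf("I'm done!\n");
    }
}

int main()
{
    with_sync_volatile_ptr<<<1, 1024>>>();

    // Report launch or execution errors instead of ignoring them.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Qualifying only the access keeps the rest of the program free to treat is_done as an ordinary int, at the cost of having to remember the volatile pointer at every place where the flag is polled. Declaring the variable itself volatile, as in the worked example, avoids that risk.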
