
How do I use global memory correctly in CUDA?

I'm trying to write an application in CUDA that uses global memory declared with __device__. These variables are declared in a .cuh file.

My main is in another file, file.cu, where I do the cudaMalloc and cudaMemcpy calls.

This is part of my code:

cudaMalloc((void**)&varOne,*tam_varOne * sizeof(cuComplex));
cudaMemcpy(varOne,C_varOne,*tam_varOne * sizeof(cuComplex),cudaMemcpyHostToDevice);

varOne is declared in the .cuh file like this:

    __device__ cuComplex *varOne;

When I launch my kernel (I'm not passing varOne as a parameter) and try to read varOne with the debugger, it says it can't read the variable. The pointer address is 000..0, so it's obviously wrong.

So, how do I have to declare and copy global memory in CUDA?

First, you need to declare the pointers to the data that will be copied from the CPU to the GPU. In the example below, we want to copy the array original_cpu_array to CUDA global memory.

int original_cpu_array[array_size];   
int *array_cuda;

Calculate the memory size that the data will occupy.

int size = array_size * sizeof(int);

CUDA memory allocation:

msg_erro[0] = cudaMalloc((void **)&array_cuda,size);

Copying from CPU to GPU:

msg_erro[0] = cudaMemcpy(array_cuda, original_cpu_array,size,cudaMemcpyHostToDevice);

Execute the kernel.

Copying from GPU to CPU:

msg_erro[0] = cudaMemcpy(original_cpu_array,array_cuda,size,cudaMemcpyDeviceToHost);

Free memory:

cudaFree(array_cuda);
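Taken together, the steps above can be sketched as one minimal host program. The kernel, grid size, and array contents here are illustrative assumptions, not part of the original answer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel just to have something to launch: increment each element.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int array_size = 100;
    int original_cpu_array[array_size] = {0};
    int *array_cuda = nullptr;
    int size = array_size * sizeof(int);

    cudaMalloc((void **)&array_cuda, size);                                   // CUDA memory allocation
    cudaMemcpy(array_cuda, original_cpu_array, size, cudaMemcpyHostToDevice); // CPU -> GPU
    increment<<<(array_size + 255) / 256, 256>>>(array_cuda, array_size);     // execute kernel
    cudaMemcpy(original_cpu_array, array_cuda, size, cudaMemcpyDeviceToHost); // GPU -> CPU
    cudaFree(array_cuda);                                                     // free memory

    printf("%d\n", original_cpu_array[0]);
    return 0;
}
```

Each runtime call above returns a cudaError_t, which is what the error-logging scheme described next captures.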

For debugging purposes, I typically save the status of the function calls in an array (e.g., cudaError_t msg_erro[var];). This is not strictly necessary, but it will save you some time if an error occurs during the allocations and memory transfers.

And if errors do occur, I print them using a function like:

void printErros(cudaError_t *erros, int size, int flag)
{
    for(int i = 0; i < size; i++)
        if(erros[i] != cudaSuccess)
        {
            if(flag == 0) printf("Memory allocation ");
            if(flag == 1) printf("CPU -> GPU  ");
            if(flag == 2) printf("GPU -> CPU  ");
            printf("{%d} => %s\n", i, cudaGetErrorString(erros[i]));
        }
}

The flag primarily indicates the part of the code where the error occurred. For instance, after a memory allocation:

msg_erro[0] = cudaMalloc((void **)&array_cuda,size);
printErros(msg_erro,msg_erro_size, 0);

I have experimented with some examples and found that you cannot directly use a global variable in a kernel without passing it in. Even though you declare it in the .cuh file, you need to initialize it in main().

Reason:

  1. If you declare it globally, the memory is not allocated in GPU global memory. You need to use cudaMalloc((void**)&varOne, sizeof(cuComplex)) to allocate the memory; only cudaMalloc can allocate memory on the GPU. The declaration __device__ cuComplex *varOne; works just as a prototype and variable declaration, but the memory is not allocated until cudaMalloc((void**)&varOne, sizeof(cuComplex)) is used.
  2. Also, you need to initialize varOne in main() as a host pointer initially. After cudaMalloc() is used, the runtime knows that the pointer is a device pointer.
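If you do want to keep varOne declared as a __device__ global in the .cuh file, the CUDA runtime's symbol API is the usual route: allocate through an ordinary host-side pointer, then copy that pointer value into the device symbol with cudaMemcpyToSymbol. This is a sketch under that assumption; the helper name allocateVarOne is hypothetical, and tam_varOne/C_varOne are taken from the question:

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>

__device__ cuComplex *varOne;   // device-side global pointer, as in the .cuh file

void allocateVarOne(const cuComplex *C_varOne, size_t tam_varOne) {
    cuComplex *d_ptr = nullptr;  // host-side handle to the device allocation
    cudaMalloc((void **)&d_ptr, tam_varOne * sizeof(cuComplex));
    cudaMemcpy(d_ptr, C_varOne, tam_varOne * sizeof(cuComplex),
               cudaMemcpyHostToDevice);
    // Store the device address into the __device__ symbol so kernels can
    // dereference varOne directly, without receiving it as a parameter.
    cudaMemcpyToSymbol(varOne, &d_ptr, sizeof(d_ptr));
}
```

After this, kernels can read and write through varOne directly; cudaGetSymbolAddress is an alternative when you prefer to keep using plain cudaMemcpy on the symbol.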

The sequence of steps is (for my tested code):

int *Ad;        // If you can allocate this in the .cuh file, you don't need the code shown in main()

__global__ void Kernel(int *Ad){
    ....
}

int main(){
    ....
    int size = 100 * sizeof(int);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);   // A is the host array
    ....
}
