
Global Memory vs. Dynamic Global Memory Allocation in CUDA

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU), and I am placing these in global memory.

For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread accesses exactly the same global memory addresses, with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from global memory than writes to it in my application, however.

My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory. The amount of global memory I need will change depending on the application, so dynamic allocation would be preferable for that reason. However, I know the upper limit on my global memory use, and I am more concerned with performance, so I could also declare the memory statically as one large fixed allocation that I am sure will not overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU, are they cached the same way, or does the cost of reading differ between the two forms?

Global memory can be allocated statically (using __device__), dynamically (using device malloc or new), and via the CUDA runtime API (e.g. using cudaMalloc).
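A minimal sketch of the three allocation methods side by side (sizes and names here are illustrative, not from the question):

```cuda
#include <cstdlib>

// 1. Statically allocated global memory: size fixed at compile time.
__device__ float d_static[1024];

// 2. Dynamically allocated from device code with in-kernel malloc/new.
//    (The device heap must be large enough; it can be sized with
//    cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).)
__global__ void device_alloc_kernel()
{
    float *p = (float *)malloc(256 * sizeof(float));
    if (p != NULL) {
        // ... use p ...
        free(p);
    }
}

// 3. Allocated from the host via the runtime API.
void host_alloc(float **d_ptr, size_t n)
{
    cudaMalloc(d_ptr, n * sizeof(float));
}
```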

All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated, and therefore the same general performance considerations apply.

Since dynamic allocations take some non-zero time, you may see a performance improvement by doing the allocations once, at the beginning of your program, either with the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids spending time on dynamic memory allocation in the performance-sensitive parts of your code.
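For example, since your kernel runs inside a for loop, the allocation can be hoisted out of the loop so its cost is paid only once (kernel name, grid dimensions, and sizes below are placeholders):

```cuda
// Allocate once, up front, rather than per iteration.
float *d_buf = NULL;
cudaMalloc(&d_buf, N * sizeof(float));          // one-time cost

for (int i = 0; i < num_iterations; ++i) {
    my_kernel<<<grid, block>>>(d_buf, N);       // no allocation in the loop
    // ... reduction and write-back to d_buf ...
}

cudaFree(d_buf);
```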

Also note that the 3 methods I outline, while having similar C/C++-style access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using runtime API functions such as cudaMemcpyToSymbol and cudaMemcpyFromSymbol; runtime-API-allocated memory is accessed via ordinary cudaMalloc/cudaMemcpy-type functions; and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
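The host-side access differences can be sketched as follows (array sizes are arbitrary):

```cuda
__device__ float d_sym[256];   // statically allocated global memory

void host_access_examples(float *h_data)
{
    // Static (__device__) memory: accessed from the host by symbol.
    cudaMemcpyToSymbol(d_sym, h_data, 256 * sizeof(float));
    cudaMemcpyFromSymbol(h_data, d_sym, 256 * sizeof(float));

    // Runtime-API memory: accessed via an ordinary pointer and cudaMemcpy.
    float *d_ptr = NULL;
    cudaMalloc(&d_ptr, 256 * sizeof(float));
    cudaMemcpy(d_ptr, h_data, 256 * sizeof(float), cudaMemcpyHostToDevice);
    cudaFree(d_ptr);

    // Device malloc/new memory: not directly accessible from the host;
    // a kernel would first have to copy it into one of the regions above.
}
```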

First of all, you need to think about coalescing your memory accesses. You didn't mention which GPU you are using. On the latest GPUs, a coalesced global memory read will give performance comparable to a constant memory read. So always make your memory reads and writes as coalesced as possible.
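To illustrate the coalescing point, a sketch contrasting a coalesced access pattern with a strided one (kernel names are illustrative):

```cuda
// Coalesced: consecutive threads in a warp read consecutive addresses,
// so the warp's loads combine into a small number of memory transactions.
__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Strided (non-coalesced): threads in a warp touch addresses far apart,
// forcing many separate transactions and wasting bandwidth.
__global__ void strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i] * 2.0f;
}
```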

You can also use texture memory (if the data size fits into it). Texture memory has its own caching mechanism; it was previously used in cases where global memory reads were non-coalesced. But the latest GPUs give almost the same performance for texture and global memory.

I don't think globally declared memory gives better performance than dynamically allocated global memory, since the coalescing issue exists either way. Note also that variables declared at global (program) scope, such as __device__, constant memory, and texture variables, do not need to be passed to kernels as arguments.

For memory optimizations, please see the memory optimization section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations
