
When should I prefer write-combined CUDA-allocated mapped host memory?

Originally posted 2016-03-13 23:10:04. Tags: c++, memory-management, cuda, gpgpu

The cudaHostAlloc() API call has, among others, the following flags:

cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().

cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.

I couldn't quite understand when exactly I would prefer the "write-combined" option. The documentation doesn't say the transfer is faster in only one direction, so why is it recommended for only one direction? Also, which kinds of systems benefit from this "write-combining"?

I read this white paper, Section 4.7, and still did not get it. OK, so reading by the CPU is inefficient; but what if other benefits offset this inefficiency? Or, if they cannot, why not?

An elucidation would be appreciated.

1 answer

Write-combined memory allows the CPU to combine multiple narrow writes into fewer, wider writes, thus increasing the efficiency of memory writes. If memory serves, WC memory was first introduced with the Intel Pentium Pro around 1995 to speed up CPU writes into the frame buffer of video cards. I am not up to speed on which modern system platforms use or support it.
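Since the benefit depends on the platform, one way to find out is to measure it. The following is only a rough sketch, not a rigorous benchmark: the helper `time_h2d_ms`, the `CHECK` macro, and the 64 MiB buffer size are my own illustrative choices. It times a host->device copy from a default pinned buffer and from a write-combined pinned one, with the CPU only ever writing to the buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(err_));                      \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Hypothetical helper: time one host->device copy from a pinned buffer
// allocated with the given cudaHostAlloc() flags.
static float time_h2d_ms(unsigned int flags, size_t bytes) {
    unsigned char *host = nullptr;
    unsigned char *dev = nullptr;
    CHECK(cudaHostAlloc((void **)&host, bytes, flags));
    CHECK(cudaMalloc((void **)&dev, bytes));

    // The CPU writes the buffer sequentially and never reads it back.
    for (size_t i = 0; i < bytes; ++i) host[i] = (unsigned char)i;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Untimed warm-up copy, then the timed copy.
    CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return ms;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary size for illustration
    float plain_ms = time_h2d_ms(cudaHostAllocDefault, bytes);
    float wc_ms = time_h2d_ms(cudaHostAllocWriteCombined, bytes);
    std::printf("default pinned:        %.3f ms\n", plain_ms);
    std::printf("write-combined pinned: %.3f ms\n", wc_ms);
    return 0;
}
```

A single timed copy is noisy, so for anything beyond a quick look you would want to repeat the measurement and average. Whether the two numbers differ at all will depend on the system configuration, as the quoted documentation hints.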

The efficiency of reads performed by the CPU is going to be the same for both cudaHostAllocMapped and cudaHostAllocWriteCombined. But because the latter allows more efficient writes by the CPU, it is recommended for "buffers that will be written by the CPU and read by the device", as the quoted documentation states.
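To make the recommended pattern concrete, here is a minimal sketch of a buffer that is written by the CPU and read by the device through mapped pinned memory. The two flags are orthogonal and can be OR-ed together. The kernel `scale`, the `CHECK` macro, and the sizes are made up for illustration. Following the documentation's warning about CPU reads from WC memory, the sketch never reads the WC buffer on the host; results come back through an ordinary buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(err_));                      \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Illustrative kernel: reads the mapped host buffer, writes device memory.
__global__ void scale(const float *in, float *out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Enable mapped pinned allocations. Older CUDA versions need this before
    // the context is created; here it is the first runtime call.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Pinned, mapped into the device address space, and write-combined.
    float *h_in = nullptr;
    CHECK(cudaHostAlloc((void **)&h_in, bytes,
                        cudaHostAllocMapped | cudaHostAllocWriteCombined));

    // CPU side: sequential writes only, no reads from h_in.
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Device-side alias of the same host allocation.
    float *d_in = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&d_in, h_in, 0));

    float *d_out = nullptr;
    CHECK(cudaMalloc((void **)&d_out, bytes));

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Copy results into ordinary (cacheable) host memory for CPU reads.
    float *h_out = (float *)std::malloc(bytes);
    if (h_out == nullptr) return EXIT_FAILURE;
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    std::printf("out[123] = %f\n", h_out[123]);

    std::free(h_out);
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    return 0;
}
```

If the CPU also needs to read the input data later, keeping a separate ordinary copy is probably a better idea than reading it back from the WC buffer.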