
Do I need to mirror input buffers/textures across multiple GPUs in CUDA?

TL;DR: Do I need to mirror read-only lookup textures and input buffers across multiple devices when doing multi-GPU programming with CUDA (whether as a strict requirement or for best performance)?

I have a GPU kernel which takes in two textures for lookups and two (smallish) buffers for input data.

I've expanded my code to allow for multiple GPUs (our system will have a max of 8, but for testing I'm on a smaller dev system using only 2). Our system uses NVLink and we have UVA enabled.

My setup involves making device 0 a sort of "master" or "root" device, where the final result is stored and the final serial (serial as in only executable on one GPU) operations occur. All devices are set up to allow peer access to dev 0. The kernel is invoked multiple times on each device in a loop of the form:

for(unsigned int f = 0; f < maxIterations; f++)
{
    unsigned int devNum = f % maxDevices; //maxIterations >> maxDevices
    cudaSetDevice(devNum);
    cudaDeviceSynchronize(); //Is this really needed?
    //launch configuration (grid/block sizes) was elided as "<<<>>>" in the original
    executeKernel<<<gridSize, blockSize>>>(workBuffers[devNum], luTex1, luTex2, inputBufferA, inputBufferB);
    cudaMemcpyAsync(&bigGiantBufferOnDev0[f * bufferStride],
                     workBuffers[devNum],
                     sizeof(float) * bufferStride,
                     cudaMemcpyDeviceToDevice);
}
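
For reference, a minimal sketch of the peer-access setup alluded to above ("All devices are set up to allow peer access to dev 0"); the function name and loop bounds are hypothetical, and error checking is omitted:

#include <cuda_runtime.h>

//Enable peer access from each worker device to device 0, so kernels and
//copies running on device N can touch device 0's memory over NVLink.
void enablePeerAccessToRoot(int maxDevices)
{
    for (int dev = 1; dev < maxDevices; dev++)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, dev, 0); //can dev reach device 0's memory?
        if (canAccess)
        {
            cudaSetDevice(dev);               //peer access is enabled from the current device
            cudaDeviceEnablePeerAccess(0, 0); //second argument (flags) must be 0
        }
    }
}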

As one can see, each device has its own "work buffer" for writing out intermediate results, and these results are then memcpy'd to device 0.

The work (output) buffers are several orders of magnitude larger than the input buffers, and I noticed that when I'd made a mistake and accessed buffers across devices, there was a major performance hit (presumably because kernels were accessing memory on another device). However, I haven't noticed a similar hit with the read-only input buffers after fixing the output buffer issue.

Which brings me to my question: Do I actually need to mirror these input buffers and textures across devices, or is there a caching mechanism that makes this unnecessary? Why do I notice such a massive performance hit when accessing the work buffers across devices, but seemingly incur no such penalty with the input buffers/textures?

Texturing, as well as ordinary global data access, is possible "remotely" if you have enabled peer access. Since such access occurs over NVLink (or the peer-capable fabric), it will generally be slower.

For "smallish" input buffers, it may be that the GPU caching mechanisms tend to reduce or mitigate the penalty associated with remote access.对于“小”输入缓冲区,GPU 缓存机制可能倾向于减少或减轻与远程访问相关的惩罚。 The GPU has specific read-only caches that are designed to help with read-only/input data, and of course the texturing mechanism has its own cache. GPU 有专门的只读缓存,旨在帮助处理只读/输入数据,当然纹理机制也有自己的缓存。 Detailed performance statements are not possible unless actual analysis is done with actual code.除非使用实际代码进行实际分析,否则无法进行详细的性能陈述。

If you use a Pascal-or-newer GPU, they have unified memory; you don't need manual data migration.

When code running on a CPU or GPU accesses data allocated this way (often called CUDA managed data), the CUDA system software and/or the hardware takes care of migrating memory pages to the memory of the accessing processor.

https://devblogs.nvidia.com/unified-memory-cuda-beginners/
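
A minimal managed-memory sketch along the lines of that post (the buffer name is taken from the question; numElems and devNum are hypothetical):

float* inputBufferA = nullptr;
cudaMallocManaged(&inputBufferA, numElems * sizeof(float)); //accessible from any GPU or the CPU
//... fill inputBufferA on the host ...
//Optional hint: prefetch the pages to a device before launch to avoid
//on-demand page-fault migration (Pascal and newer):
cudaMemPrefetchAsync(inputBufferA, numElems * sizeof(float), devNum);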

If you use the old-school way of allocating buffers (cudaMalloc), I think you do need to mirror the data.
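
A sketch of what that mirroring could look like with explicit allocations (variable names are illustrative, error checking omitted): allocate one replica of each read-only input per device, then broadcast from device 0:

float* inputA[8]; //one replica per device; assumes maxDevices <= 8
for (int dev = 0; dev < maxDevices; dev++)
{
    cudaSetDevice(dev);
    cudaMalloc(&inputA[dev], inputBytes);
}
cudaSetDevice(0);
cudaMemcpy(inputA[0], hostInputA, inputBytes, cudaMemcpyHostToDevice);
for (int dev = 1; dev < maxDevices; dev++)
{
    //cudaMemcpyPeer(dst, dstDevice, src, srcDevice, count)
    cudaMemcpyPeer(inputA[dev], dev, inputA[0], 0, inputBytes);
}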


 