
Do I need to mirror input buffers/textures across multiple GPUs in CUDA?

TL;DR: Do I need to mirror read-only lookup textures and input buffers across multiple devices when doing multi-GPU programming with CUDA (whether as a strict requirement or just for best performance)?

I have a GPU kernel which takes in two textures for lookups and two (smallish) buffers for input data.

I've expanded my code to allow for multiple GPUs (our system will have a max of 8, but for testing I'm on a smaller dev system with only 2). Our system uses NVLink, and we have UVA enabled.

My setup involves making device 0 a sort of "master" or "root" device where the final result is stored and the final serial (serial as in only executable on one GPU) operations occur. All devices are set up to allow peer access to dev 0. The kernel is invoked multiple times on each device in a loop of the form:

for(unsigned int f = 0; f < maxIterations; f++)
{
    unsigned int devNum = f % maxDevices; //maxIterations >> maxDevices
    cudaSetDevice(devNum);
    cudaDeviceSynchronize(); //Is this really needed?
    //gridSize/blockSize stand in for the actual launch configuration
    executeKernel<<<gridSize, blockSize>>>(workBuffers[devNum], luTex1, luTex2, inputBufferA, inputBufferB);
    cudaMemcpyAsync(&bigGiantBufferOnDev0[f * bufferStride],
                     workBuffers[devNum],
                     sizeof(float) * bufferStride,
                     cudaMemcpyDeviceToDevice);
}

As one can see, each device has its own "work buffer" for writing out intermediate results, and these results are then memcpy'd to device 0.
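For reference, the per-device work buffers are set up once before the loop, roughly along these lines (a sketch; the exact sizes are illustrative):

    // One intermediate buffer per device (max of 8 devices on our system)
    float* workBuffers[8];

    for (unsigned int d = 0; d < maxDevices; d++)
    {
        cudaSetDevice(d);
        cudaMalloc(&workBuffers[d], sizeof(float) * bufferStride);
    }

    // Device 0 holds the large result buffer that every device copies into
    cudaSetDevice(0);
    cudaMalloc(&bigGiantBufferOnDev0, sizeof(float) * bufferStride * maxIterations);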

The work (output) buffers are several orders of magnitude larger than the input buffers. When I made a mistake and accessed the work buffers across devices, I noticed a major performance hit (presumably because kernels were accessing memory resident on another device). However, after fixing that issue, I haven't noticed a similar hit with the read-only input buffers.

Which brings me to my question: Do I actually need to mirror these input buffers and textures across devices, or is there a caching mechanism that makes this unnecessary? Why do I notice such a massive performance hit when accessing the work buffers across devices, but seemingly incur no such penalty with the input buffers/textures?

Texturing, as well as ordinary global data access, is possible "remotely" if you have enabled peer access. Since such access travels over NVLink (or whatever peer-capable fabric connects the GPUs), it will generally be slower than a local access.
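For reference, peer access is typically enabled once at startup with the runtime API, roughly like this (a sketch assuming maxDevices devices with device 0 as the root, error handling omitted):

    for (int d = 1; d < maxDevices; d++)
    {
        int canAccess = 0;

        cudaDeviceCanAccessPeer(&canAccess, d, 0);
        if (canAccess)
        {
            cudaSetDevice(d);
            cudaDeviceEnablePeerAccess(0, 0);  // let device d access device 0's memory
        }

        cudaDeviceCanAccessPeer(&canAccess, 0, d);
        if (canAccess)
        {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(d, 0);  // let device 0 access device d's memory
        }
    }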

For "smallish" input buffers, it may be that the GPU caching mechanisms tend to reduce or mitigate the penalty associated with remote access. The GPU has specific read-only caches that are designed to help with read-only/input data, and of course the texturing mechanism has its own cache. Detailed performance statements are not possible unless actual analysis is done with actual code.

If you use a Pascal-or-newer GPU, it supports Unified Memory, so you don't need to migrate the data yourself.

When code running on a CPU or GPU accesses data allocated this way (often called CUDA managed data), the CUDA system software and/or the hardware takes care of migrating memory pages to the memory of the accessing processor.

https://devblogs.nvidia.com/unified-memory-cuda-beginners/
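A minimal sketch of that approach, reusing the placeholder names from the question (inputLength, gridSize, and blockSize are assumptions):

    // A single managed allocation is visible to every GPU; pages migrate on demand
    float* inputBufferA = nullptr;
    cudaMallocManaged(&inputBufferA, sizeof(float) * inputLength);

    // Fill on the host...
    for (size_t i = 0; i < inputLength; i++)
        inputBufferA[i] = 0.0f;

    // ...then launch on any device without an explicit copy
    cudaSetDevice(devNum);
    executeKernel<<<gridSize, blockSize>>>(workBuffers[devNum], luTex1, luTex2,
                                           inputBufferA, inputBufferB);

    // Optionally pre-stage the read-only data on a specific device
    cudaMemPrefetchAsync(inputBufferA, sizeof(float) * inputLength, devNum);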

If you allocate the buffers the old-school way (cudaMalloc), you do need to mirror the data, I think.
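If you do mirror explicitly, the usual pattern is one copy per device, roughly like this (hostInputA and inputLength are placeholder names):

    // One copy of the read-only input buffer per device
    float* inputBufferA_perDev[8];

    for (unsigned int d = 0; d < maxDevices; d++)
    {
        cudaSetDevice(d);
        cudaMalloc(&inputBufferA_perDev[d], sizeof(float) * inputLength);
        cudaMemcpy(inputBufferA_perDev[d], hostInputA,
                   sizeof(float) * inputLength, cudaMemcpyHostToDevice);
    }

    // Each kernel launch then uses the copy local to its device:
    // executeKernel<<<gridSize, blockSize>>>(workBuffers[d], luTex1, luTex2,
    //                                        inputBufferA_perDev[d], ...);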
