
Cuda cudaMemcpy and cudaMalloc

I always read that it is slow to allocate and transfer data from the CPU to the GPU. Is this because cudaMalloc is slow? Is it because cudaMemcpy is slow? Or is it because both of them are slow?

It is mostly tied to two things, the first being the speed of the PCI Express bus between the card and the CPU. The other is tied to the way these functions operate. Now, I think the new CUDA 4 has better support for memory allocation (standard or pinned) and a way to access memory transparently across the bus.

Now, let's face it: at some point you'll need to get data from point A to point B to compute something. The best way to handle it is either to have a really large computation going on, or to use CUDA streams to overlap transfer and computation on the GPU.
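For example, a rough sketch of the streams idea (the kernel, sizes, and the two-way split here are only placeholders, not anyone's actual code): each chunk is copied asynchronously from pinned host memory and processed in its own stream, so the copy for one chunk can overlap with the kernel running on another.

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;   // placeholder computation
    }

    int main() {
        const int N = 1 << 22;         // total elements (illustrative)
        const int CHUNK = N / 2;       // split the work across two streams

        float *h_data, *d_data;
        cudaMallocHost(&h_data, N * sizeof(float));  // pinned host memory, needed for async copies
        cudaMalloc(&d_data, N * sizeof(float));

        cudaStream_t streams[2];
        for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < 2; ++s) {
            int offset = s * CHUNK;
            // The copy in one stream can overlap with the kernel in the other stream
            cudaMemcpyAsync(d_data + offset, h_data + offset, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, CHUNK);
            cudaMemcpyAsync(h_data + offset, d_data + offset, CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
        cudaFreeHost(h_data);
        cudaFree(d_data);
        return 0;
    }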

In most applications, you should be calling cudaMalloc once at the beginning and then not calling it again. Thus, the bottleneck is really cudaMemcpy.
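In other words, something like the following pattern (the names and structure here are purely illustrative): allocate the device buffer once up front, reuse it every iteration, and free it once at the end.

    #include <cuda_runtime.h>

    float *d_buf = nullptr;
    size_t buf_bytes = 0;

    void init(size_t bytes) {
        buf_bytes = bytes;
        cudaMalloc(&d_buf, buf_bytes);        // one-time allocation
    }

    void step(const float *h_in, float *h_out) {
        // Only the copies (and the kernels) run per iteration; no allocation here.
        cudaMemcpy(d_buf, h_in, buf_bytes, cudaMemcpyHostToDevice);
        // ... launch kernels that operate on d_buf ...
        cudaMemcpy(h_out, d_buf, buf_bytes, cudaMemcpyDeviceToHost);
    }

    void shutdown() {
        cudaFree(d_buf);                      // one-time release
    }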

This is due to physical limitations. For a standard PCI-E 2.0 x16 link, you'll get 8 GB/s theoretical but typically 5-6 GB/s in practice. Compare this with even a mid-range Fermi like the GTX 460, which has 80+ GB/s of memory bandwidth on the device. You're in effect taking an order-of-magnitude hit in memory bandwidth, which inflates your data transfer times accordingly.
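If you want to see what your own link actually achieves, a simple sketch is to time one large copy with CUDA events (the 256 MB transfer size is arbitrary):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256u << 20;      // 256 MB test transfer
        float *h, *d;
        cudaMallocHost(&h, bytes);            // pinned memory gives the best transfer rate
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }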

GPGPUs are supposed to be supercomputers, and I believe Seymour Cray (the supercomputer guy) said, "a supercomputer turns compute-bound problems into I/O-bound problems." Thus, optimizing data transfers is everything.

In my personal experience, iterative algorithms are the ones that by far show the best improvements from porting to GPGPU (2-3 orders of magnitude), because you can eliminate transfer time by keeping everything in-situ on the GPU.
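A minimal sketch of that in-situ pattern (the update kernel is just a stand-in for whatever the iteration does): upload once, iterate entirely on device memory, download once, so the transfer cost is amortized over all iterations.

    #include <cuda_runtime.h>

    __global__ void update(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 0.5f * (x[i] + 1.0f / x[i]);   // placeholder iterative step
    }

    void solve(float *h_x, int n, int iters) {
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);   // one upload

        for (int k = 0; k < iters; ++k)
            update<<<(n + 255) / 256, 256>>>(d_x, n);    // data never leaves the GPU

        cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);   // one download
        cudaFree(d_x);
    }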
