
About cudaMemcpyAsync Function

I have some questions.

Recently I have been writing a program using CUDA.

In my program, there is one large dataset on the host, stored as a std::map<string, vector<int>>.

From this data, some vector<int> contents are copied to the GPU's global memory and processed on the GPU.

After processing, results are generated on the GPU, and these results are copied back to the CPU.

This is my program's schedule (sketched in code below):

  1. cudaMemcpy( ... , cudaMemcpyHostToDevice)
  2. kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
  3. cudaMemcpy( ... , cudaMemcpyDeviceToHost)
  4. repeat steps 1-3 1000 times (once for each data vector)
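To make the question concrete, here is a rough sketch of that loop (the kernel `process` and the buffers `d_in`, `d_out`, `h_result` are placeholders, not my real code):

```cpp
#include <cuda_runtime.h>
#include <map>
#include <string>
#include <vector>

__global__ void process(const int* in, int* out, int n);  // placeholder kernel

// Synchronous version: each cudaMemcpy blocks the host until the copy completes.
void runSync(const std::map<std::string, std::vector<int>>& hostMap,
             int* d_in, int* d_out, int* h_result) {
    for (const auto& entry : hostMap) {                        // ~1000 iterations
        const std::vector<int>& v = entry.second;
        const size_t bytes = v.size() * sizeof(int);
        cudaMemcpy(d_in, v.data(), bytes, cudaMemcpyHostToDevice);    // step 1
        process<<<(unsigned)((v.size() + 255) / 256), 256>>>(
            d_in, d_out, (int)v.size());                              // step 2
        cudaMemcpy(h_result, d_out, bytes, cudaMemcpyDeviceToHost);   // step 3
    }
}
```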

But I want to reduce processing time.

So I decided to use the cudaMemcpyAsync function in my program.

After reading some documents and web pages, I learned that to use cudaMemcpyAsync, the host memory holding the data to be copied to GPU global memory must be allocated as pinned memory.

But my program uses std::map, and I couldn't make the std::map data itself pinned memory.

So instead, I allocated a buffer array in pinned memory, sized so that it can always hold any vector being copied.
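A sketch of what I did, assuming `maxElems` is the element count of the largest vector in the map:

```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// One reusable page-locked (pinned) staging buffer, sized for the worst case.
// Pinned memory is what the DMA engine needs for cudaMemcpyAsync to be truly
// asynchronous; pageable memory forces an extra staging copy inside the driver.
int* allocStagingBuffer(size_t maxElems) {
    int* h_pinned = nullptr;
    cudaHostAlloc((void**)&h_pinned, maxElems * sizeof(int), cudaHostAllocDefault);
    return h_pinned;
}

// Stage one vector from the std::map into the pinned buffer.
void stage(const std::vector<int>& v, int* h_pinned) {
    std::memcpy(h_pinned, v.data(), v.size() * sizeof(int));
}
```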

In the end, my program worked like this (sketch below):

  1. memcpy (copy data from the std::map into the buffer in a loop, until the whole data is in the buffer)
  2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
  3. kernel (the kernel can only run once the whole data has been copied to GPU global memory)
  4. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
  5. repeat steps 1-4 1000 times (once for each data vector)
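Roughly, with the same placeholder names as before and `h_pinned` being the pinned staging buffer:

```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

__global__ void process(const int* in, int* out, int n);  // placeholder kernel

// Stage the whole work item (all of its vectors) into the pinned buffer
// first, then issue one big asynchronous copy.
void runAsyncBuffered(const std::vector<std::vector<int>>& vectors,
                      int* h_pinned, int* h_result,
                      int* d_in, int* d_out, cudaStream_t stream) {
    size_t offset = 0;
    for (const auto& v : vectors) {                            // step 1
        std::memcpy(h_pinned + offset, v.data(), v.size() * sizeof(int));
        offset += v.size();
    }
    const size_t total = offset;
    cudaMemcpyAsync(d_in, h_pinned, total * sizeof(int),
                    cudaMemcpyHostToDevice, stream);           // step 2
    process<<<(unsigned)((total + 255) / 256), 256, 0, stream>>>(
        d_in, d_out, (int)total);                              // step 3
    cudaMemcpyAsync(h_result, d_out, total * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);           // step 4
    cudaStreamSynchronize(stream);  // pinned buffers are reused next time
}
```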

And my program became much faster than the previous version.

But my problem (really, my curiosity) starts at this point.

I tried writing another program in a similar way:

  1. memcpy (copy data from the std::map into the buffer, but only one vector at a time)
  2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
  3. loop steps 1-2 until the whole data has been copied to GPU global memory
  4. kernel (the kernel can only run once the necessary data has been copied to GPU global memory)
  5. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
  6. repeat steps 1-5 1000 times (once for each data vector)
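Roughly (again with placeholder names; the only change is that staging and uploading are interleaved per vector):

```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

__global__ void process(const int* in, int* out, int n);  // placeholder kernel

// Same work item, but cudaMemcpyAsync returns immediately, so the
// std::memcpy staging the next vector starts while the previous transfer
// may still be in flight. Each transfer uses its own disjoint region of
// the pinned buffer so nothing in flight is overwritten.
void runAsyncChunked(const std::vector<std::vector<int>>& vectors,
                     int* h_pinned, int* h_result,
                     int* d_in, int* d_out, cudaStream_t stream) {
    size_t offset = 0;
    for (const auto& v : vectors) {                             // steps 1-3
        const size_t bytes = v.size() * sizeof(int);
        std::memcpy(h_pinned + offset, v.data(), bytes);
        cudaMemcpyAsync(d_in + offset, h_pinned + offset, bytes,
                        cudaMemcpyHostToDevice, stream);
        offset += v.size();
    }
    const size_t total = offset;
    process<<<(unsigned)((total + 255) / 256), 256, 0, stream>>>(
        d_in, d_out, (int)total);                               // step 4
    cudaMemcpyAsync(h_result, d_out, total * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);            // step 5
    cudaStreamSynchronize(stream);
}
```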

This method turned out to be about 10% faster than the method discussed above.

But I don't know why.

I thought cudaMemcpyAsync could only be overlapped with a kernel function.

But in my case, that does not seem to be what is happening. Rather, it looks like there can be overlap between cudaMemcpyAsync calls.

Sorry for the long question, but I really want to know why.

Can someone teach me or explain what exactly the "cudaMemcpyAsync" facility is, and which operations can be overlapped with "cudaMemcpyAsync"?

The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.

As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:

  1. Host memory buffers that are pinned, e.g. via cudaHostAlloc()
  2. Usage of CUDA streams to separate various types of activity (data copy and kernel computation)
  3. Usage of cudaMemcpyAsync (instead of cudaMemcpy)

Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
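Here is a minimal sketch that puts those three ingredients together with the block A / block B idea; the kernel body, sizes, and data initialization are placeholders:

```cpp
#include <cuda_runtime.h>

__global__ void process(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;            // stand-in for the real work
}

int main() {
    const int nChunks = 8;
    const int chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(int);

    int *h_in, *h_out;                        // ingredient 1: pinned host buffers
    cudaHostAlloc((void**)&h_in,  nChunks * chunkBytes, cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, nChunks * chunkBytes, cudaHostAllocDefault);
    // (filling h_in with real input data is omitted in this sketch)

    cudaStream_t stream[2];                   // ingredient 2: CUDA streams
    int *d_in[2], *d_out[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void**)&d_in[s],  chunkBytes);
        cudaMalloc((void**)&d_out[s], chunkBytes);
    }

    for (int c = 0; c < nChunks; ++c) {
        int s = c % 2;  // alternate streams: while the kernel works on chunk A
                        // in one stream, chunk B's copy proceeds in the other
        cudaMemcpyAsync(d_in[s], h_in + (size_t)c * chunkElems, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);  // ingredient 3
        process<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(
            d_in[s], d_out[s], chunkElems);
        cudaMemcpyAsync(h_out + (size_t)c * chunkElems, d_out[s], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                  // drain both streams

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_in[s]); cudaFree(d_out[s]); cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```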

The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:

  1. It can be issued in any stream (it takes a stream parameter)
  2. Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.

Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
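In miniature (the allocation names here are just for illustration):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    int *h_pinned, *d_buf;
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);                     // item 1: copies go into a stream

    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);
    // Item 2: the call above has already returned; the transfer may still be
    // in flight, so the host is free to do unrelated work here...

    cudaStreamSynchronize(s);                 // ...but must synchronize before
                                              // reusing h_pinned or assuming the
                                              // data has arrived in d_buf
    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```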

Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:

  1. The overlap copy/compute section of the CUDA best practices guide.
  2. Sample code showing a basic implementation of copy/compute overlap.
  3. Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.

Note that some of the above discussion is predicated on having a device of compute capability 2.0 or greater (e.g. for concurrent kernels). Also, different devices may have one or two copy engines, meaning that simultaneous copy to the device and copy from the device is only possible on certain devices.
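You can check what your device supports with a short query, for example:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // device 0
    printf("compute capability : %d.%d\n", prop.major, prop.minor);
    printf("concurrent kernels : %s\n", prop.concurrentKernels ? "yes" : "no");
    // 1 copy engine: one direction at a time; 2: simultaneous H2D and D2H
    printf("copy engines       : %d\n", prop.asyncEngineCount);
    return 0;
}
```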
