
Measuring CUDA Allocation Time

I need to measure the time difference between allocating normal CPU memory with new and a call to cudaMallocManaged. We are working with unified memory and are trying to figure out the trade-offs of switching things to cudaMallocManaged. (The kernels seem to run a lot slower, likely due to a lack of caching or something.)

Anyway, I am not sure of the best way to time these allocations. Would one of boost's process_real_cpu_clock, process_user_cpu_clock, or process_system_cpu_clock give me the best results? Or should I just use the regular system time call in C++11? Or should I use the cudaEvent stuff for timing?

I figure that I shouldn't use the CUDA events, because they are for timing GPU processes and would not be accurate for timing CPU calls (correct me if I am wrong there). If I could use cudaEvents on just the cudaMallocManaged call, what would be most accurate to compare against when timing the new call? I just don't know enough about memory allocation and timing. Everything I read seems to just make me more confused due to boost's and nvidia's shoddy documentation.

You can use CUDA events to measure the time of functions executed on the host.

cudaEventElapsedTime computes the elapsed time between two events (in milliseconds, with a resolution of around 0.5 microseconds).

Read more at: http://docs.nvidia.com/cuda/cuda-runtime-api/index.html

In addition, if you are also interested in timing your kernel execution, you will find that the CUDA event API automatically blocks the execution of your code and waits until any asynchronous call ends (like a kernel call).

In any case, you should use the same metric (always CUDA events, or boost, or your own timer) to ensure the same resolution and overhead.

The profiler `nvprof` shipped with the CUDA toolkit may help to understand and optimize the performance of your CUDA application.

Read more at: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

I recommend:

#include <chrono>
#include <iostream>

auto t0 = std::chrono::high_resolution_clock::now();
// what you want to measure
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";

This will output the difference in seconds, represented as a double.

Allocation algorithms usually optimize themselves as they go along. That is, the first allocation is often more expensive than the second, because caches of memory are created during the first in anticipation of the second. So you may want to put the thing you're timing in a loop, and average the results.

Some implementations of std::chrono::high_resolution_clock have been less than spectacular, but they are improving with time. You can assess your implementation with:

#include <chrono>
#include <iostream>

auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";

That is, how fast can your implementation get the current time? If it is slow, two successive calls will show a large time in between. On my system (at -O3) this outputs on the order of:

1.2e-07s

which means I can time something that takes on the order of 1 microsecond. To get a finer measurement than that, I have to loop over many operations and divide by the number of operations, subtracting out the loop overhead if that would be significant.

If your implementation of std::chrono::high_resolution_clock appears to be unsatisfactory, you may be able to build your own chrono clock along the lines of this. The disadvantage is obviously a bit of non-portable work. However, you get the std::chrono duration and time_point infrastructure for free (time arithmetic and units conversion).
