
Measuring CUDA Allocation time

I need to measure the time difference between allocating normal CPU memory with new and a call to cudaMallocManaged. We are working with unified memory and are trying to figure out the trade-offs of switching things to cudaMallocManaged. (The kernels seem to run a lot slower, likely due to a lack of caching or something.)

Anyway, I am not sure of the best way to time these allocations. Would one of boost's process_real_cpu_clock, process_user_cpu_clock, or process_system_cpu_clock give me the best results? Or should I just use the regular system time calls in C++11? Or should I use the cudaEvent stuff for timing?

I figure that I shouldn't use the CUDA events, because they are for timing GPU processes and would not be accurate for timing CPU calls (correct me if I am wrong there). If I could use the CUDA events on just the cudaMallocManaged call, what would be the most accurate thing to compare against when timing the new call? I just don't know enough about memory allocation and timing. Everything I read seems to just make me more confused due to boost's and NVIDIA's shoddy documentation.

You can use CUDA events to measure the time of functions executed on the host.

cudaEventElapsedTime computes the elapsed time between two events (in milliseconds with a resolution of around 0.5 microseconds).

Read more at: http://docs.nvidia.com/cuda/cuda-runtime-api/index.html
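
For example, a minimal sketch of timing a cudaMallocManaged call with events on the default stream (the 4 MB size is an arbitrary choice and error checking is omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);              // enqueue the start marker on the default stream
    void* p = nullptr;
    cudaMallocManaged(&p, 4 << 20);      // host-side call being timed (4 MB, arbitrary size)
    cudaEventRecord(stop);               // enqueue the stop marker after the call returns

    cudaEventSynchronize(stop);          // block the host until the stop event has completed
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("cudaMallocManaged: %f ms\n", ms);

    cudaFree(p);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}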

In addition, if you are also interested in timing your kernel execution, note that cudaEventSynchronize blocks the host until the event has been recorded, so any asynchronous work preceding it in the stream (such as a kernel launch) has finished by the time you read the elapsed time.

In any case, you should use the same timing method for both measurements (always CUDA events, or Boost, or your own timer) so that the comparison has the same resolution and overhead.

The profiler nvprof, shipped with the CUDA Toolkit, may help you understand and optimize the performance of your CUDA application.

Read more at: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

I recommend:

#include <chrono>
#include <iostream>

auto t0 = std::chrono::high_resolution_clock::now();
// what you want to measure
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";

This will output the difference in seconds represented as a double.

Allocation algorithms usually optimize themselves as they go along. That is, the first allocation is often more expensive than the second because caches of memory are created during the first in anticipation of the second. So you may want to put the thing you're timing in a loop, and average the results.
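
For example, a rough sketch of that approach, comparing new against cudaMallocManaged (the iteration count and 1 MiB allocation size are arbitrary choices, error checking is omitted, and the matching free is inside each loop, so the figures are really allocate-plus-free times):

#include <chrono>
#include <cstddef>
#include <cuda_runtime.h>
#include <iostream>

int main() {
    constexpr int iterations = 100;
    constexpr std::size_t bytes = 1 << 20;       // 1 MiB per allocation, arbitrary

    cudaFree(nullptr);                           // warm-up: create the CUDA context before timing

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        char* p = new char[bytes];
        p[0] = 1;                                // touch the memory so the allocation is less likely to be elided
        delete[] p;
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    auto t2 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        void* p = nullptr;
        cudaMallocManaged(&p, bytes);
        cudaFree(p);
    }
    auto t3 = std::chrono::high_resolution_clock::now();

    std::cout << "new:               "
              << std::chrono::duration<double>(t1 - t0).count() / iterations << " s/alloc\n"
              << "cudaMallocManaged: "
              << std::chrono::duration<double>(t3 - t2).count() / iterations << " s/alloc\n";
}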

Some implementations of std::chrono::high_resolution_clock have been less than spectacular, but are improving with time. You can assess your implementation with:

auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";

That is, how fast can your implementation get the current time? If it is slow, two successive calls will demonstrate a large time in-between. On my system (at -O3) this outputs on the order of:

1.2e-07s

which means I can time something that takes on the order of 1 microsecond. To get a finer measurement than that I have to loop over many operations, and divide by the number of operations, subtracting out the loop overhead if that would be significant.
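
A sketch of that batching idea, using a volatile increment as a stand-in for the cheap operation being measured (the second, nearly empty reference loop only roughly approximates the loop overhead):

#include <chrono>
#include <iostream>

int main() {
    constexpr long N = 10000000;
    volatile long sink = 0;                  // volatile keeps the loops from being optimized away

    auto t0 = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < N; ++i)
        sink = sink + 1;                     // stand-in for the cheap operation under test
    auto t1 = std::chrono::high_resolution_clock::now();

    auto t2 = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < N; ++i)
        sink = sink;                         // reference loop: estimate the loop overhead
    auto t3 = std::chrono::high_resolution_clock::now();

    double total    = std::chrono::duration<double>(t1 - t0).count();
    double overhead = std::chrono::duration<double>(t3 - t2).count();
    std::cout << (total - overhead) / N << " s per operation (approx.)\n";
}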

If your implementation of std::chrono::high_resolution_clock appears to be unsatisfactory, you may be able to build your own chrono clock along the lines of this. The disadvantage is obviously a bit of non-portable work. However, you get the std::chrono duration and time_point infrastructure for free (time arithmetic and unit conversions).
