
Measuring Elapsed Time for an OpenCL Application

I know this question has been asked several times, but in my application it is critical to get the timing right, so I want to try again:

I measure the time for a kernel call like this, first with CPU clock time using clock_t:

clock_t start = clock(); // or std::chrono::system_clock::now() for wall-clock time
openCLFunction();
clock_t end = clock();   // or std::chrono::system_clock::now() for wall-clock time
double time_elapsed = double(end - start) / CLOCKS_PER_SEC; // CPU time in seconds

And my openCLFunction():

{
    // ... enqueue the OpenCL kernel here ...
    clFlush(queue);
    clFinish(queue);
}

There is a big difference in the results between the two methods, and to be honest I don't know which is right, because they are in milliseconds. Can I trust the CPU clock time for this? Is there a definitive way to measure without worrying about the results? (Note that I call two functions to finish my kernel function.)

You should probably be using kernel profiling via OpenCL events:

cl_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, properties, &err);

/*Later...*/
cl_event event;
clEnqueueNDRangeKernel(queue, kernel, /*...*/, &event);
clWaitForEvents(1, &event);
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, nullptr);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, nullptr);

std::chrono::nanoseconds duration{end - start};

At the end of that code, duration contains the number of nanoseconds (reported as precisely as the device is capable; note that many devices don't have sub-microsecond precision) that passed between the start and the end of the kernel's execution.
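If you want to report that value in a coarser unit, std::chrono::duration_cast does the conversion. A minimal sketch, reusing the start/end values queried in the snippet above:

#include <chrono>
#include <iostream>

// start and end are the cl_ulong values returned by clGetEventProfilingInfo above
std::chrono::nanoseconds duration{end - start};

// duration_cast truncates toward zero when converting to a coarser unit
auto us = std::chrono::duration_cast<std::chrono::microseconds>(duration);
std::cout << "Kernel time: " << us.count() << " us ("
          << duration.count() << " ns)\n";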

There are (at least) 3 ways to time OpenCL/CUDA execution:

  1. Use of CPU timers + queue flushing
  2. Use of OpenCL / CUDA events
  3. Use of an external profiler tool (e.g. AMD's profiling tools, or nvprof for NVIDIA cards)

Your first example falls in the first category, but you don't seem to be flushing the queue that the OpenCL function uses (I'm assuming that's a function enqueueing a kernel). So, unless the execution is somehow forced to be synchronous, what you would be measuring is the time it takes to enqueue the kernel plus whatever CPU-side work you do before or after that. That could explain the discrepancy with the clFlush/clFinish method.
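For completeness, here is a minimal sketch of what a correct host-side measurement (category 1) could look like: drain the queue before starting the timer, enqueue, then block with clFinish before stopping it. The function name and the global_size parameter are placeholders, and std::chrono::steady_clock is used instead of clock() because it measures wall-clock time monotonically:

#include <chrono>
#include <iostream>
#include <CL/cl.h>

// Assumes `queue` and `kernel` are already set up; `global_size` is a placeholder.
void time_kernel_host_side(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
    // Make sure no previously enqueued work is still pending before timing starts.
    clFinish(queue);

    auto t0 = std::chrono::steady_clock::now();

    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                           nullptr, 0, nullptr, nullptr);

    // Block until the kernel has actually finished on the device.
    clFinish(queue);

    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = t1 - t0;
    std::cout << "Enqueue + execution took " << elapsed.count() << " ms\n";
}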

Another reason for the discrepancy could be setup/tear-down work (eg memory allocation or run-time internal overhead) which your second method times and your first does not.

A final note is that all three methods will produce slightly different results due to measurement inaccuracy or to differences in the overheads required to use them. These differences may not be so slight if your kernels are small, though: in my experience, profiler-reported kernel execution times and event-measured times, in CUDA on NVIDIA Maxwell and Pascal cards, can differ by dozens of microseconds. The lessons from that are: 1. Measure over more data when relevant and possible, and normalize by the amount of data. 2. Be consistent in how you measure execution times when making comparisons.
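As a rough illustration of the first lesson, one could launch the kernel several times and average the event-profiled times (and, if the amount of work per launch varies, normalize by it). This is only a sketch: it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE as shown earlier, and the function name, iteration count, and global_size are placeholders:

#include <CL/cl.h>

// Average the profiled execution time of `kernel` over `iterations` launches.
// Requires a queue created with CL_QUEUE_PROFILING_ENABLE.
double average_kernel_time_ns(cl_command_queue queue, cl_kernel kernel,
                              size_t global_size, int iterations)
{
    double total_ns = 0.0;
    for (int i = 0; i < iterations; ++i) {
        cl_event event;
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                               nullptr, 0, nullptr, &event);
        clWaitForEvents(1, &event);

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                                sizeof(cl_ulong), &start, nullptr);
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                                sizeof(cl_ulong), &end, nullptr);
        clReleaseEvent(event);

        total_ns += static_cast<double>(end - start);
    }
    return total_ns / iterations;
}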
