
Measuring the total time taken by the kernel when using streams

I am looking to analyse the total time spent in the kernels across multiple launches, and I was wondering whether this code gives me the total time spent on the streamed kernels, or whether the time returned needs to be multiplied by the number of launches.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (x = 0; x < SIZE; x += N * 2) {

    gpuErrchk(cudaMemcpyAsync(data_d0, data_h + x,     N * sizeof(char), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(data_d1, data_h + x + N, N * sizeof(char), cudaMemcpyHostToDevice, stream1));

    gpuErrchk(cudaMemcpyAsync(array_d0, array_h, wrap->size * sizeof(node_r), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(array_d1, array_h, wrap->size * sizeof(node_r), cudaMemcpyHostToDevice, stream1));

    cudaEventRecord(start, 0);
    GPU<<<N/512, 512, 0, stream0>>>(array_d0, data_d0, out_d0);
    GPU<<<N/512, 512, 0, stream1>>>(array_d1, data_d1, out_d1);
    cudaEventRecord(stop, 0);

    gpuErrchk(cudaMemcpyAsync(out_h + x,     out_d0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0));
    gpuErrchk(cudaMemcpyAsync(out_h + x + N, out_d1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1));
}

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("Time %f ms\n", elapsedTime);

No, it will not capture the total execution time of the kernels across all passes of the loop.

From the documentation:

If cudaEventRecord() has previously been called on event, then this call will overwrite any existing state in event. Any subsequent calls which examine the status of event will only examine the completion of this most recent call to cudaEventRecord().

If you believe that the execution time for each pass through the loop will be approximately the same, then you can just multiply the result by the number of passes.

Note that you should issue a cudaEventSynchronize() call on the stop event before calling cudaEventElapsedTime(); otherwise the stop event may not have completed when you read the elapsed time.
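If you want the total wall-clock time for all passes with a single event pair, one option is to record the start event once before the loop and the stop event once after it, then synchronize on the stop event. A minimal sketch (the loop body is abbreviated to the kernel launches; `GPU`, `N`, `SIZE`, the streams, and the device pointers are assumed to be set up as in the question):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record once, before any work is issued (NULL stream).
cudaEventRecord(start, 0);

for (int x = 0; x < SIZE; x += N * 2) {
    GPU<<<N/512, 512, 0, stream0>>>(array_d0, data_d0, out_d0);
    GPU<<<N/512, 512, 0, stream1>>>(array_d1, data_d1, out_d1);
}

// Record once, after the last launch has been issued.
cudaEventRecord(stop, 0);

// Block the host until the stop event has actually completed,
// so the elapsed time covers every pass of the loop.
cudaEventSynchronize(stop);

float totalMs = 0.0f;
cudaEventElapsedTime(&totalMs, start, stop);
printf("Total time %f ms\n", totalMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Note this measures everything issued between the two records, including the async copies if they remain in the loop, not just kernel execution.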

Event-based timing was added to CUDA to enable fine-grained timing of on-chip execution (for example, you should get an accurate time even if only one kernel invocation is bracketed by the event start/stop calls). But streams and out-of-order execution introduce ambiguity into the meaning of the "timestamp" recorded by cudaEventRecord(). cudaEventRecord() takes a stream parameter, and as far as I know it respects that stream parameter; but the stream's execution can be affected by other streams, e.g. if they are contending for some resource.

So it is best practice to call cudaEventRecord() on the NULL stream to serialize.
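To illustrate the difference (a sketch assuming the legacy default-stream semantics, where the NULL stream synchronizes with other blocking streams):

```cuda
cudaEvent_t evt;
cudaEventCreate(&evt);

// Recorded in stream0: the event completes when stream0 reaches this
// point, regardless of what stream1 is still doing.
cudaEventRecord(evt, stream0);

// Recorded in the NULL stream: under legacy default-stream semantics,
// the record is ordered after all previously issued work in all
// blocking streams, which serializes and gives an unambiguous timestamp.
cudaEventRecord(evt, 0);
```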

Interestingly, Intel has a similar history with RDTSC, where they introduced superscalar execution and timestamp recording in the same product. (For NVIDIA, it was CUDA 1.1; for Intel, it was the Pentium.) And similarly, Intel had to revise their guidance to developers who relied on RDTSC being a serializing instruction, telling them to serialize explicitly to get meaningful timing results.

Why isn't RDTSC a serializing instruction?
