
Measuring the total time taken by the kernel when using streams

I am looking to analyse the total time spent on the kernels, which are launched multiple times, and was wondering whether this code would give me the total time spent on the streamed kernels, or whether the time returned needs to be multiplied by the number of launches.

cudaEvent_t start, stop;    
cudaEventCreate(&start);
cudaEventCreate(&stop);


for(x=0; x<SIZE; x+=N*2){

     gpuErrchk(cudaMemcpyAsync(data_d0, data_h+x, N*sizeof(char), cudaMemcpyHostToDevice, stream0));
     gpuErrchk(cudaMemcpyAsync(data_d1, data_h+x+N, N*sizeof(char), cudaMemcpyHostToDevice, stream1));


     gpuErrchk(cudaMemcpyAsync(array_d0, array_h, wrap->size*sizeof(node_r), cudaMemcpyHostToDevice, stream0));
     gpuErrchk(cudaMemcpyAsync(array_d1, array_h, wrap->size*sizeof(node_r), cudaMemcpyHostToDevice, stream1));

     cudaEventRecord(start, 0);
        GPU<<<N/512,512,0,stream0>>>(array_d0, data_d0, out_d0 );
        GPU<<<N/512,512,0,stream1>>>(array_d1, data_d1, out_d1);
     cudaEventRecord(stop, 0);

     gpuErrchk(cudaMemcpyAsync(out_h+x, out_d0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0));
     gpuErrchk(cudaMemcpyAsync(out_h+x+N, out_d1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1));

} 

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("Time %f ms\n", elapsedTime);

It will not capture the total execution time for the kernels for all passes of the loop.

From the documentation:

If cudaEventRecord() has previously been called on event, then this call will overwrite any existing state in event. Any subsequent calls which examine the status of event will only examine the completion of this most recent call to cudaEventRecord().

If you believe that the execution time for each pass through the loop will be approximately the same, then you can just multiply the result by the number of passes.
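For example, with the loop bounds from the question, that estimate is a single multiplication. A sketch (it assumes SIZE is an exact multiple of 2*N and that every pass takes roughly the same time):

```cuda
// Estimate total kernel time from one measured pass.
// Assumes SIZE is an exact multiple of 2*N and all passes behave alike.
int numPasses = SIZE / (2 * N);                   // iterations of the x-loop
float totalKernelTime = elapsedTime * numPasses;  // elapsedTime measured for one pass
printf("Estimated total kernel time: %f ms\n", totalKernelTime);
```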

Note that you should issue a cudaEventSynchronize() call on the stop event before the call to cudaEventElapsedTime().
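Putting both points together, a sketch of the timing loop from the question with the events hoisted out of the loop (identifiers are taken from the question; note that this measures all of the streamed work issued between start and stop, the async copies included, not just the kernel bodies):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);   // record once, before the loop, on the NULL stream

for (x = 0; x < SIZE; x += N * 2) {
    gpuErrchk(cudaMemcpyAsync(data_d0, data_h + x,     N * sizeof(char), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(data_d1, data_h + x + N, N * sizeof(char), cudaMemcpyHostToDevice, stream1));

    gpuErrchk(cudaMemcpyAsync(array_d0, array_h, wrap->size * sizeof(node_r), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(array_d1, array_h, wrap->size * sizeof(node_r), cudaMemcpyHostToDevice, stream1));

    GPU<<<N/512, 512, 0, stream0>>>(array_d0, data_d0, out_d0);
    GPU<<<N/512, 512, 0, stream1>>>(array_d1, data_d1, out_d1);

    gpuErrchk(cudaMemcpyAsync(out_h + x,     out_d0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0));
    gpuErrchk(cudaMemcpyAsync(out_h + x + N, out_d1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1));
}

cudaEventRecord(stop, 0);    // record once, after all work has been issued
cudaEventSynchronize(stop);  // block until the stop event has actually completed

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Total time %f ms\n", elapsedTime);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the legacy NULL stream, the stop event completes only after the work previously issued to stream0 and stream1 has drained, so the interval covers every pass of the loop.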

Event-based timing was added to CUDA to enable fine-grained timing of on-chip execution (for example, you should get an accurate time even if only one kernel invocation is bracketed by the event start/stop calls). But streams and out-of-order execution introduce ambiguity into the meaning of the "timestamp" recorded by cudaEventRecord(). cudaEventRecord() takes a stream parameter, and as far as I know it respects that stream parameter; but the stream's execution can be affected by other streams, e.g. if they are contending for some resource.

So it is best practice to call cudaEventRecord() on the NULL stream to serialize.

Interestingly, Intel has a similar history with RDTSC, where they introduced superscalar execution and timestamp recording in the same product. (For NVIDIA, it was CUDA 1.1; for Intel, it was the Pentium.) And similarly, Intel had to revise their guidance to developers who relied on RDTSC being a serializing instruction, telling them to serialize explicitly to get meaningful timing results.

Why isn't RDTSC a serializing instruction?

