Calculating performance of CUFFT
I am running CUFFT on chunks (N*N/p) distributed across multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it:
Gflops = ( 1e-9 * 5 * N * N *lg(N*N) ) / execution time
and the execution time is calculated as:
execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of the FFT?
Thanks.
If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real-valued transform), but the GFLOP formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is
operation count / wall clock time
In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) and use that as the execution time, or use this:
execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
As it stands, your calculation represents the serial execution time. Allowing for the overheads of the multi-GPU scheme, I would expect the performance numbers you are calculating to be lower than those of the equivalent transform done on a single GPU.
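The difference between the two timing conventions can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `fft2d_gflops` and the per-GPU timings are made up, and taking the maximum presumes that all GPUs start at the same moment and run fully concurrently. A measured wall clock time is still the more reliable number.

```python
import math

def fft2d_gflops(n, per_gpu_seconds, complex_transform=True):
    """Estimate GFLOP/s for an n x n 2D FFT split across several GPUs.

    per_gpu_seconds holds, for each GPU, its total of
    memcpyHtoD + kernel + memcpyDtoH times over the row and column passes.
    """
    points = n * n
    # Conventional operation count: 5*M*log2(M) for a complex transform,
    # 2.5*M*log2(M) for a real-valued one, with M = n*n points.
    factor = 5.0 if complex_transform else 2.5
    flops = factor * points * math.log2(points)
    # Assumption: the GPUs run concurrently, so the slowest one sets the elapsed time.
    elapsed = max(per_gpu_seconds)
    return 1e-9 * flops / elapsed

# Hypothetical per-GPU totals in seconds (not measured data).
timings = [0.010, 0.012, 0.011, 0.0125]

parallel_estimate = fft2d_gflops(4096, timings)        # max over GPUs
serial_estimate = fft2d_gflops(4096, [sum(timings)])   # sum over GPUs, as in the question
```

Summing the per-GPU times, as in the question, always gives a figure no higher than the max-based one, because the sum is at least as large as the largest individual time.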