
Double precision CUDA code being faster than single precision counterpart for a fixed data size

I have implemented an algorithm in CUDA, and it seems to run faster with double precision than with single precision.

I know that single precision is usually faster on GPUs. My GPU is an Nvidia GeForce GT 650M.

The algorithm pseudocode is the following:

for k to numIterations
    for j to numRowsOfAMatrix
        CUDAmemset(double arrayGPU)
        CUBLASdotproduct(double arrayGPU,double arrayGPU) [using cublasDdot]
        CUBLASdotproduct(double arrayGPU,double arrayGPU) [using cublasDdot]
        CUBLASscalarVectorMultiplication(scalarCPU,double arrayGPU) [using cublasDaxpy]
        CUBLASvectorSum(double arrayGPU,double arrayGPU) [using cublasDaxpy]
    end
end 

I've run some tests with the following properties: arrays are of length 2500, and the matrix row length is 2700.

The times that I'm obtaining are the following:

50 iterations:

20.9960 seconds for single

20.1881 seconds for double

200 iterations:

81.9562 seconds for single

78.9490 seconds for double

500 iterations:

199.661 seconds for single

199.045 seconds for double

1000 iterations:

413.129 seconds for single

396.205 seconds for double

Any idea why double precision is faster?

I don't believe you can say that the double precision version is faster than the single precision version. Your own timing shows that both take about 20 seconds for 50 iterations and about 200 seconds for 500 iterations. The question then becomes: why?

To me it just looks like your code is dominated by API and PCI-e bus latency. Even the two-times memory bandwidth difference between single and double precision is probably irrelevant in this case. If each array is only about 2500 elements long, then the arithmetic and device memory transaction portions of the calculation will be absolutely tiny compared to the overall execution time.

Looking at your pseudocode shows why. At each iteration, the two dot calls each launch one or more kernels, wait for them to finish, then download a scalar result from the device. Then scalars have to be uploaded to the device for each axpy call, followed by a kernel launch. From the information in comments, this means your code performs perhaps two blocking memory copies and six kernel launches per input row, and there are 2700 input rows per iteration. That means your code is performing 10-15 thousand GPU API calls per iteration, which is a lot of transactions and API latency (especially if you are doing this on a WDDM Windows platform) for nothing more than a few thousand FLOPs and a few tens of kB of GPU memory access per row.

The fact that your GPU has 12 times higher peak single precision than double precision arithmetic throughput is irrelevant in this case, because the computation time is a vanishingly small fraction of the total wall clock time you measure.

The difference in computational cost between two algorithms (in your case, the single and double precision versions) is generally measured by the asymptotic computational complexity. It is not surprising that double precision can have the same performance as single precision for a fixed (small, in your case) vector length, for the reasons explained by talonmies (latency). To really state which algorithm is faster, you should analyze the timing against the vector length N, starting from small and going to large values of N.

Another example, which however has nothing to do with GPGPU, is the FFT, which has an asymptotic complexity of O(N log N) and so is more convenient than "brute-force" summation of the DFT, which has O(N^2) complexity. But if you compare the timing between the FFT and "brute-force" DFT summation for very small values of N, you will find that the "brute-force" DFT summation takes the least time.
