

Why is solving a system of linear equations using CULA (dgesv) slower than MKL (dgesv) for small data sets?

I have written a CUDA C program and a C program that solve the matrix equation Ax = b, using the CULA routine dgesv and the MKL routine dgesv respectively. For small data sets, the CPU program is faster than the GPU program, but the GPU overtakes the CPU once the data set grows past 500. I am using my Dell laptop, which has an i3 CPU and a GeForce 525M GPU. What is the best explanation for the GPU's initially slow performance?

I wrote another program that takes two vectors, multiplies them, and adds the result. This is like the dot product, except that the result is a vector sum, not a scalar. In this program, the GPU is faster than the CPU even for small data sets. I am using the same notebook. Why is the GPU faster here even for small data sets, unlike in the case described above? Is it because the summation does not involve much computation?

It's not uncommon for GPUs to be less interesting on small data sets than on large ones. The reasons vary with the specific algorithm. GPUs generally have higher main memory bandwidth than CPUs and can usually outperform them at heavy-duty number crunching. But GPUs usually only work well when the problem has inherent parallelism that can be exposed. Taking advantage of this parallelism lets an algorithm tap into the greater memory bandwidth as well as the higher compute capability.

However, before the GPU can do anything, the data has to get to the GPU. This creates a "cost" for the GPU version of the code that is not normally present in the CPU version.

To be more precise, the GPU provides a benefit when the reduction in computation time on the GPU (relative to the CPU) exceeds the cost of the data transfer. I believe that solving a system of linear equations is somewhere between O(n^2) and O(n^3) complexity; the LU factorization that dgesv performs is O(n^3) for a dense n x n matrix. For very small n, the computational work may not be large enough to offset the cost of data transfer, but as n becomes larger it clearly should be. Your vector operation, on the other hand, may be only O(n) complexity, so the benefit scenario will look different.
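The break-even condition above can be sketched as a small cost model. This is only an illustration: the flop rates, bus bandwidth, and fixed per-call latency are made-up assumptions, not measurements of an i3 or a GeForce 525M, and the function names are hypothetical. It also assumes the GPU sustains a constant flop rate at every n, which real hardware does not do for very small problems.

```python
# Rough cost model for the CPU/GPU break-even point of a dense solve.
# Every hardware number here is an illustrative assumption, not a measurement.

def solve_flops(n):
    """Approximate flops for an LU-based dense solve with one right-hand side."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def transfer_bytes(n):
    """Bytes moved in double precision: A and b to the device, x back."""
    return 8 * (n * n + n) + 8 * n

def cpu_time(n, cpu_flops=10e9):
    """CPU time: compute only, since the data is already in host memory."""
    return solve_flops(n) / cpu_flops

def gpu_time(n, gpu_flops=40e9, bus_bytes_per_s=3e9, fixed_overhead_s=2e-3):
    """GPU time: fixed per-call latency (assumed) + transfer + compute."""
    compute = solve_flops(n) / gpu_flops
    transfer = transfer_bytes(n) / bus_bytes_per_s
    return fixed_overhead_s + transfer + compute

def break_even_n(max_n=20000, **gpu_kwargs):
    """Smallest n at which the modelled GPU time beats the modelled CPU time."""
    for n in range(1, max_n + 1):
        if gpu_time(n, **gpu_kwargs) < cpu_time(n):
            return n
    return None
```

Calling `break_even_n()` returns the first problem size at which the modelled GPU wins. Changing `fixed_overhead_s` or `bus_bytes_per_s` shifts that crossover, which is the effect described above; the specific value it returns means nothing for real hardware.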

For the O(n^2) or O(n^3) case, as we move to larger data sets, the "cost" to transfer the data grows more slowly than the compute requirements for the solution. For a dense n x n system, the data to transfer grows as O(n^2) (the matrix itself), while the work of the solve grows as O(n^3). Larger data sets therefore have disproportionately larger compute workloads, which reduces the effect of the "cost" of the data transfer. An O(n) problem, on the other hand, probably won't have this scaling dynamic: the workload increases at the same rate as the "cost" of data transfer.
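One way to see this scaling difference is to compare flops per byte transferred for the two problems. The sketch below assumes double precision, an LU-based solve with a single right-hand side, and an elementwise vector multiply (one flop per element, two inputs sent and one output returned). That reading of the vector program is an assumption, and the function names are hypothetical.

```python
# Flops performed per byte moved across the bus, for each kind of problem.
# The operation counts are standard estimates; the vector operation is an
# assumed elementwise multiply, since the original program is not shown.

def solver_flops_per_byte(n):
    """Dense n x n solve: about 2n^3/3 + 2n^2 flops over 8(n^2 + 2n) bytes."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    bytes_moved = 8 * (n * n + 2 * n)  # A and b in, x out
    return flops / bytes_moved

def vector_flops_per_byte(n):
    """Elementwise multiply: n flops over 24n bytes (a and b in, c out)."""
    flops = float(n)
    bytes_moved = 8 * 3 * n
    return flops / bytes_moved

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, solver_flops_per_byte(n), vector_flops_per_byte(n))
```

Under these assumptions the solver's ratio grows roughly as n/12, so each transferred byte buys more computation as n increases, while the vector operation's ratio stays fixed at 1/24 regardless of n.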

Also note that if the "cost" of transferring data to the GPU can be hidden by overlapping it with computation work, then the "cost" for the overlapped portion becomes "free", i.e. it does not contribute to the overall solution time.
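The effect of overlap can be sketched with a simple two-stage pipeline model. It assumes the work splits into equal, independent chunks so that one chunk is copied while the previous one is computed. That assumption fits a streaming vector operation much better than a single dgesv call, and the function names are hypothetical.

```python
# Two-stage pipeline model: copy chunk i+1 while computing on chunk i.
# Assumes equal, independent chunks and perfect copy/compute concurrency.

def serial_time(copy_s, compute_s):
    """No overlap: the whole copy finishes before any compute starts."""
    return copy_s + compute_s

def overlapped_time(copy_s, compute_s, chunks):
    """Total time when the copy and the compute are pipelined over `chunks` pieces."""
    c = copy_s / chunks      # copy time per chunk
    p = compute_s / chunks   # compute time per chunk
    # The first copy and the last compute are exposed; in between, the
    # slower of the two stages sets the pace.
    return c + (chunks - 1) * max(c, p) + p
```

With 1 s of copy and 3 s of compute, `serial_time(1.0, 3.0)` gives 4.0 and `overlapped_time(1.0, 3.0, 4)` gives 3.25. As the chunk count grows, the total approaches the larger of the two stage times, so most of the transfer cost disappears from the overall time.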

