
Matlab + CUDA slow in solving matrix-vector equation A*x=B

I am solving the equation A*x = B, where A is a matrix, B is a vector, and x is the unknown solution vector.

Hardware specs: Intel i7 3630QM (4 cores), nVidia GeForce GT 640M (384 CUDA cores)

Here's an example:

>> A=rand(5000);

>> B=rand(5000,1);

>> Agpu=gpuArray(A);

>> Bgpu=gpuArray(B);

>> tic;A\B;toc;

Elapsed time is 1.382281 seconds.

>> tic;Agpu\Bgpu;toc;

Elapsed time is 4.775395 seconds.

Somehow the GPU is much slower... why? It is also slower in FFT, INV, and LU calculations, which should be related to matrix division.
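One caveat worth checking before comparing the numbers: gpuArray operations run asynchronously, so a bare tic/toc can return before the GPU has finished. A minimal sketch of more reliable timing, assuming your release provides wait(gpuDevice) and gputimeit (the latter was added in R2013b):

```matlab
A    = rand(5000);   B    = rand(5000,1);
Agpu = gpuArray(A);  Bgpu = gpuArray(B);

% Force the GPU to finish before toc, otherwise the measurement
% may include only the kernel launch, not the computation.
tic; x = Agpu\Bgpu; wait(gpuDevice); toc

% gputimeit warms up, synchronizes, and averages several runs.
t = gputimeit(@() Agpu\Bgpu);
fprintf('mldivide on GPU: %.4f s\n', t);
```

Even with correct timing the GPU may still lose here, but at least the comparison is then apples to apples.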

However, the GPU is much faster at matrix multiplication (same data):

>> tic;A*B;toc;

Elapsed time is 0.014700 seconds.

>> tic;Agpu*Bgpu;toc;

Elapsed time is 0.000505 seconds.

The main question is: why is GPU A\B (mldivide) so slow compared to the CPU?

UPDATE

Here are some more results when A, B (on CPU) and AA, BB (on GPU) are rand(5000):

>> tic;fft(A);toc;
Elapsed time is *0.117189* seconds.
>> tic;fft(AA);toc;
Elapsed time is 1.062969 seconds.
>> tic;fft(AA);toc;
Elapsed time is 0.542242 seconds.
>> tic;fft(AA);toc;
Elapsed time is *0.229773* seconds.
>> tic;fft(AA);toc;

The bold times are the stable ones, and even then the GPU is almost twice as slow. By the way, why is the GPU even slower on the first two attempts? Is something being compiled during the first calls?
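The slow first calls are consistent with one-time costs (loading the CUDA libraries and compiling kernels for the device), so a common pattern is to warm up before timing and then take the best of several runs. A hedged sketch, assuming wait(gpuDevice) is available:

```matlab
AA = gpuArray(rand(5000));

fft(AA); wait(gpuDevice);   % warm-up run: absorbs first-call overhead

% Best-of-5 timing, synchronizing before each toc so the
% measurement covers the whole computation.
tmin = inf;
for k = 1:5
    tic; fft(AA); wait(gpuDevice); tmin = min(tmin, toc);
end
fprintf('fft on GPU (best of 5): %.4f s\n', tmin);
```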

In addition:

>> tic;sin(A);toc;
Elapsed time is *0.121008* seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.020448 seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.157209 seconds.
>> tic;sin(AA);toc;
Elapsed time is *0.000419* seconds.

After two calculations, the GPU is dramatically faster at sin calculations.

So, still: why is the GPU so slow at matrix division, fft, and similar calculations, while it is so fast at matrix multiplication and trigonometry? Actually, the question should not even arise... the GPU should be faster in all these calculations, because Matlab ships overloaded GPU versions of these functions (mldivide, fft).

Could somebody help me solve these issues, please? :)

Please read how Matlab computes these solutions; it will help you understand why the GPU is slower.

I'll try to say it in a few words.

A*x = b becomes L*(U*x = y) = b, where L*U = A.

  1. Matlab factors A into L*U. (As far as I know, this process cannot be done fully in parallel; only some of its steps can be, due to their nature.)
  2. Then Matlab solves L*y = B to find y. (This process cannot be parallelized, since each step requires data from the previous one.)
  3. Then Matlab solves U*x = y to find x. (Again sequential: each step requires data from the previous one.)

So if the GPU clock is slower than the CPU's, and the process cannot be parallelized, the CPU is faster. And no, unless you come up with a better method (good luck!), the GPU will always be slower, except in some very specific cases.
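The sequential dependence in steps 2 and 3 is easy to see in a forward-substitution sketch for L*y = b: each y(i) needs every earlier y, so the outer loop is inherently serial and only the inner dot product offers any parallelism.

```matlab
n = 5;
L = tril(rand(n)) + n*eye(n);   % small, well-conditioned lower-triangular example
b = rand(n,1);

y = zeros(n,1);
for i = 1:n
    % y(i) depends on y(1:i-1), so iterations cannot run concurrently.
    y(i) = (b(i) - L(i,1:i-1)*y(1:i-1)) / L(i,i);
end

norm(L*y - b)   % residual should be near zero
```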

Part 1 of the explanation is in the answer from user2230360, but your question is twofold, so I'll add a bit about the multiplication.

As noted already, the LU factorization is not easily parallelized, even if some of its steps can be. Matrix multiplication, however, is highly parallelizable. If you work with these things, you should be able to do matrix multiplication by hand, and then you will know that the elements of C in A*B = C can be computed in any order you like - hence the possibility of parallel computation. That is probably why you are seeing lightning-fast multiplication but slow solving of linear systems: one cannot be parallelized as much as the other.
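The independence of the entries can be made concrete: each C(i,j) uses only row i of A and column j of B, so a GPU can assign one thread per entry. A naive element-by-element sketch in Matlab:

```matlab
A = rand(4); B = rand(4);
C = zeros(4);
for i = 1:4
    for j = 1:4
        % Each entry reads only A(i,:) and B(:,j); the (i,j)
        % iteration order is irrelevant, so all 16 entries could
        % be computed simultaneously.
        C(i,j) = A(i,:)*B(:,j);
    end
end
norm(C - A*B)   % matches the built-in multiply, up to rounding
```

Triangular solves offer no such per-entry independence, which is why the two operations scale so differently on a GPU.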
