
Gradient descent and normal equation method for solving linear regression gives different solutions

I'm working on a machine learning problem and want to use linear regression as the learning algorithm. I have implemented two different methods to find the parameters theta of the linear regression model: gradient (steepest) descent and the normal equation. On the same data they should both give approximately the same theta vector. However, they do not.

Both theta vectors are very similar in all elements except the first one, i.e. the coefficient that multiplies the column of ones added to the data (the intercept).

Here is what the thetas look like (the first column is the output of gradient descent, the second the output of the normal equation):

Grad desc Norm eq
-237.7752 -4.6736
-5.8471   -5.8467
9.9174    9.9178
2.1135    2.1134
-1.5001   -1.5003
-37.8558  -37.8505
-1.1024   -1.1116
-19.2969  -19.2956
66.6423   66.6447
297.3666  296.7604
-741.9281 -744.1541
296.4649  296.3494
146.0304  144.4158
-2.9978   -2.9976
-0.8190   -0.8189

What can cause the difference in theta(1, 1) returned by gradient descent compared to theta(1, 1) returned by the normal equation? Do I have a bug in my code?

Here is my implementation of the normal equation in Matlab:

function theta = normalEque(X, y)
    [m, n] = size(X);
    X = [ones(m, 1), X];        % prepend a column of ones for the intercept term
    theta = pinv(X'*X)*X'*y;    % normal equation, solved via the pseudoinverse
end
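For reference, with X denoting the data matrix after the column of ones has been prepended, this computes the usual closed-form least-squares solution (written here in LaTeX notation):

\hat{\theta} = (X^{\top} X)^{-1} X^{\top} y
\quad\text{minimizing}\quad
J(\theta) = \frac{1}{2m} \lVert X\theta - y \rVert_2^2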

Here is the code for gradient descent:

function theta = gradientDesc(X, y)
    options = optimset('GradObj', 'on', 'MaxIter',  9999);
    [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
                    zeros(size(X, 2), 1), options);
end

function [J, grad] = cost(theta, X, y)
    m = size(X, 1);
    X = [ones(m, 1), X];                       % prepend a column of ones, as in normalEque
    J = sum((X * theta - y) .^ 2) ./ (2*m);    % squared-error cost
    for i = 1:size(theta, 1)
        grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;   % partial derivative w.r.t. theta(i)
    end
end
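As an aside (not part of the original question), the per-element loop in cost computes the same gradient as a single matrix product; here is a minimal equivalent sketch, with the hypothetical name costVectorized:

function [J, grad] = costVectorized(theta, X, y)
    % Same cost and gradient as the cost function above, but vectorized.
    m = size(X, 1);
    X = [ones(m, 1), X];              % prepend column of ones, as in cost
    residual = X * theta - y;
    J = sum(residual .^ 2) ./ (2*m);  % identical cost value
    grad = (X' * residual) ./ m;      % grad(i) = sum(residual .* X(:,i)) / m for every i
end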

I pass exactly the same data X and y to both functions (I do not normalize X).

Edit 1:

Based on the answers and comments I checked my code and ran some tests.

First I wanted to check whether the problem could be caused by X being near singular, as suggested by @user1489497's answer. So I replaced pinv by inv, and when I ran it I indeed got the warning Matrix is close to singular or badly scaled. To be sure that this is not the problem, I obtained a much larger dataset and ran the tests with this new dataset. This time inv(X) did not display the warning, and using pinv and inv gave the same results. So I hope that X is not close to singular any more.
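A quick way to quantify how close to singular the design matrix is (a minimal sketch, assuming X is the data matrix from the question, without the ones column):

Xb = [ones(size(X, 1), 1), X];                  % design matrix with intercept column
fprintf('rank = %d of %d columns\n', rank(Xb), size(Xb, 2));
fprintf('cond(X)     = %g\n', cond(Xb));
fprintf('cond(X''*X) = %g\n', cond(Xb' * Xb));  % roughly cond(X)^2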

Then I changed the normalEque code as suggested by woodchips, so now it looks like:

function theta = normalEque(X, y)
    X = [ones(size(X, 1), 1), X];
    theta = pinv(X)*y;
end

However, the problem is still there. The new normalEque function on new data that is not close to singular gives a different theta than gradientDesc.

To find out which algorithm is buggy, I ran the linear regression algorithm of the data mining software Weka on the same data. Weka computed a theta very similar to the output of normalEque but different from the output of gradientDesc. So I guess that normalEque is correct and there is a bug in gradientDesc.

Here is a comparison of the thetas computed by Weka, normalEque and gradientDesc:

Weka(correct) normalEque    gradientDesc
779.8229      779.8163      302.7994
  1.6571        1.6571        1.7064
  1.8430        1.8431        2.3809
 -1.5945       -1.5945       -1.5964
  3.8190        3.8195        5.7486
 -4.8265       -4.8284      -11.1071
 -6.9000       -6.9006      -11.8924
-15.6956      -15.6958      -13.5411
 43.5561       43.5571       31.5036
-44.5380      -44.5386      -26.5137
  0.9935        0.9926        1.2153
 -3.1556       -3.1576       -1.8517
 -0.1927       -0.1919       -0.6583
  2.9207        2.9227        1.5632
  1.1713        1.1710        1.1622
  0.1091        0.1093        0.0084
  1.5768        1.5762        1.6318
 -1.3968       -1.3958       -2.1131
  0.6966        0.6963        0.5630
  0.1990        0.1990       -0.2521
  0.4624        0.4624        0.2921
-12.6013      -12.6014      -12.2014
 -0.1328       -0.1328       -0.1359

I also computed the errors as suggested by Justin Peel's answer. The output of normalEque gives a slightly smaller squared error, but the difference is marginal. What is more, when I compute the gradient of the cost at the theta returned by normalEque using the function cost (the same one used by gradientDesc), I get a gradient near zero. Doing the same on the output of gradientDesc does not give a gradient near zero. Here is what I mean:

>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
>> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
>> disp([J_gd, J_ne])
  120.9932  119.1469
>> disp([grad_gd, grad_ne])
  -0.005172856743846  -0.000000000908598
  -0.026126463200876  -0.000000135414602
  -0.008365136595272  -0.000000140327001
  -0.094516503056041  -0.000000169627717
  -0.028805977931093  -0.000000045136985
  -0.004761477661464  -0.000000005065103
  -0.007389474786628  -0.000000005010731
   0.065544198835505  -0.000000046847073
   0.044205371015018  -0.000000046169012
   0.089237705611538  -0.000000046081288
  -0.042549228192766  -0.000000051458654
   0.016339232547159  -0.000000037654965
  -0.043200042729041  -0.000000051748545
   0.013669010209370  -0.000000037399261
  -0.036586854750176  -0.000000027931617
  -0.004761447097231  -0.000000027168798
   0.017311225027280  -0.000000039099380
   0.005650124339593  -0.000000037005759
   0.016225097484138  -0.000000039060168
  -0.009176443862037  -0.000000012831350
   0.055653840638386  -0.000000020855391
  -0.002834810081935  -0.000000006540702
   0.002794661393905  -0.000000032878097

This would suggest that gradient descent simply did not converge to the global minimum... But that can hardly be the case, as I ran it for thousands of iterations. So where is the bug?

I finally had time to get back to this. There is no "bug".

If the matrix is singular, then there are infinitely many solutions. You can choose any solution from that set and get an equally good answer. The pinv(X)*y solution is a good one that many people like because it is the minimum norm solution.

There is NEVER a good reason to use inv(X)*y. Even worse is to use the inverse on the normal equations: inv(X'*X)*X'*y is simply numerical crap. I don't care who told you to use it, they are guiding you to the wrong place. (Yes, it will work acceptably for problems that are well-conditioned, but most of the time you don't know when it is about to give you crap. So why use it?)

The normal equations are in general a bad thing to do, EVEN if you are solving a regularized problem. There are ways to do that which avoid squaring the condition number of the system, although I won't explain them unless asked, as this answer has gotten long enough.
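One standard way (sketched here, not spelled out in the answer) is to solve the least-squares problem through a thin QR factorization of X itself, so the condition number of X'*X never enters; this assumes X already contains the intercept column:

[Q, R] = qr(X, 0);       % economy-size QR factorization of the design matrix
theta = R \ (Q' * y);    % same least-squares minimizer, without ever forming X'*X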

X\y will also yield a result that is reasonable.

There is ABSOLUTELY no good reason to throw an unconstrained optimizer at the problem, as this will yield results that are unstable and completely dependent on your starting values.

As an example, I'll start with a singular problem.

X = repmat([1 2],5,1);
y = rand(5,1);

>> X\y
Warning: Rank deficient, rank = 1, tol =  2.220446e-15. 
ans =
                         0
         0.258777984694222

>> pinv(X)*y
ans =
         0.103511193877689
         0.207022387755377

pinv and backslash return slightly different solutions. As it turns out, there is a basic solution, to which we can add ANY amount of the nullspace vector for the row space of X.

null(X)
ans =
         0.894427190999916
        -0.447213595499958

pinv generates the minimum norm solution. Of all the solutions that might have resulted, this one has the minimum 2-norm.

In contrast, backslash generates a solution that will have one or more variables set to zero.

But if you use an unconstrained optimizer, it will generate a solution that is completely dependent on your starting values. Again, ANY amount of that null vector can be added to your solution, and you still have an entirely valid solution with the same value of the sum of squares of errors.
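Continuing the small example above, you can check directly that shifting a solution along null(X) leaves the residual unchanged (a minimal sketch):

t0 = pinv(X) * y;                          % minimum norm solution
t1 = t0 + 3 * null(X);                     % add an arbitrary multiple of the null vector
disp(norm(X*t0 - y) - norm(X*t1 - y))      % essentially zero: identical fit
disp([norm(t0), norm(t1)])                 % but t1 has the larger norm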

Note that even though no singularity warning is returned, this need not mean your matrix is not close to singular. You have changed little about the problem, so it is STILL close, just not enough to trigger the warning.

As others mentioned, an ill-conditioned Hessian matrix is likely the cause of your problem.

The number of steps that standard gradient descent algorithms take to reach a local optimum is a function of the largest eigenvalue of the Hessian divided by the smallest (this is known as the condition number of the Hessian). So, if your matrix is ill-conditioned, it can take an extremely large number of iterations for gradient descent to converge to an optimum. (For the singular case, it could converge to many different points, of course.)

I would suggest trying three different things to verify that an unconstrained optimization algorithm works for your problem (which it should):
1) Generate some synthetic data by computing the result of a known linear function for random inputs and adding a small amount of Gaussian noise. Make sure that you have many more data points than dimensions. This should produce a non-singular Hessian (a sketch of this follows after the list).
2) Add a regularization term to your error function to increase the condition number of the Hessian.
3) Use a second-order method like conjugate gradient or L-BFGS rather than gradient descent to reduce the number of steps needed for the algorithm to converge. (You will probably need to do this in conjunction with #2.)
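A minimal sketch of suggestion 1; the sizes and noise level are made up for illustration, and normalEque is the function from the question (the same comparison can be repeated for the gradient-descent output):

m = 2000; n = 15;                         % many more data points than dimensions
theta_true = randn(n + 1, 1);             % known linear function, including intercept
X = randn(m, n);
y = [ones(m, 1), X] * theta_true + 0.01 * randn(m, 1);   % add small Gaussian noise

theta_ne = normalEque(X, y);              % should recover theta_true closely
disp(max(abs(theta_ne - theta_true)))     % expect an error on the order of the noise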

Could you post a little more about what your X looks like? You're using pinv(), which is the Moore-Penrose pseudoinverse. If the matrix is ill-conditioned this could cause problems with obtaining the inverse. I would bet that the gradient-descent method is closer to the mark.

You should see which method is actually giving you the smallest error. That will indicate which method is struggling. I suspect that the normal equation method is the troubled solution, because if X is ill-conditioned then you can have some problems there.

You should probably replace your normal equation solution with theta = X\y, which will use a QR-decomposition method to solve it.
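A minimal sketch of that change applied to the normalEque from the question, keeping its interface:

function theta = normalEque(X, y)
    X = [ones(size(X, 1), 1), X];
    theta = X \ y;    % QR-based least-squares solve, no explicit (pseudo)inverse
end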
