Normal equation in linear regression returns theta coefficients as 'NaN'

Question

I am trying to do linear regression using the normal equation method. My data has n = 143 features and m = 13000 training examples. I know that the normal equation method is not recommended when the number of features is greater than 10000, but I have only 143 features. My code returns 'nan' for my array of thetas (linear coefficients).

The data in my csv file has no headers. It looks like this (only the first 15 training examples, and without the column of ones yet):

2;1;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;3;0;1;0;0;0;0;0;1986;9;1;16;5;1;1.65;1;0;0;0;4;2;1;0;0;0;1;1;0;0;0;0;2.8;1;0;15000
2;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;6;0;0;1;0;0;0;0;2006;8;0;23;5;2;1.65;1;0;0;0;2;2.23;1;0;0;0;1;1;0;0;0;0;2.79;1;0;12900
1;1;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;4;0;1;0;0;0;0;0;1987;6;0;29;6;2;1;0;1;0;0;2;1;0;1;0;0;2.12;0;1;0;0;0;2.8;3;0;23438
2;1;0;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0;1;0;0;0;2009;3;0;56;5;3;1;1;0;0;0;4;2;1;0;0;0;2;1;0;0;0;0;2.79;1;0;67000
1;1;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;10;0;1;0;0;0;0;0;1978;5;1;115;6;2;2;1;0;0;0;4;2;1;0;0;0;3;0;1;0;0;0;2.8;3;0;230000
3;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;6;0;0;1;0;0;0;0;2006;7;0;250;4.93;4;4;1;0;0;0;3.91;2.23;0;0;1;0;2.12;0;0;1;0;0;3;2;0;224000
1;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;8;0;0;1;0;0;0;0;2007;3;0;31;5;2;1;1;0;0;0;3.91;2.23;0;1;0;0;2.12;0;1;0;0;0;2.79;1;0;45000
1;1;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;4;0;1;0;0;0;0;0;1975;8;1;31;6;3;2;1;0;0;0;4;2;1;0;0;0;2;0;1;0;0;0;2.79;2;0;66000
1;1;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;5;0;0;0;1;0;0;0;1992;1;1;32;5;2.52;1.65;0;1;0;0;3.91;2.23;0;1;0;0;2.12;0;0;1;0;0;2.79;1;0;34000
1;1;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;3;0;0;1;0;0;0;0;2012;16;1;32;5;2;2;1;0;0;0;4;2;1;0;0;0;2;1;0;0;0;0;2.79;1;0;36000
2;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;8;0;1;0;0;0;0;0;1977;3;0;33;6;2;1.65;1;0;0;0;4;2.23;0;1;0;0;2.12;0;1;0;0;0;2.79;1;0;38000
2;1;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;8;0;0;1;0;0;0;0;2007;3;0;33;4.93;2;1;1;0;0;0;4;2.23;0;1;0;0;2.12;1;0;0;0;0;2.79;2;0;37000
1;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;8;0;1;0;0;0;0;0;1990;3;0;33;5;2;1;1;0;0;0;4;2;1;0;0;0;2;1;0;0;0;0;2.79;1;0;38000
2;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;8;0;0;1;0;0;0;0;2012;4;0;33;5;2;2;1;0;0;0;4;4;1;0;0;0;2;1;0;0;0;0;2.79;1;0;45000
3;1;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;7;0;0;0;0;1;0;0;1982;1;1;35;5;2;1.65;1;0;0;0;4;2.23;0;0;0;1;2;1;0;0;0;0;2.7;1;0;45000

Note: The data contains so many zeros and ones because I used dummy coding for some features. Some features have a fairly large number of classes.

Python code:

import pandas as pd
import numpy as np

path = 'DB2.csv'
data = pd.read_csv(path, header=None, delimiter=";")

# Prepend the intercept column of ones
data.insert(0, 'Ones', 1)

print(np.linalg.cond(data))
print(np.linalg.matrix_rank(data))

# The last column is the target y; everything before it is X
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]

# Normal equation:
xTx = X.T.dot(X)
XtX = np.linalg.inv(xTx)
XtX_xT = XtX.dot(X.T)
theta = XtX_xT.dot(y)

print(theta)

The formula used for the normal equation:

theta = (X^T*X)^(-1)*X^T*y

Output of the program (array of thetas):

[[ nan]
 [ nan]
 [ nan]
 [ nan]
 [ nan]
 ...
 [ nan]]

In the program I also tried to check the condition number of the matrix with this code:

print(np.linalg.cond(data))

This line of code also returned 'nan'.

And this line of code for checking the matrix rank:

print(np.linalg.matrix_rank(data))

returned 0.

I need some clarification of what is going on. I cannot figure out what is wrong and why I get nan.

Answer 1

Something to watch out for when using dummy/indicator variables, which might be happening here:

Including a constant vector plus full indicators (or multiple categories with full indicators) creates a rank-deficient data matrix.

Imagine you have a dummy variable for night, a dummy variable for day, a dummy variable for snowing, and a dummy variable for not snowing. Your data may be something like:

           I_day    I_night     I_snow     I_no_snow
obs 1:         1          0          1             0
obs 2:         0          1          1             0
obs 3:         1          0          0             1
obs 4:         0          1          0             1
etc...

A subtle but HORRIBLE error has been made: the data matrix is rank deficient! I_day + I_night is always a vector of 1s, and the same goes for I_snow + I_no_snow. We have linear dependence: I_day + I_night = I_snow + I_no_snow! The data matrix is rank 3, not rank 4, so X'*X will also be rank 3 (instead of 4) and cannot be inverted.

What to do:

If you include a constant in the data matrix X, then for each categorical variable you always need to leave the dummy for one category out of X. (The remaining dummies then indicate an effect relative to the left-out category.)

In this example, I could form my data matrix X as follows:
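The rank deficiency can be checked numerically. The following is a small sketch using only the four observations from the table above; the variable names are chosen here for illustration.

```python
import numpy as np

# Columns: I_day, I_night, I_snow, I_no_snow (the four observations shown above)
X = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
], dtype=float)

rank_X = np.linalg.matrix_rank(X)
rank_XtX = np.linalg.matrix_rank(X.T @ X)

# I_day + I_night and I_snow + I_no_snow are the same all-ones vector
dependence_holds = np.array_equal(X[:, 0] + X[:, 1], X[:, 2] + X[:, 3])

print(rank_X, rank_XtX, dependence_holds)
```

Both ranks come out as 3 even though the matrix has 4 columns, which is why inverting X'*X fails or gives meaningless values.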

           const    I_day     I_snow 
obs 1:         1        1          1
obs 2:         1        0          1
obs 3:         1        1          0
obs 4:         1        0          0
etc...
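As a rough sketch of why this works, the snippet below compares the rank of a matrix with a constant plus full dummies against the reduced matrix above, then solves the normal equation on the reduced one. The response values y are made up for illustration (chosen as 0 + 1*I_day + 2*I_snow) and are not from the question's data.

```python
import numpy as np

I_day = np.array([1, 0, 1, 0], dtype=float)
I_snow = np.array([1, 1, 0, 0], dtype=float)
const = np.ones(4)

# Constant + full dummies for both categories: 5 columns, linearly dependent
X_full = np.column_stack([const, I_day, 1 - I_day, I_snow, 1 - I_snow])
# Constant + one dummy per category (one level left out): 3 independent columns
X_reduced = np.column_stack([const, I_day, I_snow])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# Hypothetical response, for illustration only
y = np.array([3.0, 2.0, 1.0, 0.0])

# Normal equation on the full-rank matrix; solve() avoids forming an explicit inverse
theta = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)

print(rank_full, rank_reduced)
print(theta)
```

With pandas, `pd.get_dummies(..., drop_first=True)` produces this kind of reduced coding directly, although it drops the first level of each category rather than one you pick.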

If no constant is included, you can include full dummies for exactly one categorical variable.

The basic idea is that you should only have one constant vector in your data matrix. Full dummies for 2+ categories is like including 2+ constant vectors in your data matrix.

Answer 2

It would help to have the actual data to see what is really going on, but from what you describe, your data matrix X is ill-conditioned. Consequently, the condition estimate returns NaN and your rank is 0, so (X^T*X) cannot be inverted. To solve this, you need to regularize, i.e. compute

(X^T*X + lambda*1)^(-1)*X^T instead, where 1 is the identity matrix of appropriate dimensions and lambda is your regularization parameter.
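A minimal sketch of this regularized solve is shown below. The function name `ridge_normal_equation`, the toy data, and the value of `lam` are assumptions made for illustration. The sketch penalizes every coefficient equally (including any intercept column), and it assumes X and y contain no missing values.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Return theta solving (X^T X + lam * I) theta = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # solve() is used instead of an explicit inverse
    return np.linalg.solve(A, X.T @ y)

# Rank-deficient toy matrix (full dummies for two categories), illustration only
X = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
], dtype=float)
y = np.array([3.0, 2.0, 1.0, 0.0])

theta = ridge_normal_equation(X, y, lam=1e-3)
print(theta)
print(np.isfinite(theta).all())
```

For any lam > 0 the matrix X^T*X + lambda*1 is positive definite, so the system has a unique solution even when X^T*X alone is singular.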

Answer 1: 3, accepted, 2015-12-08 19:52:26
Answer 2: 1, 2015-12-08 18:18:48
