Difference between statsmodels OLS and scikit-learn linear regression

I tried to practice a linear regression model with the iris dataset.

from sklearn import datasets
import seaborn as sns
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# load iris data
train = sns.load_dataset('iris')
train

# one-hot-encoding
species_encoded = pd.get_dummies(train["species"], prefix = "species")
species_encoded

train = pd.concat([train, species_encoded], axis = 1)
train

# Split by feature and target
feature = ["sepal_length", "petal_length", "species_setosa", "species_versicolor", "species_virginica"]
target  = ["petal_width"]

X_train = train[feature]
y_train = train[target]

Case 1: statsmodels

# model
X_train_constant = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_constant).fit() 
print("const : {:.6f}".format(model.params[0]))
print(model.params[1:])

Result:
const : 0.253251
sepal_length          -0.001693
petal_length           0.231921
species_setosa        -0.337843
species_versicolor     0.094816
species_virginica      0.496278

Case 2: scikit-learn

# model                          
model = LinearRegression()
model.fit(X_train, y_train)
print("const : {:.6f}".format(model.intercept_[0]))
print(pd.Series(model.coef_[0], model.feature_names_in_))

Result:
const : 0.337668
sepal_length          -0.001693
petal_length           0.231921
species_setosa        -0.422260
species_versicolor     0.010399
species_virginica      0.411861

Why are the results of statsmodels and sklearn different?

Additionally, the results of the two models are the same except for the constant and the one-hot-encoded features.

You included a full set of one-hot encoded dummies as regressors. Their sum is a column of ones, i.e. exactly the constant column, so you have perfect multicollinearity: your covariance matrix is singular and you can't take its inverse.
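A quick check makes the rank deficiency visible (a minimal sketch, assuming the X_train and X_train_constant variables from the question are in scope):

import numpy as np

# The three dummy columns sum to 1 in every row -- exactly the constant column
dummies = X_train[["species_setosa", "species_versicolor", "species_virginica"]]
print(dummies.sum(axis=1).unique())  # [1]

# Consequently the design matrix has 6 columns but only rank 5
X = X_train_constant.to_numpy(dtype=float)
print(X.shape[1], np.linalg.matrix_rank(X))  # 6 5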

Under the hood both statsmodels and sklearn rely on the Moore-Penrose pseudoinverse and can handle singular matrices just fine; the problem is that the coefficients obtained in the singular covariance matrix case don't mean anything in any physical sense. The implementations differ a bit between the packages (sklearn relies on scipy.linalg.lstsq, while statsmodels has a custom procedure, statsmodels.tools.pinv_extended, which is basically numpy.linalg.svd with minimal changes), so at the end of the day they both display «nonsense» (since no meaningful coefficients can be obtained); it's just a design choice of what kind of «nonsense» to display.
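Both «design choices» can be reproduced by hand. The sketch below reflects how the two libraries behave for dense input, not any documented contract: statsmodels returns the minimum-norm solution of the full system, while sklearn centers the data, solves the centered system, and recovers the intercept afterwards.

import numpy as np
from scipy.linalg import lstsq

X = X_train_constant.to_numpy(dtype=float)  # constant + 5 features
y = y_train.to_numpy(dtype=float).ravel()

# statsmodels: minimum-norm least squares over all columns, constant included
print(np.linalg.pinv(X) @ y)  # matches model.params from case 1

# sklearn: center X and y, solve the centered system, back out the intercept
Xf = X_train.to_numpy(dtype=float)
coef, *_ = lstsq(Xf - Xf.mean(axis=0), y - y.mean())
intercept = y.mean() - Xf.mean(axis=0) @ coef
print(intercept, coef)  # matches model.intercept_ and model.coef_ from case 2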

If you take the sum of the coefficients of the one-hot encoded dummies, you can see that for statsmodels it is equal to the constant, while for sklearn it is equal to 0, and sklearn's intercept differs from the statsmodels constant. The coefficients of the variables that are not «responsible» for the perfect multicollinearity are unaffected.
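This pattern follows from the minimum-norm property: the solution is orthogonal to the null space of the design matrix, which is spanned by «constant minus the sum of the dummies» in the statsmodels setup and by «the sum of the centered dummies» in the sklearn setup. If you want coefficients that are actually identifiable (and identical across the two libraries), the usual remedy for the dummy variable trap is to drop one dummy level, e.g. via the drop_first option of pd.get_dummies. A minimal sketch:

# Re-encode with one dummy level dropped so the design matrix has full rank
train2 = sns.load_dataset("iris")
encoded = pd.get_dummies(train2["species"], prefix="species", drop_first=True, dtype=float)
train2 = pd.concat([train2, encoded], axis=1)

features = ["sepal_length", "petal_length", "species_versicolor", "species_virginica"]
X2, y2 = train2[features], train2["petal_width"]

ols = sm.OLS(y2, sm.add_constant(X2)).fit()
lr = LinearRegression().fit(X2, y2)

print(ols.params)                # now both libraries agree on every coefficient
print(lr.intercept_, lr.coef_)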
