
Difference between statsmodels OLS and scikit-learn linear regression

I tried to practice a linear regression model with the iris dataset.

from sklearn import datasets
import seaborn as sns
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# load iris data
train = sns.load_dataset('iris')
train

# one-hot-encoding
species_encoded = pd.get_dummies(train["species"], prefix = "species")
species_encoded

train = pd.concat([train, species_encoded], axis = 1)
train

# Split by feature and target
feature = ["sepal_length", "petal_length", "species_setosa", "species_versicolor", "species_virginica"]
target  = ["petal_width"]

X_train = train[feature]
y_train = train[target]

Case 1: statsmodels

# model
X_train_constant = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_constant).fit() 
print("const : {:.6f}".format(model.params[0]))
print(model.params[1:])

Result:
const : 0.253251
sepal_length         -0.001693
petal_length          0.231921
species_setosa       -0.337843
species_versicolor    0.094816
species_virginica     0.496278

Case 2: scikit-learn

# model                          
model = LinearRegression()
model.fit(X_train, y_train)
print("const : {:.6f}".format(model.intercept_[0]))
print(pd.Series(model.coef_[0], model.feature_names_in_))

Result:
const : 0.337668
sepal_length         -0.001693
petal_length          0.231921
species_setosa       -0.422260
species_versicolor    0.010399
species_virginica     0.411861

Why are the results of statsmodels and sklearn different?

Additionally, the two models give the same coefficients except for the constant and some or all of the one-hot-encoded features.

You included a full set of one-hot encoded dummies as regressors. Their sum is a column of ones, identical to the constant column, so you have perfect multicollinearity: your covariance matrix is singular and you can't take its inverse.
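For instance, a minimal check (a sketch reusing the X_train_constant built in the question; numpy is an extra import):

import numpy as np

# the three dummy columns sum to 1 in every row, i.e. to the constant column
dummies = ["species_setosa", "species_versicolor", "species_virginica"]
print((X_train_constant[dummies].sum(axis=1) == X_train_constant["const"]).all())  # True

# hence the design matrix is rank-deficient: rank 5 despite having 6 columns
X = np.asarray(X_train_constant, dtype=float)
print(np.linalg.matrix_rank(X))  # 5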

Under the hood both statsmodels and sklearn rely on the Moore-Penrose pseudoinverse and can invert singular matrices just fine; the problem is that the coefficients obtained in the singular-covariance case don't mean anything in any physical sense. The implementations differ a bit between packages (sklearn relies on scipy.linalg.lstsq, while statsmodels has a custom procedure, statsmodels.tools.pinv_extended, which is basically numpy.linalg.svd with minimal changes), so at the end of the day they both display «nonsense» (since no meaningful coefficients can be obtained); it's just a design choice of what kind of «nonsense» to display.
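Both fits are therefore minimum-norm least-squares solutions, just of different systems: statsmodels pseudo-inverts the full design matrix including the constant, while sklearn first centers X and y, solves the centered system, and recovers the intercept from the means. A sketch reproducing both results with plain numpy (it mirrors what the packages do internally, up to the exact SVD routine, rather than calling their private code):

import numpy as np

X = np.asarray(X_train_constant, dtype=float)
y = np.asarray(y_train, dtype=float)

# statsmodels: minimum-norm solution for the design matrix with the constant
beta_sm = np.linalg.pinv(X) @ y                      # matches model.params in case 1

# sklearn: center X and y, solve, then back the intercept out of the means
Xf = np.asarray(X_train, dtype=float)
coef_sk = np.linalg.pinv(Xf - Xf.mean(axis=0)) @ (y - y.mean())
intercept_sk = y.mean() - Xf.mean(axis=0) @ coef_sk  # matches intercept_ in case 2

The minimum-norm solution is orthogonal to the null space of the matrix being inverted, which is what ties the dummy coefficients to the constant in one package and forces their sum to 0 in the other, as described below.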

If you take the sum of the coefficients of the one-hot encoded dummies, you can see that for statsmodels it equals the constant, while for sklearn it equals 0 and the constant itself differs from the statsmodels constant. The coefficients of the variables that are not «responsible» for the perfect multicollinearity are unaffected.
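The standard remedy is to drop one dummy level (e.g. pd.get_dummies(..., drop_first=True)), so the remaining dummies are no longer collinear with the constant. With a full-rank design matrix both packages agree on every coefficient; a sketch reusing the question's data, with setosa as the baseline category:

feature_reduced = ["sepal_length", "petal_length", "species_versicolor", "species_virginica"]
X_reduced = train[feature_reduced].astype(float)  # astype avoids mixed-dtype issues

print(sm.OLS(y_train, sm.add_constant(X_reduced)).fit().params)
print(LinearRegression().fit(X_reduced, y_train).coef_)  # the slopes now coincide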
