Comparing logistic regression in Scikit-learn (Python) and glm (R)

I am trying to compare logistic regression in R's glm (stats package) and Python's Scikit-learn. My dataset is the dataset.csv file read in the code below.

Here is the Python code:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("dataset.csv")
df = df.join(pd.get_dummies(df['var2'], prefix = 'var2', drop_first= True))
df.drop(columns = ['var2'], inplace = True)

X = df.loc[:,df.columns != 'y']
y = df.y

# Unpenalized fit; recent scikit-learn versions spell this penalty=None
model = LogisticRegression(fit_intercept=True, penalty='none')
model.fit(X, y)
prob = model.predict_proba(X)
model.coef_

Here are the coefficients:

var1, var3, var4, var2_B, var2_C
-1.833653e-07, 2.823982e-12, 2.568188e-12, -4.116901e-13, 5.514602e-14

And here is the corresponding R code:

library(readr)  # provides read_csv

df <- read_csv(file = "dataset.csv")
glm_fit <- glm(y ~ ., data = df, family = binomial(link = 'logit'))
summary(glm_fit)

Here are the coefficients:

(Intercept) -6.459e-01 
var1        -1.042e-07  
var2B       -7.731e-01  
var2C        1.880e+00  
var3        -1.124e-04  
var4         2.994e-03

It is easy to check that the matrix that goes into the solver is the same in both cases. As you can see, the coefficients are drastically different. Also, the ROC AUC in R comes out much better than in Python. I understand that different solvers are used, but the difference between the solutions seems too big. Is there a way to troubleshoot it?

This does seem to be a case of the lbfgs solver (the default in sklearn) failing to work well on unscaled input data. After scaling the inputs and then dividing the fitted coefficients by the scale factors, I recover essentially the same coefficients you reported from glm:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
model.fit(X_sc, y)
model.coef_ / scaler.scale_
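
The snippet above rescales only the slopes. The intercept needs a correction as well, because StandardScaler also subtracts the mean. Below is a minimal, self-contained sketch of the full back-transformation. It assumes synthetic data as a stand-in for dataset.csv (the feature scales, true coefficients, and sample size are made up for illustration), and it uses a very large C to approximate an unpenalized fit so that it does not depend on how a given scikit-learn version spells the penalty argument.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption): one feature on a
# very large scale, similar in spirit to var1, and one on a unit scale.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(0.0, 1e6, n),
    rng.normal(0.0, 1.0, n),
])
logit = -0.5 + X @ np.array([1e-6, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Fit on standardized inputs: z = (x - mean_) / scale_
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

# Very large C makes the default L2 penalty negligible (assumption: this is
# close enough to the unpenalized fit for the comparison).
model = LogisticRegression(C=1e10, max_iter=1000)
model.fit(X_sc, y)

# Map the fitted parameters back to the original feature scale:
#   b0 + sum_j b_j * (x_j - mean_j) / scale_j
# = (b0 - sum_j b_j * mean_j / scale_j) + sum_j (b_j / scale_j) * x_j
coef_orig = model.coef_[0] / scaler.scale_
intercept_orig = model.intercept_[0] - np.sum(
    model.coef_[0] * scaler.mean_ / scaler.scale_
)

# Sanity check: the back-transformed parameters applied to the raw inputs
# reproduce the probabilities of the model fitted on scaled inputs.
p_back = 1.0 / (1.0 + np.exp(-(intercept_orig + X @ coef_orig)))
print(np.allclose(p_back, model.predict_proba(X_sc)[:, 1]))
```

The values in coef_orig and intercept_orig are the ones to compare against the glm output, since glm reports coefficients on the original feature scale.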

The sag and saga solvers suffer the same fate, while newton-cg actually gets close but throws convergence warnings. Increasing the number of iterations just adds a warning about rounding errors preventing better convergence.
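
One way to run this kind of solver comparison yourself is sketched below. It is only an illustration under stated assumptions: the data are synthetic (not the original dataset.csv), a very large C stands in for an unpenalized fit, and max_iter=200 is an arbitrary choice. The script fits each solver on raw and on standardized inputs, converts the coefficients back to the original scale, and counts the warnings raised during fitting. The exact numbers and warnings will depend on the scikit-learn version and the data.

```python
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): one badly scaled feature and one unit-scale feature.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.normal(0.0, 1e6, n), rng.normal(0.0, 1.0, n)])
logit = -0.5 + X @ np.array([1e-6, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

scaler = StandardScaler().fit(X)
X_sc = scaler.transform(X)

results = {}
for solver in ["lbfgs", "newton-cg", "sag", "saga"]:
    # For raw inputs the coefficients are already on the original scale,
    # so the divisor is 1; for scaled inputs divide by scaler.scale_.
    for label, data, scale in [
        ("raw", X, np.ones(X.shape[1])),
        ("scaled", X_sc, scaler.scale_),
    ]:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")  # record every warning raised while fitting
            m = LogisticRegression(C=1e10, solver=solver, max_iter=200)
            m.fit(data, y)
        results[(solver, label)] = (m.coef_[0] / scale, len(caught))

for (solver, label), (coef, n_warn) in results.items():
    print(f"{solver:10s} {label:7s} coef={coef} warnings={n_warn}")
```

Comparing the "raw" and "scaled" rows for each solver shows how sensitive that solver is to feature scaling on a given dataset.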
