I am doing logistic regression on a boolean 0/1 dataset (predicting the probability that a given age yields a salary over some threshold), and I am getting very different results from sklearn and StatsModels, where sklearn's answer is clearly wrong.
I have set the sklearn penalty to None and the intercept term to False to make the model match StatsModels more closely, but I can't get sklearn to give a sensible answer.
The grey lines are the original data points at 0 or 1; I just scaled the 1s down to 0.1 on the plot so they are visible.
Variables:
# X and Y
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = df.age.values.reshape(-1, 1)
X_poly = PolynomialFeatures(degree=4).fit_transform(X)
y_bool = np.array(df.wage.values > 250, dtype="int")

# Generate a sequence of ages for prediction
age_grid = np.arange(X.min(), X.max()).reshape(-1, 1)
age_grid_poly = PolynomialFeatures(degree=4).fit_transform(age_grid)
Code is the following:
# sklearn Model
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty=None, fit_intercept=False, max_iter=300).fit(X=X_poly, y=y_bool)
preds = clf.predict_proba(age_grid_poly)

# Plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X, y_bool / 10, s=30, c='grey', marker='|', alpha=0.7)
plt.plot(age_grid, preds[:, 1], color='r', alpha=1)
plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()
# StatsModels
import statsmodels.api as sm

log_reg = sm.Logit(y_bool, X_poly).fit()
preds = log_reg.predict(age_grid_poly)

# Plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X, y_bool / 10, s=30, c='grey', marker='|', alpha=0.7)
plt.plot(age_grid, preds, color='r', alpha=1)
plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()
I couldn't reproduce your results exactly, since I don't have the dataset or the specific versions of scikit-learn and statsmodels. However, I don't think you successfully removed the regularization in your code. The documentation states that you should pass the string 'none', not the constant None.
Please refer to the sklearn.linear_model.LogisticRegression documentation:
penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
    Used to specify the norm used in the penalization. The 'newton-cg',
    'sag' and 'lbfgs' solvers support only l2 penalties. 'elasticnet' is
    only supported by the 'saga' solver. If 'none' (not supported by the
    liblinear solver), no regularization is applied.
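For illustration, a minimal sketch of the corrected call (reusing the X_poly and y_bool from the question; note that on scikit-learn versions before 1.2 regularization is disabled with the string 'none', while newer releases accept penalty=None instead):

from sklearn.linear_model import LogisticRegression

clf_unreg = LogisticRegression(
    penalty='none',       # the string 'none', not the constant None
    fit_intercept=False,  # X_poly already carries the bias column
    max_iter=300,
).fit(X_poly, y_bool)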
I think it is easier to understand the difference by inspecting the coefficients rather than a plot. You can access them directly via the coef_ attribute of the scikit-learn model and the params attribute of the statsmodels model.
Logically, you should expect the scikit-learn coefficients to be shrunk toward zero if the regularization is not properly disabled.
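A quick check (reusing the clf and log_reg objects fitted in the question):

# Compare the fitted coefficients of the two models directly.
print(clf.coef_)       # scikit-learn: array of shape (1, n_features)
print(log_reg.params)  # statsmodels: one value per column of X_poly
# With regularization still active, the scikit-learn values will be
# visibly shrunk toward zero compared to the statsmodels ones.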
This appears to be because sklearn's implementation is very scale-sensitive (and the polynomial terms get quite large). By scaling the data first, I get qualitatively the same result as StatsModels.
# sklearn Model
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = Pipeline([
    ('scale', StandardScaler()),
    ('lr', LogisticRegression(penalty='none', fit_intercept=True, max_iter=1000)),
]).fit(X=X_poly, y=y_bool)
preds = clf.predict_proba(age_grid_poly)

# Plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X, y_bool / 10, s=30, c='grey', marker='|', alpha=0.7)
plt.plot(age_grid, preds[:, 1], color='r', alpha=1)
plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()
Note that we need to set fit_intercept=True in this case, because StandardScaler kills the constant column (making it all zeros) coming from PolynomialFeatures.
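You can see this on a toy example (a minimal sketch, not from the original post):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# PolynomialFeatures prepends a constant bias column of ones ...
toy = PolynomialFeatures(degree=2).fit_transform(np.array([[1.0], [2.0], [3.0]]))
print(toy[:, 0])                                  # [1. 1. 1.]
# ... which StandardScaler centers to all zeros, removing the intercept.
print(StandardScaler().fit_transform(toy)[:, 0])  # [0. 0. 0.]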