
Logit estimator in `statsmodels` and `sklearn`

I'm pretty sure it's a feature, not a bug, but I would like to know if there is a way to make sklearn and statsmodels match in their logit estimates. A very simple example:

import numpy as np
import statsmodels.formula.api as sm
from sklearn.linear_model import LogisticRegression

np.random.seed(123)

n = 100
y = np.random.random_integers(0, 1, n)
x = np.random.random((n, 2))
# Constant term
x[:, 0] = 1.

The estimates with statsmodels:

sm_lgt = sm.Logit(y, x).fit()
    Optimization terminated successfully.
             Current function value: 0.675320
             Iterations 4
print(sm_lgt.params)
    [ 0.38442   -1.1429183]

And the estimates with sklearn:

sk_lgt = LogisticRegression(fit_intercept=False).fit(x, y)
print(sk_lgt.coef_)
    [[ 0.16546794 -0.72637982]]

I think it has to do with the implementation in sklearn, which uses some sort of regularization. Is there an option to estimate a barebones logit as in statsmodels (it's substantially faster and scales much more nicely)? Also, does sklearn provide inference (standard errors) or marginal effects?

Is there an option to estimate a barebones logit as in statsmodels

You can set the C (inverse regularization strength) parameter to an arbitrarily high constant, as long as it's finite:

>>> sk_lgt = LogisticRegression(fit_intercept=False, C=1e9).fit(x, y)
>>> print(sk_lgt.coef_)
[[ 0.38440594 -1.14287175]]

Turning the regularization off is impossible because this is not supported by the underlying solver, Liblinear.
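Just to sanity-check: a minimal sketch (reusing x, y and the imports from the question; the tolerance is arbitrary) comparing the large-C sklearn fit against the plain statsmodels fit:

sk_lgt = LogisticRegression(fit_intercept=False, C=1e9).fit(x, y)
sm_lgt = sm.Logit(y, x).fit(disp=0)  # disp=0 just silences the convergence message
# With the penalty made negligible, the two coefficient vectors agree up to solver tolerance
print(np.allclose(sk_lgt.coef_.ravel(), sm_lgt.params, atol=1e-3))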

Also, does sklearn provide inference (standard errors) or marginal effects?

No. There's a proposal to add this, but it's not in the master codebase yet.
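That said, the statsmodels results object from the question already carries this information; a minimal sketch using the sm_lgt fit from above:

print(sm_lgt.bse)        # standard errors of the coefficients
print(sm_lgt.summary())  # coefficients, std errors, z-scores, p-values, confidence intervals
print(sm_lgt.get_margeff().summary())  # average marginal effects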

As an additional note, I was struggling with differences in results when my design matrix was collinear. Obviously this means some additional preprocessing is needed to get reliable results, but I was still hoping to find out why I got a result with sklearn while statsmodels errored out.

Short answer: setting method='bfgs' when calling fit in statsmodels gives almost identical results to the sklearn model, even in the case of collinear variables (once one accounts for the fact that statsmodels does not add an intercept by default, while sklearn fits one by default).

Example (adapted from a similar question on OLS):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

np.random.seed(237)
num_samples = 1000
X = np.random.random((num_samples, 2))
X[:, 1] = 2 * X[:, 0]      # second column is perfectly collinear with the first
X_sm = sm.add_constant(X)  # statsmodels needs the constant added explicitly

beta = [1, -2, .5]
error = np.random.random(num_samples)
y = np.round(1 / (1 + np.exp(-(np.dot(X_sm, beta)) + error)))  # y = 1/(1+exp(-beta*x))

lr = LogisticRegression(C=1e9).fit(X, y)

print("sklearn:")
print(lr.intercept_)
print(lr.coef_)

print("statsmodels:")
print(sm.Logit(y, X_sm).fit(method='bfgs').params)  # method='nm' or the default method errors out

(PS: If anyone has comments on the math behind these two solvers and the reliability of the results, I would love to hear them! I find it interesting that sklearn doesn't even throw a warning for this...)
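On the missing warning: a quick, crude check of the design matrix (a sketch; the interpretation thresholds are my own) makes the collinearity visible before fitting:

# Detect (near-)perfect collinearity in the design matrix from the example above
rank = np.linalg.matrix_rank(X_sm)
cond = np.linalg.cond(X_sm)
print("columns:", X_sm.shape[1], "rank:", rank)  # rank < number of columns => perfectly collinear
print("condition number:", cond)                 # very large => near-singular, unstable estimates
if rank < X_sm.shape[1]:
    print("Design matrix is rank-deficient; drop or combine the collinear columns.")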
