
Confusing results with logistic regression in Python

I'm doing logistic regression in Python with this example from Wikipedia (link to example).

Here's the code I have:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5], [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]] # number of hours spent studying
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1] # 0=failed, 1=pass

lr.fit(Z,y)

The results for this are:

lr.coef_
array([[ 0.61126347]])

lr.intercept_
array([-1.36550178])

while they get 1.5046 for the hours coefficient and -4.0777 for the intercept. Why are the results so different? Their prediction for 1 hour of study is a 0.07 probability of passing, while I get 0.32 with this model. These are drastically different results.

The "problem" is that LogisticRegression in scikit-learn uses L2-regularization (aka Tikhonov regularization, aka Ridge, aka normal prior). Please read sklearn user guide about logistic regression for implementational details.

In practice, this means that LogisticRegression has a parameter C, which by default equals 1. The smaller C is, the stronger the regularization: coef_ grows smaller and intercept_ larger, which increases numerical stability and reduces overfitting.

If you set C very large, the effect of regularization will vanish. With

lr = LogisticRegression(C=100500000)

you get coef_ and intercept_ respectively

[[ 1.50464535]]
[-4.07771322]

just like in the Wikipedia article.
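To check this against the article end to end, you can also compare the predicted pass probability after 1 hour of study. A minimal sketch, reusing the data from the question (the huge C value is just the same trick as above to make regularization negligible):

from sklearn.linear_model import LogisticRegression

Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5],
     [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

# effectively unregularized fit (very large C)
lr = LogisticRegression(C=100500000)
lr.fit(Z, y)

# predict_proba returns [[P(fail), P(pass)]] for each row;
# for 1 hour of study P(pass) comes out near 0.07, matching the article
print(lr.predict_proba([[1.0]]))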

Some more theory. Overfitting is a problem that arises when there are lots of features but not many examples. A simple rule of thumb: use a small C if n_obs/n_features is less than 10. In the wiki example there is one feature and 20 observations, so simple logistic regression would not overfit even with a large C.

Another use case for a small C is convergence problems. They may happen if the positive and negative examples can be perfectly separated, or in the case of multicollinearity (which again is more likely when n_obs/n_features is small), and they lead to infinite growth of the coefficients in the non-regularized case.
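A toy illustration of that separation effect (made-up data, not from the question): with perfectly separable classes, making C huge lets the coefficient grow essentially without bound, while the default C=1 keeps it moderate.

from sklearn.linear_model import LogisticRegression

# made-up, perfectly separable data: everything below 2.5 fails, everything above passes
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]

for C in (1.0, 1e6):
    lr = LogisticRegression(C=C, max_iter=10000)
    lr.fit(X, y)
    print(C, lr.coef_, lr.intercept_)

# with C=1 the coefficient stays moderate; with C=1e6 it is far larger,
# because almost nothing stops it from growing toward the perfect-separation limit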

I think the problem arises from the fact that you have

Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5], [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]]

but instead it should be

Z = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25 ...]

Try this
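Note that if Z is kept as a flat list of hours, scikit-learn's fit still expects a 2-D array of shape (n_samples, n_features), so it would need to be reshaped into a single column first. A minimal sketch using numpy (variable names taken from the question):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
         2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

# turn the flat list into a column: shape (20, 1)
Z = np.array(hours).reshape(-1, 1)

lr = LogisticRegression()
lr.fit(Z, y)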
