
Scikit Learn Logistic Regression confusion

I'm having some trouble understanding scikit-learn's LogisticRegression() method. Here's a simple example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Create a sample dataframe
data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns=data.pop(0)
df = pd.DataFrame(data=data, columns=columns)

   Age  ZepplinFan
0   13           0
1   25           0
2   40           1
3   51           0
4   55           1
5   58           1

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(X=df[['Age']], y = df['ZepplinFan'])

# View the coefficients
lr.intercept_ # returns -0.56333276
lr.coef_ # returns 0.02368826

# Predict for new values
xvals = np.arange(-10,70,1)
predictions = lr.predict_proba(X=xvals[:,np.newaxis])
probs = predictions[:, 1]  # probability of class 1

# Plot the fitted model
plt.plot(xvals, probs)
plt.scatter(df.Age.values, df.ZepplinFan.values)
plt.show()

[plot: the fitted probability curve over the data]

Obviously this doesn't appear to be a good fit. Furthermore, when I do this exercise in R I get different coefficients and a model that makes more sense.

lapply(c("data.table","ggplot2"), require, character.only=T)
dt <- data.table(Age=c(13, 25, 40, 51, 55, 58), ZepplinFan=c(0, 0, 1, 0, 1, 1))
mylogit <- glm(ZepplinFan ~ Age, data = dt, family = "binomial")
newdata <- data.table(Age=seq(10,70,1))
newdata[, ZepplinFan:=predict(mylogit, newdata=newdata, type="response")]

mylogit$coeff
(Intercept)         Age 
    -4.8434      0.1148 

ggplot()+geom_point(data=dt, aes(x=Age, y=ZepplinFan))+geom_line(data=newdata, aes(x=Age, y=ZepplinFan))

[plot: the R glm fit, which follows the data much more closely]

What am I missing here?

The problem you are facing is that scikit-learn uses regularized logistic regression by default (an L2 penalty on the coefficients). The regularization term controls the trade-off between fitting the training data and generalizing to future unknown data. The parameter C is the inverse of the regularization strength, so in your case:

lr = LogisticRegression(C=100)

will generate what you are looking for:

[plot: with C=100 the fitted curve matches the R result]
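To see the effect concretely, you can refit the same data with increasing C and watch the coefficients approach the unregularized values that R's glm reports (intercept ≈ -4.8434, slope ≈ 0.1148). A minimal sketch, assuming current scikit-learn defaults (C=1.0; max_iter raised only to guarantee convergence):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"Age": [13, 25, 40, 51, 55, 58],
                   "ZepplinFan": [0, 0, 1, 0, 1, 1]})

# C is the inverse of the regularization strength: larger C -> weaker penalty.
coefs = {}
for C in [1.0, 100.0, 1e6]:
    lr = LogisticRegression(C=C, max_iter=1000)
    lr.fit(df[["Age"]], df["ZepplinFan"])
    coefs[C] = (lr.intercept_[0], lr.coef_[0, 0])
    print(f"C={C:>9}: intercept={coefs[C][0]:.4f}, slope={coefs[C][1]:.4f}")
```

With a very large C the penalty becomes negligible and the fit essentially reproduces R's unpenalized glm coefficients.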

As you have discovered, changing the value of the intercept_scaling parameter achieves a similar effect. The reason is again regularization, or rather how it affects estimation of the bias (intercept) in the regression: a larger intercept_scaling effectively reduces the impact of regularization on the bias.
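A sketch of that effect (note the assumption here: intercept_scaling is only used by the liblinear solver, which appends the intercept as a synthetic feature of value intercept_scaling and penalizes its weight along with the others; liblinear was scikit-learn's default solver at the time of this post):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"Age": [13, 25, 40, 51, 55, 58],
                   "ZepplinFan": [0, 0, 1, 0, 1, 1]})

# With a large intercept_scaling the bias can grow large while the
# penalized synthetic-feature weight behind it stays small, so the
# regularization barely constrains the intercept.
intercepts = {}
for s in [1.0, 100.0]:
    lr = LogisticRegression(solver="liblinear", intercept_scaling=s)
    lr.fit(df[["Age"]], df["ZepplinFan"])
    intercepts[s] = lr.intercept_[0]
    print(f"intercept_scaling={s}: intercept={intercepts[s]:.4f}")
```

With intercept_scaling=1 the penalized bias stays near zero, which in turn forces the slope toward zero; freeing the bias lets the fit move toward the unregularized solution.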

For more information about the implementation of logistic regression and the solvers used by scikit-learn, see: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
