
Scikit Learn Logistic Regression confusion

I'm having some trouble understanding scikit-learn's LogisticRegression() method. Here's a simple example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Create a sample dataframe
data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns=data.pop(0)
df = pd.DataFrame(data=data, columns=columns)

   Age  ZepplinFan
0   13           0
1   25           0
2   40           1
3   51           0
4   55           1
5   58           1

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(X=df[['Age']], y = df['ZepplinFan'])

# View the coefficients
lr.intercept_ # returns -0.56333276
lr.coef_ # returns 0.02368826

# Predict for new values
xvals = np.arange(-10,70,1)
predictions = lr.predict_proba(X=xvals[:,np.newaxis])
probs = predictions[:, 1]  # probability of class 1

# Plot the fitted model
plt.plot(xvals, probs)
plt.scatter(df.Age.values, df.ZepplinFan.values)
plt.show()

[plot: the fitted probability curve over the data]

Obviously this doesn't appear to be a good fit. Furthermore, when I do this exercise in R I get different coefficients and a model that makes more sense.

lapply(c("data.table","ggplot2"), require, character.only=T)
dt <- data.table(Age=c(13, 25, 40, 51, 55, 58), ZepplinFan=c(0, 0, 1, 0, 1, 1))
mylogit <- glm(ZepplinFan ~ Age, data = dt, family = "binomial")
newdata <- data.table(Age=seq(10,70,1))
newdata[, ZepplinFan:=predict(mylogit, newdata=newdata, type="response")]

mylogit$coeff
(Intercept)         Age 
    -4.8434      0.1148 

ggplot()+geom_point(data=dt, aes(x=Age, y=ZepplinFan))+geom_line(data=newdata, aes(x=Age, y=ZepplinFan))

[plot: the R glm fit, which follows the data much more closely]

What am I missing here?

The problem you are facing is that scikit-learn uses regularized logistic regression by default (an L2 penalty on the coefficients). The regularization term controls the trade-off between fitting the training data and generalizing to future unknown data. The parameter C is the inverse of the regularization strength, so in your case:

lr = LogisticRegression(C=100)

will generate what you are looking for:

[plot: with C=100 the fitted curve matches the R result]
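To see the effect concretely, you can refit the same data with increasing C and watch the coefficients approach the unregularized values that R's glm reports (intercept ≈ -4.8434, slope ≈ 0.1148). A minimal sketch, assuming current scikit-learn defaults (C=1.0; max_iter raised only to guarantee convergence):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"Age": [13, 25, 40, 51, 55, 58],
                   "ZepplinFan": [0, 0, 1, 0, 1, 1]})

# C is the inverse of the regularization strength: larger C -> weaker penalty.
coefs = {}
for C in [1.0, 100.0, 1e6]:
    lr = LogisticRegression(C=C, max_iter=1000)
    lr.fit(df[["Age"]], df["ZepplinFan"])
    coefs[C] = (lr.intercept_[0], lr.coef_[0, 0])
    print(f"C={C:>9}: intercept={coefs[C][0]:.4f}, slope={coefs[C][1]:.4f}")
```

With a very large C the penalty becomes negligible and the fit essentially reproduces R's unpenalized glm coefficients.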

As you have discovered, changing the value of the intercept_scaling parameter achieves a similar effect. The reason is again regularization, or rather how it affects estimation of the bias (intercept) in the regression: a larger intercept_scaling effectively reduces the impact of regularization on the bias.
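A sketch of that effect (note the assumption here: intercept_scaling is only used by the liblinear solver, which appends the intercept as a synthetic feature of value intercept_scaling and penalizes its weight along with the others; liblinear was scikit-learn's default solver at the time of this post):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"Age": [13, 25, 40, 51, 55, 58],
                   "ZepplinFan": [0, 0, 1, 0, 1, 1]})

# With a large intercept_scaling the bias can grow large while the
# penalized synthetic-feature weight behind it stays small, so the
# regularization barely constrains the intercept.
intercepts = {}
for s in [1.0, 100.0]:
    lr = LogisticRegression(solver="liblinear", intercept_scaling=s)
    lr.fit(df[["Age"]], df["ZepplinFan"])
    intercepts[s] = lr.intercept_[0]
    print(f"intercept_scaling={s}: intercept={intercepts[s]:.4f}")
```

With intercept_scaling=1 the penalized bias stays near zero, which in turn forces the slope toward zero; freeing the bias lets the fit move toward the unregularized solution.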

For more information about the implementation of logistic regression and the solvers used by scikit-learn, see: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
