
sklearn LogisticRegression - plot displays too small coefficient

I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks too flat, i.e. the coefficient is too small. I would expect a probability over ninety percent for sepal length > 7:

[figure: the fitted probability curve]

Is this probability curve indeed wrong? If so, what might cause that in my code?

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math

from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()

#get relevant data: sepal lengths and 0/1 labels for the first two classes
lengths = data.data[:100, :1]
is_setosa = data.target[:100]

#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]

#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))         
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)

#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.show()

If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression , you will find a regularization parameter C that can be passed as an argument while training the logistic regression model.

C : float, default: 1.0. Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

Now, if you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with sharper transitions from 0 to 1 in the output (response) binary variable. Still larger values fit models with high variance (they try to model the training-data transition more closely, which I think is what you are expecting, so you may try setting C as high as 10 and plotting), but at the same time they are likely to run the risk of overfitting. The default value C=1, and values smaller than that, lead to high bias and are likely to underfit; here comes the famous bias-variance trade-off in machine learning.

You can always use techniques like cross-validation to choose the C value that is right for you (a minimal sketch of this appears after the figure below). The following code / figure shows the probability curves fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):

x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))

C = list(range(1, 11))
labels = [str(c) for c in C]  # build a list; map() returns an iterator in Python 3
for i in range(len(C)):
    lgs = LogisticRegression(C=C[i]) # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:, 1] # compute the probability of class 1 directly
    plt.plot(x_values, y_values, label=labels[i])

plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()

Predicted probabilities with models fitted with different values of C

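As an aside, a minimal sketch of choosing C by cross-validation (my addition, not part of the answer's code; it reuses lengths and is_setosa from the question) could use GridSearchCV:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search a small, hypothetical grid of C values with 5-fold cross-validation.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
search.fit(lengths, is_setosa)
print(search.best_params_)  # the C value that scores best across the folds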

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the logistic/sigmoid function. The decision function of logistic regression is a line.
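In one dimension, that line reduces to a single threshold: the sepal length at which m*x + b = 0 and the predicted probability is exactly 0.5. As a quick check (my sketch, not part of the original answer), reusing the m and b fitted in the question's code:

# The decision boundary is the x at which the linear predictor m*x + b is zero,
# which is where the sigmoid outputs exactly 0.5.
boundary = -b / m
print(boundary)  # sepal length at the 0.5-probability threshold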

Your probability graph looks flat because you have, in a sense, "zoomed in" too much.

If you look at the middle of a sigmoid function, it gets to be almost linear, as the second derivative gets to be almost 0 (see for example a wolfram alpha graph).

Please note that the values we are talking about are the results of -(m*x+b).
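To see this numerically, here is a small sketch (my addition) that evaluates the sigmoid's argument over the plotted range, reusing m, b, and x_values from the question's code; where m*x + b stays within roughly [-2, 2], the curve is close to its tangent line:

# Evaluate the linear predictor over the plotted x range; small magnitudes
# mean we are on the near-linear middle section of the sigmoid.
z = m * x_values + b
print(z.min(), z.max())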

When we reduce the limits of your graph, say by using x_values = np.linspace(4, 7, 100), we get something which looks like a line:

[figure: the near-linear middle section of the curve]

But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:

[figure: the full S-shaped sigmoid]
