sklearn LogisticRegression - plot displays too small coefficient
I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks like it is too flat, i.e. the coefficient is too small. I would expect a probability over ninety percent for sepal length > 7:
Is this probability curve indeed wrong? If so, what might cause that in my code?
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
data = datasets.load_iris()
# get relevant data
lengths = data.data[:100, :1]
is_setosa = data.target[:100]
#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]
#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)
#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.show()
If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression, you will find a regularization parameter C that can be passed as an argument while training the logistic regression model.

C : float, default: 1.0. Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
Now, if you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with sharper transitions from 0 to 1 in the output (response) binary variable. Still larger values fit models with high variance (they try to model the training-data transition more closely; I think that is what you are expecting, so you may try setting the C value as high as 10 and plotting), but at the same time they run the risk of overfitting. The default value C=1, and values smaller than that, lead to high bias and are likely to underfit; here comes the famous bias-variance trade-off in machine learning.
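To see this effect on the fitted coefficient itself (the "too small" slope from the question), here is a minimal sketch of my own, reusing lengths and is_setosa from the question's code; the exact numbers will vary with your scikit-learn version:

from sklearn.linear_model import LogisticRegression

# Weaker regularization (larger C) lets the coefficient grow,
# which steepens the sigmoid transition.
for c in (0.01, 1, 10, 100):
    lgs = LogisticRegression(C=c)
    lgs.fit(lengths, is_setosa)
    print(c, lgs.coef_[0, 0], lgs.intercept_[0])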
You can always use techniques like cross-validation to choose the C value that is right for you. The following code / figure shows the probability curves fitted by models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):
x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))
C = list(range(1, 11))
labels = list(map(str, C)) # materialize so labels[i] works on Python 3
for i in range(len(C)):
    lgs = LogisticRegression(C=C[i]) # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:, 1] # use this function to compute probability directly
    plt.plot(x_values, y_values, label=labels[i])
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()
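As a sketch of the cross-validation idea mentioned above (my own illustration, not part of the original code), scikit-learn's LogisticRegressionCV can search a grid of C values for you:

from sklearn.linear_model import LogisticRegressionCV

# 5-fold cross-validation over the same grid of C values;
# the best-scoring C is selected automatically.
lgs_cv = LogisticRegressionCV(Cs=[float(c) for c in range(1, 11)], cv=5)
lgs_cv.fit(lengths, is_setosa)
print(lgs_cv.C_) # array with the chosen C for each class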
[Figure: predicted probabilities of models fitted with different C values]

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the logistic/sigmoid function. The decision function of logistic regression is a line.
Your probability graph looks flat because you have, in a sense, "zoomed in" too much.
If you look at the middle of a sigmoid function, it gets to be almost linear, as the second derivative gets to be almost 0 (see for example a Wolfram Alpha graph).
Please note that the values we are talking about are the results of -(m*x+b).
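As a small numeric check (my own illustration, not from the answer), the second derivative of the sigmoid s(z) = 1/(1+e^(-z)) is s(1-s)(1-2s), which vanishes at the midpoint z = 0:

import numpy as np

# Evaluate the sigmoid's second derivative near the middle of the curve;
# it is close to 0 there, which is why the curve looks almost linear.
z = np.linspace(-1, 1, 5)
s = 1 / (1 + np.exp(-z))
print(s * (1 - s) * (1 - 2 * s))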
When we reduce the limits of your graph, say by using x_values = np.linspace(4, 7, 100), we get something which looks like a line:
But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:
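For completeness, a minimal sketch (reusing m, b and lgs_curve from the question's code) that reproduces both views:

# Zoomed-in view: the middle of the sigmoid looks like a line.
x_narrow = np.linspace(4, 7, 100)
plt.plot(x_narrow, lgs_curve(x_narrow))
plt.show()

# Zoomed-out view: the full S-shape becomes visible.
x_wide = np.linspace(-10, 20, 100)
plt.plot(x_wide, lgs_curve(x_wide))
plt.show()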