
sklearn LogisticRegression - plot displays too small coefficient

I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks too flat, i.e. the coefficient is too small. I would expect a probability over ninety percent for sepal length > 7:

[figure: the fitted probability curve]

Is this probability curve indeed wrong? If so, what might cause that in my code?

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math

from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()

#get relevant data: sepal lengths and 0/1 labels for the first two classes
lengths = data.data[:100, :1]
is_setosa = data.target[:100]

#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]

#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))         
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)

#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.show()

If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression , you will find a regularization parameter C that can be passed as an argument while training the logistic regression model.

C : float, default: 1.0. Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

Now, if you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with sharper transitions from 0 to 1 in the output (response) binary variable. Still larger values fit models with high variance (they try to model the training-data transition more closely, which I think is what you are expecting, so you may try setting C as high as 10 and plotting), but at the same time they are likely to run the risk of overfitting. The default value C=1, and values smaller than that, lead to high bias and are likely to underfit; here comes the famous bias-variance trade-off in machine learning.

You can always use techniques like cross-validation to choose the C value that is right for you (a minimal sketch of this appears after the figure below). The following code / figure shows the probability curves fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):

x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))

C = list(range(1, 11))
labels = [str(c) for c in C]  # build a list; map() returns an iterator in Python 3
for i in range(len(C)):
    lgs = LogisticRegression(C=C[i]) # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:, 1] # compute the probability of class 1 directly
    plt.plot(x_values, y_values, label=labels[i])

plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()

Predicted probabilities with models fitted with different values of C

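As an aside, a minimal sketch of choosing C by cross-validation (my addition, not part of the answer's code; it reuses lengths and is_setosa from the question) could use GridSearchCV:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search a small, hypothetical grid of C values with 5-fold cross-validation.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
search.fit(lengths, is_setosa)
print(search.best_params_)  # the C value that scores best across the folds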

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the logistic/sigmoid function. The decision function of logistic regression is a line.
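In one dimension, that line reduces to a single threshold: the sepal length at which m*x + b = 0 and the predicted probability is exactly 0.5. As a quick check (my sketch, not part of the original answer), reusing the m and b fitted in the question's code:

# The decision boundary is the x at which the linear predictor m*x + b is zero,
# which is where the sigmoid outputs exactly 0.5.
boundary = -b / m
print(boundary)  # sepal length at the 0.5-probability threshold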

Your probability graph looks flat because you have, in a sense, "zoomed in" too much.

If you look at the middle of a sigmoid function, it gets to be almost linear, as the second derivative gets to be almost 0 (see for example a wolfram alpha graph).

Please note that the values we are talking about are the results of -(m*x+b).
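To see this numerically, here is a small sketch (my addition) that evaluates the sigmoid's argument over the plotted range, reusing m, b, and x_values from the question's code; where m*x + b stays within roughly [-2, 2], the curve is close to its tangent line:

# Evaluate the linear predictor over the plotted x range; small magnitudes
# mean we are on the near-linear middle section of the sigmoid.
z = m * x_values + b
print(z.min(), z.max())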

When we reduce the limits of your graph, say by using x_values = np.linspace(4, 7, 100), we get something which looks like a line:

[figure: the near-linear middle section of the curve]

But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:

[figure: the full S-shaped sigmoid]
