sklearn LogisticRegression - plot displays too small coefficient

I am attempting to fit a logistic regression model to sklearn's iris dataset. The probability curve I get looks too flat, i.e. the coefficient seems too small. I would expect a probability above ninety percent by sepal length > 7:

[image: plot of the fitted probability curve]

Is this probability curve indeed wrong? If so, what might cause that in my code?

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math

from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()

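# get relevant data: the first 100 rows hold two classes, setosa (target 0) and versicolor (target 1)
lengths = data.data[:100, :1]
is_setosa = data.target[:100]  # note: the value is 0 for setosa and 1 for versicolor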

#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]

#generate values for curve overlay
lgs_curve = lambda x: 1 / (1 + np.exp(-(m*x+b)))
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)

#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")

The documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression describes a regularization parameter C that can be passed as an argument when constructing the logistic regression model:

C : float, default: 1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

If you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with a sharper transition of the binary output (response) variable from 0 to 1. Still larger values give models with high variance: they follow the transition in the training data more closely, which I think is what you are expecting (try setting C as high as 10 and plot the result), but they also run a greater risk of overfitting. The default C=1 and smaller values give models with higher bias that are more likely to underfit. This is the well-known bias-variance trade-off in machine learning.
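To make the effect of C concrete, here is a minimal sketch that refits the model for a few values of C and prints the fitted slope and intercept. The grid of C values and the names y and slopes are my own choices for illustration, not something from the question.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()
lengths = data.data[:100, :1]   # sepal length, first two classes only
y = data.target[:100]           # 0 = setosa, 1 = versicolor

# Illustrative grid of C values (an assumption, chosen to span several orders of magnitude)
slopes = {}
for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(lengths, y)
    slopes[c] = model.coef_[0, 0]
    print(f"C={c:>6}: slope={slopes[c]:.3f}, intercept={model.intercept_[0]:.3f}")
```

A larger slope means a steeper sigmoid, so the printed slopes should grow as the regularization is weakened.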

You can always use techniques like cross-validation to choose the value of C that is right for you. The following code and figure show the probability curves fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):

x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))

C = list(range(1, 11))
for c in C:
    lgs = LogisticRegression(C=c)  # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:, 1]  # probability of class 1, computed directly
    plt.plot(x_values, y_values, label=str(c))  # build the label here; map() is not subscriptable in Python 3

plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()

[image: predicted probabilities from models fitted with different values of C]
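The cross-validation suggestion above could be sketched roughly as follows, using GridSearchCV. The grid of C values, the 5 folds, and the choice of log loss as the score are assumptions on my part. I picked log loss because it judges the predicted probabilities themselves, whereas accuracy only looks at which side of 0.5 a prediction falls.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

data = datasets.load_iris()
lengths = data.data[:100, :1]   # sepal length, first two classes only
y = data.target[:100]           # 0 = setosa, 1 = versicolor

# Candidate values of C (illustrative, not a recommendation)
c_grid = [0.01, 0.1, 1, 10, 100]

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": c_grid},
    scoring="neg_log_loss",  # higher (closer to 0) is better
    cv=5,
)
search.fit(lengths, y)

print("best C:", search.best_params_["C"])
print("mean cross-validated log loss:", -search.best_score_)
```

The selected C depends on the grid and the scoring rule, so treat the result as a starting point rather than a definitive answer.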

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused about the logistic (sigmoid) function: the decision function of logistic regression is a line.
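As a small sketch of that distinction (the variable names are mine), decision_function returns the linear score m*x + b, predict_proba applies the sigmoid to that score, and the two classes are separated where the score is zero, i.e. at x = -b/m:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()
lengths = data.data[:100, :1]   # sepal length, first two classes only
y = data.target[:100]           # 0 = setosa, 1 = versicolor

lgs = LogisticRegression().fit(lengths, y)
m = lgs.coef_[0, 0]
b = lgs.intercept_[0]

x = np.linspace(2, 10, 9).reshape(-1, 1)
linear_score = lgs.decision_function(x)   # the line m*x + b
proba = lgs.predict_proba(x)[:, 1]        # sigmoid applied to that line

# Sepal length at which the predicted probability is exactly 0.5
boundary = -b / m
print("decision boundary at sepal length:", boundary)
```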

Your probability graph looks flat because you have, in a sense, "zoomed in" too much.

If you look at the middle of a sigmoid function, it is almost linear, because the second derivative is almost 0 there (see, for example, a Wolfram Alpha graph).

Please note that the values we are talking about are the results of -(m*x+b)

When we reduce the limits of your graph, say by using x_values = np.linspace(4, 7, 100), we get something which looks like a line: [image]

On the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid: [image]
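The "zoom" effect can also be checked numerically without any plotting. This is only a sketch: it compares the sigmoid with its tangent line at the midpoint (value 0.5, slope 1/4) over a narrow and a wide window of the input z = m*x+b. The helper name max_gap_from_tangent and the window widths are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_gap_from_tangent(half_width):
    """Largest distance between the sigmoid and its tangent at z = 0 on [-half_width, half_width]."""
    z = np.linspace(-half_width, half_width, 201)
    tangent = 0.5 + z / 4.0   # sigmoid(0) = 0.5 and its slope there is 1/4
    return np.max(np.abs(sigmoid(z) - tangent))

narrow = max_gap_from_tangent(0.5)   # zoomed in: the curve is hard to tell from a line
wide = max_gap_from_tangent(6.0)     # zoomed out: the S shape is obvious

print("narrow window gap:", narrow)
print("wide window gap:", wide)
```

With a small fitted slope m, the plotted range of x covers only a narrow range of z, which is why the curve in the question looks flat.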
