sklearn LogisticRegression - plot displays too small coefficient

Question

I am attempting to fit a logistic regression model to sklearn's iris dataset. The probability curve I get looks too flat, i.e. the coefficient seems too small. I would expect a probability above 90% for sepal length > 7.

Is this probability curve indeed wrong? If so, what might cause that in my code?

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math

from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()

#get relevant data
lengths = data.data[:100, :1]
is_setosa = data.target[:100]         

#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]

#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))         
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)

#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")

Answer 1

If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression , you will find a regularization parameter C that can be passed as an argument when constructing the logistic regression model.

C : float, default: 1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

Now, if you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with sharper transitions from 0 to 1 in the binary output (response) variable. Still larger values fit models with high variance: they follow the transition in the training data more closely, which I think is what you are expecting (you could try setting C as high as 10 and plotting), but they also run a higher risk of overfitting. The default value C=1 and smaller values lead to higher bias and are more likely to underfit. This is the well-known bias-variance trade-off in machine learning.
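As a rough illustration of the effect described above, the sketch below refits the model from the question for a few values of C and prints the fitted slope. The particular C values and the `max_iter=1000` setting are my own choices for the example, not something from the question.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()
lengths = data.data[:100, :1]   # sepal length, first two classes only
labels = data.target[:100]      # class labels 0 and 1 for those rows

# Refit with increasingly weak regularization (larger C) and record
# the slope of the fitted logit m*x + b.
slopes = {}
for C in (0.1, 1.0, 10.0, 1000.0):  # illustrative values, chosen arbitrarily
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(lengths, labels)
    slopes[C] = model.coef_[0, 0]
    print(f"C={C:g}: coefficient={slopes[C]:.3f}, intercept={model.intercept_[0]:.3f}")
```

A larger coefficient means a steeper sigmoid, so the printed slopes should grow as C grows.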

You can always use techniques like cross-validation to choose the C value that is right for you. The following code / figure shows the probability curves fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):

x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))

C = list(range(1, 11))
labels = list(map(str, C)) # map() returns an iterator in Python 3, so build a list before indexing
for i in range(len(C)): 
    lgs = LogisticRegression(C = C[i]) # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:,1] # use this function to compute probability directly
    plt.plot(x_values, y_values, label=labels[i])

plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()

Predicted probs with models fitted with different values of `C`
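The cross-validation suggestion above could look something like the following minimal sketch. It assumes a 5-fold grid search scored by log loss; the candidate grid, the fold count, and the scoring choice are all illustrative assumptions rather than recommendations.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

data = datasets.load_iris()
lengths = data.data[:100, :1]   # sepal length, first two classes only
labels = data.target[:100]

# Candidate values for C (an arbitrary grid chosen for this example).
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# Score by negative log loss so that the quality of the predicted
# probabilities, not only the predicted class, drives the choice.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="neg_log_loss",
)
search.fit(lengths, labels)

best_C = search.best_params_["C"]
print("best C:", best_C)
print("mean cross-validated score:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the selected C, so it can be passed to the plotting code in place of `lgs`.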

Answer 2

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the Logistic/sigmoid function. The decision function of Logistic Regression is a line.
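To make the distinction concrete, here is a small sketch (using the same data slice as the question and default model settings) that compares the linear decision function with the probability curve. The variable names are mine.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

data = datasets.load_iris()
lengths = data.data[:100, :1]   # sepal length, first two classes only
labels = data.target[:100]

model = LogisticRegression().fit(lengths, labels)
m = model.coef_[0, 0]
b = model.intercept_[0]

x = np.linspace(2, 10, 100).reshape(-1, 1)
linear_score = model.decision_function(x)    # the straight line m*x + b
probability = model.predict_proba(x)[:, 1]   # the sigmoid of that line

# The decision function is linear in x ...
print(np.allclose(linear_score, m * x[:, 0] + b))
# ... and the probability is the logistic function applied to it.
print(np.allclose(probability, 1 / (1 + np.exp(-linear_score))))

# The predicted class flips where the line crosses zero (probability 0.5).
boundary = -b / m
print("decision boundary at sepal length", boundary)
```

So the straight line and the S-shaped curve describe the same model: the curve is just the line passed through the sigmoid.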

Answer 3

Your probability graph looks flat because you have, in a sense, "zoomed in" too much.

If you look at the middle of a sigmoid function, it is almost linear, because the second derivative is close to 0 there (see, for example, a Wolfram Alpha graph).

Please note that the values we are talking about are the results of -(m*x+b).

When we reduce the limits of your graph, say by using x_values = np.linspace(4, 7, 100), we get something which looks like a line.

But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid.
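The "zoomed in" effect can also be checked numerically. The sketch below is an illustration on the plain sigmoid of z = m*x + b, not on the fitted model: it measures how far the sigmoid strays from its best-fit straight line over a narrow and a wide window of z. The helper name and the two windows are made up for this example.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def max_deviation_from_line(z_min, z_max, n=200):
    """Largest gap between the sigmoid and its least-squares line on [z_min, z_max]."""
    z = np.linspace(z_min, z_max, n)
    p = sigmoid(z)
    slope, intercept = np.polyfit(z, p, 1)   # best-fit straight line
    return np.max(np.abs(p - (slope * z + intercept)))

narrow = max_deviation_from_line(-1.0, 1.0)    # window around the midpoint
wide = max_deviation_from_line(-10.0, 10.0)    # window covering both tails

print("narrow window deviation:", narrow)
print("wide window deviation:", wide)
```

Over the narrow window the sigmoid is nearly indistinguishable from a line, while over the wide window the S shape dominates, which is the same thing the two plot ranges show.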

3 answers

solution1
2 ACCEPTED 2017-03-08 21:11:44

solution2
0 2017-03-08 18:20:56

solution3
0 2017-03-08 19:51:14