All coefficients turn zero in logistic regression using scikit-learn

I am working on logistic regression using scikit-learn in Python. My data file can be downloaded via the following link.

link for data

Below is my code for the machine learning part.

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.3)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)

When I print the coefficients, they all turn out to be zero. Can anyone advise me on that?

Let me explain my objective a little. This looks like a classification problem, since y_train and y_test contain only 0 or 1. As a simple example, 0 can be read as "missed" and 1 as "scored". What I am trying to do is compute the probability of scoring for each event in which a shot is taken.

Thanks in advance.

Regards,

Zep

I only changed alpha in the Lasso. My results:

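Your Y variable contains only 0s and 1s. With alpha=.3 the L1 penalty is strong enough to shrink every coefficient to exactly zero, while a much smaller alpha leaves some of them nonzero. If you still want to apply regression to this data, use a GridSearch over different alpha values.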

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.0009)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)

Results

MC learning completed
0.37884924358295613
0.3806187071242917
[ 0.00078099  0.13397938 -0.00554932  0.00194722  0.00232949 -0.01100195
 -0.01363906  0.13031317 -0.00146605]
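
To see why alpha=.3 zeroes everything while alpha=.0009 does not, it can help to compute the smallest alpha at which Lasso returns an all-zero solution. The sketch below assumes scikit-learn's documented Lasso objective, (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1, with the default fit_intercept=True. Under that assumption the threshold is max|Xc.T @ yc| / n_samples, where Xc and yc are the centered data. It uses synthetic data in place of data.csv, and alpha_max is only an illustrative variable name.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in for data.csv (assumption): 5 features, 0/1 target
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(float)

# Center the data, as Lasso does internally when fit_intercept=True
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Smallest alpha for which every coefficient is exactly zero
alpha_max = np.max(np.abs(Xc.T @ yc)) / n

coef_above = Lasso(alpha=alpha_max * 1.01).fit(X, y).coef_
coef_below = Lasso(alpha=alpha_max * 0.5).fit(X, y).coef_

print("alpha_max:", alpha_max)
print("just above alpha_max:", coef_above)
print("well below alpha_max:", coef_below)
```

For standardized features and a 0/1 target, this threshold is at most roughly 0.5 (the largest possible standard deviation of a binary variable), so an alpha of .3 is already close to the ceiling.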

GridSearchCV

from sklearn.model_selection import GridSearchCV
import numpy as np

# Define the grid for the alpha parameter
parameters = {'alpha':[0.01, 0.001, 0.0005]}

# Fit it on X, Y and define the cv parameter for cross-validation
clf = GridSearchCV(lasso, parameters, cv = 3)
clf.fit(X, Y)

# Get the best parameters and model
print(clf.best_estimator_)

Note: to search a wider parameter space, use: parameters = {'alpha': np.arange(0.001,1,0.02)}
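
As a rough illustration of the grid search above, the following sketch runs GridSearchCV over the np.arange grid and then reads back the selected alpha, its cross-validated score, and the number of nonzero coefficients. The data is synthetic (an assumption, since data.csv is not reproduced here), and the names param_grid and search are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for data.csv (assumption): 0/1 target driven by feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(float)

# Same grid as in the note above
param_grid = {'alpha': np.arange(0.001, 1, 0.02)}
search = GridSearchCV(Lasso(), param_grid, cv=3)
search.fit(X, y)

best_alpha = search.best_params_['alpha']   # alpha with the best mean CV score
best_r2 = search.best_score_                # mean cross-validated R^2 at that alpha
n_nonzero = int(np.sum(search.best_estimator_.coef_ != 0))

print(best_alpha, best_r2, n_nonzero)
```

Checking n_nonzero is a quick way to confirm that the chosen alpha did not collapse the model back to all-zero coefficients.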


EDIT 1: Taking into account the last paragraph that you just added to your question, use this:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
# Flatten to a 1-D array; classifiers expect y of shape (n_samples,)
Y = dataY.values.ravel()

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)

# Logistic Regression (aka logit, MaxEnt) classifier.
lr = LogisticRegression()
lr.fit(X_train,y_train)

# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])

# The first test sample has probability 0.8705 of belonging to class 0 and 0.1295 of belonging to class 1
[[0.87046267 0.12953733]
 [0.87797594 0.12202406]
 [0.80046704 0.19953296]]
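
The snippets above import roc_auc_score but never call it. One possible way to use it is to score the predicted probability of class 1 (the "scored" outcome) against the true labels, as in the sketch below. The data is synthetic (an assumption), and p_scored is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data.csv (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

lr = LogisticRegression().fit(X_train, y_train)

# Columns of predict_proba follow lr.classes_, so look up class 1 explicitly
col = list(lr.classes_).index(1)
p_scored = lr.predict_proba(X_test)[:, col]

# AUC: 0.5 means no better than chance, 1.0 means perfect ranking
auc = roc_auc_score(y_test, p_scored)
print(auc)
```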

The data in Y looks like class labels: each value is either 0 or 1. So you should use a classification algorithm and then use the fitted coefficients to get the probability.

Most scikit-learn classifiers have a predict_proba() method, which you can use to get the probability directly.
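
For a binary LogisticRegression, the two routes agree: applying the logistic (sigmoid) function to X @ coef_ + intercept_ should reproduce the class 1 column of predict_proba(). The sketch below checks this on synthetic data (an assumption); p_manual and p_sklearn are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for data.csv (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

lr = LogisticRegression().fit(X, y)

# Route 1: from the coefficients, via the sigmoid of the linear score
z = X @ lr.coef_.ravel() + lr.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Route 2: directly from the classifier (column 1 corresponds to class 1 here)
p_sklearn = lr.predict_proba(X)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```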

If you absolutely need to use a regression model, you can try LinearRegression, which uses ordinary least squares, or LassoCV, which tunes alpha automatically by cross-validation.
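
A minimal LassoCV sketch, on synthetic data rather than data.csv (an assumption), might look like this. The alpha it selects is exposed as the alpha_ attribute after fitting.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for data.csv (assumption): 0/1 target driven by feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(float)

# LassoCV builds its own grid of alphas and picks one by 3-fold cross-validation
model = LassoCV(cv=3).fit(X, y)

print("chosen alpha:", model.alpha_)
print("coefficients:", model.coef_)
```

Note that the output is still an unbounded regression value, not a probability, so predictions can fall outside the 0 to 1 range.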
