
scikit-learn: how to see feature importance using a pipeline, and how to do a logistic + ridge regression

Two questions:

I'm trying to run a model that predicts churn. A lot of my features have multicollinearity issues. To address this problem I'm trying to penalize the coefficients with Ridge.

More specifically, I'm trying to run a logistic regression but apply ridge penalties to the model (not sure if that makes sense)...

Questions:

  1. Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and pass it some parameter for a ridge penalty (i.e. LogisticRegression(apply_penalty=Ridge))?

  2. I'm trying to determine feature importance and through some research, it seems like I need to use this:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

However, I'm confused about how to use this if my model has been built around the sklearn.pipeline.make_pipeline function.

I'm just trying to figure out which independent variables have the most importance in predicting my label.

Code below for reference

#imports (assuming df_dummy is a one-hot-encoded DataFrame prepared earlier)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

#prep data
X_prep = df_dummy.drop(columns='CHURN_FLAG')

#set predictor and target variables
X = X_prep #all features except churn_flag
y = df_dummy["CHURN_FLAG"]

#create train /test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)


'''
StandardScaler() -> incoming data needs to be standardized before any other transformation is performed on it.
SelectKBest() -> This comes from the feature_selection module of scikit-learn. It selects the best features based on a specified scoring function (in this case, f_regression).
 The number of features kept is specified by the parameter k. Even within the selected features, we want to vary the final set of features fed to the model and find what performs best. We can do that with GridSearchCV.
Ridge() -> This is the estimator that performs the actual regression -- used to reduce the effect of multicollinearity.
GridSearchCV -> Besides searching over all permutations of the selected parameters, GridSearchCV performs cross-validation on the training data.
'''
#Setting up a pipeline
pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression), Ridge())

#A quick way to get a list of parameters that a pipeline can accept
#pipe.get_params().keys()

#putting together a parameter grid to search over using grid search
params = {
    'selectkbest__k': [1, 2, 3, 4, 5, 6],
    'ridge__fit_intercept': [True, False],
    'ridge__alpha': [0.01, 0.1, 1, 10],
    'ridge__solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}
#setting up the grid search
gs=GridSearchCV(pipe,params,n_jobs=-1,cv=5)
#fitting gs to training data
gs.fit(Xtrain, ytrain)

#building a dataframe from cross-validation data
df_cv_scores=pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score')

#selecting specific columns to create a view
df_cv_scores[['params','split0_test_score', 'split1_test_score', 'split2_test_score',
       'split3_test_score', 'split4_test_score', 'mean_test_score',
       'std_test_score', 'rank_test_score']].head()

#checking the selected permutation of parameters
gs.best_params_

'''
Finally, we can predict target values for the test set by passing its feature matrix to gs.
The predicted values can be compared with actual target values to visualize and communicate performance of the model.
'''
#checking how well the model does on the holdout-set
gs.score(Xtest,ytest)

#plotting predicted churn/active vs actual churn/active
y_preds=gs.predict(Xtest)
plt.scatter(ytest,y_preds)

Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and pass it some parameter for a ridge penalty (i.e. LogisticRegression(apply_penalty=Ridge))?

The choice between ridge regression and logistic regression here comes down to whether you are doing regression or classification. If you want to predict the amount of churn on some continuous scale, use Ridge; if you want to predict whether someone churned, or is likely to churn, use logistic regression.

Sklearn's LogisticRegression uses an L2 penalty by default, which is the same regularization that ridge regression uses. So you should be fine using that if L2 regularization is what you want :)
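For example, a classification version of the pipeline from the question could look roughly like this (a sketch, reusing Xtrain/ytrain from above; f_classif replaces f_regression because the target is a class label, and C is the inverse of the regularization strength, so smaller C means a stronger ridge-style penalty):

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#same shape as the question's pipeline, but ending in a classifier
clf_pipe = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif),
                         LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000))

#C controls the strength of the L2 (ridge-style) penalty
clf_params = {
    'selectkbest__k': [1, 2, 3, 4, 5, 6],
    'logisticregression__C': [0.01, 0.1, 1, 10],
}

clf_gs = GridSearchCV(clf_pipe, clf_params, n_jobs=-1, cv=5)
clf_gs.fit(Xtrain, ytrain)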

I'm trying to determine feature importance and through some research, it seems like I need to use this.

In general you can access the elements of a pipeline through the named_steps attribute. With make_pipeline the key is the lowercased class name, so in your case if you wanted to access SelectKBest you could do:

pipe.named_steps["selectkbest"].get_support()

get_support() returns a boolean mask over the input features; on recent scikit-learn versions, get_feature_names_out() will give you the selected feature names directly.

That tells you which features were kept; now you still need the values. For that you have to access the model's learned coefficients. For ridge or logistic regression it should be something like:

pipe.named_steps["logisticregression"].coef_
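One caveat, and a sketch assuming the Ridge pipeline and grid search from the question: GridSearchCV fits a clone of the pipeline, so the fitted steps live on gs.best_estimator_ (refit=True is the default), not on the original pipe object. Putting the pieces together to map coefficients back to column names could look something like this:

#the fitted pipeline found by the grid search
best_pipe = gs.best_estimator_

#boolean mask of which original columns SelectKBest kept
mask = best_pipe.named_steps["selectkbest"].get_support()
selected_features = X.columns[mask]

#coefficients of the final estimator, aligned with the selected features
coefs = best_pipe.named_steps["ridge"].coef_

#rank features by the magnitude of their coefficients
feature_importance = pd.Series(coefs, index=selected_features).sort_values(key=abs, ascending=False)
print(feature_importance)

Because the StandardScaler step puts every feature on the same scale, comparing the absolute size of these coefficients is a reasonable first pass at which independent variables matter most for predicting the label.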

I have a blog post about this if you want a more detailed tutorial here
