
Scikit-Learn: how to perform best subset GLM Poisson Regression?

Can someone show me how to perform best-subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I do not know which scikit-learn function performs best subset selection, or how to embed it in a pipeline and GridSearchCV.

In addition, how do I include interaction terms among the features, so that they are also candidates for the best subset algorithm? And how do I embed this in the pipeline and GridSearchCV?

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, KFold

continuous_transformer = Pipeline(steps=[('std_scaler', StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder', OneHotEncoder(drop='first'))])
preprocessor = ColumnTransformer(
    transformers=[('continuous', continuous_transformer, continuous_col),
                  ('discrete', discrete_transformer, discrete_col)],
    remainder='passthrough')

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('glm_model', PoissonRegressor(alpha=0, fit_intercept=True))])

param_grid = {  ??? different combinations of features ????}

gs_en_cv = GridSearchCV(pipeline, param_grid=param_grid,
                        cv=KFold(n_splits=10, shuffle=True, random_state=123),
                        scoring='neg_root_mean_squared_error',
                        n_jobs=-1, return_train_score=True)

Currently, as far as I understand, sklearn has no "brute force" / exhaustive feature search for best subset. However, there are various feature selection classes, for example RFE and SelectFromModel (wrapper types that wrap a model) and SelectKBest (a filter type that exposes .transform()).

Now, pipelining for this can be tricky. When you stack classes/methods in a Pipeline and call .fit(), every step except the last has to expose .transform(): the output of each step's .transform() becomes the input of the next step, and so on. The last step can be any valid model, but all previous steps must expose .transform() in order to chain into one another. So your code will differ depending on which feature selection approach you pick. See below.
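
As a minimal sketch of that chaining (my example, not from the question; SelectKBest stands in for any step that exposes .transform(), and score_func=f_regression is assumed because the target below is continuous):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import PoissonRegressor
from sklearn.datasets import load_diabetes

X, y = load_diabetes(as_frame=True, return_X_y=True)

pipe = Pipeline(steps=[('scale', StandardScaler()),                ### fits, then .transform() feeds the next step
                       ('select', SelectKBest(f_regression, k=5)), ### fits, then .transform() feeds the model
                       ('model', PoissonRegressor())])             ### final step: only needs .fit()/.predict()
pipe.fit(X, y)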


Pablo Picasso is widely quoted as having said that "good artists borrow, great artists steal"... So, following this great answer https://stackoverflow.com/a/42271829/4471672, let's borrow, fix, and expand it a bit further.


Imports

### get imports
import itertools
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop


from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor

### if working in Jupyter notebooks, this allows multiple prints per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Data

X, y = load_diabetes(as_frame=True, return_X_y=True)
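
(The diabetes target is positive-valued, so it is acceptable input for PoissonRegressor, which requires non-negative y; it stands in here for your count data.)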

Supplemental function



### make a parameter grid for your GridSearchCV
### code borrowed from the answer mentioned above and adjusted to work with Python 3

def make_param_grids(steps, param_grids):

    final_params=[]

    # itertools.product takes the Cartesian product, so
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name must correspond,
        # i.e. the step named 'transform' pairs with 'ss', 'classifier' with 'rf', etc.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name]=[value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name+'__'+param]=value
        #Append this dictionary to final params            
        final_params.append(current_grid)

    return final_params

#1 Example using RFE feature selection class

(i.e. where the feature selection class is a wrapper that contains the model itself, so it sits in the final pipeline step)

### pipelines chain from one step to the next as long as each previous step exposes .transform()

### adjust the next steps (below, in all_param_grids) to fit your problem space

### RFE is a wrapper that wraps your model; a similar wrapper-type feature selection algorithm is SelectFromModel:
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

pipeline_steps = {'transform':['ss'], ### to try different steps here, list them, e.g. ['ss','xx'], and add 'xx' to all_param_grids as well (or use the preprocessor from your question)
                  'classifier':['rf']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(), ### you could put your feature preprocessing here instead; this is just an example
                          'with_mean':[True,False]
                         }, 

                   'rf':{'object':RFE(estimator=PoissonRegressor(), 
                                        step=1,
                                        verbose=0),
                         'n_features_to_select':[1,2,3,4,5,6,7,8,9,10], ### change this list to e.g. [1] to see how it influences your grid search accuracy
                         'estimator__fit_intercept':[True,False], ### tuning your model's hyperparams
                         'estimator__alpha':[0.1,0.5,0.7,1] ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list
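
With one option per step as above, param_grids_list should come out as a single-element list, roughly:

[{'transform': [StandardScaler()],
  'transform__with_mean': [True, False],
  'classifier': [RFE(estimator=PoissonRegressor())],
  'classifier__n_features_to_select': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  'classifier__estimator__fit_intercept': [True, False],
  'classifier__estimator__alpha': [0.1, 0.5, 0.7, 1]}]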


### put your pipeline together, passing classes as placeholders to initialize it (in case you, for example, try StandardScaler() AND another transform in the steps above)
### at .fit() the parameters from the param grid are substituted in and evaluated
pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',RFE(estimator=PoissonRegressor()))])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"



#2 Example using KBest feature selection class

(an example where the feature selection class exposes .transform())


pipeline_steps = {'transform':['ss'],
                  'select':['kbest'], ### if you have another feature selector that exposes .transform(), add it to this list and to all_param_grids; that produces a grid for each variation: transform -> select[1] -> classifier and transform -> select[2] -> classifier
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   
                   'kbest': {'object': SelectKBest(score_func=f_regression), ### the default f_classif is meant for classification; f_regression suits the continuous target here
                             'k' : [1,2,3,4,5,6,7,8,9,10] ### change this parameter to 1 to see how it influences accuracy during grid search and to verify it influences your next step
                             },

                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your model's hyperparams
                         'fit_intercept':[True,False], ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('select', SelectKBest()), ### again, if you listed two selectors in your param grid you need not put both here; SelectKBest() is only an initializer for the pipeline
                       ('classifier',PoissonRegressor())])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"



#3 Brute force / looping over all possible combinations with pipeline

pipeline_steps = {'transform':['ss'],
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your models hyperparams
                         'fit_intercept':[True,False], ### tuning your models hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',PoissonRegressor())])
pipe


feature_combo = []  ### record feature combination
score = [] ### record GridSearchCV best score
params = [] ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff)+1)): ### tqdm lets you see overall progress bar here
    for subset in itertools.combinations(stuff, L): ### create all possible combinations of features
        ### run it
        gs_en_cv = GridSearchCV(pipe, 
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle = True,
                                         random_state=123),
                               scoring = 'neg_root_mean_squared_error',
                               return_train_score=True,

                                ### change verbose to higher number for more print outs
                                ### about fitting info which can also verify that 
                                ### all parameters you specify are getting fit 
                               verbose = 0)

        fitted = gs_en_cv.fit(X[list(subset)],y)
    
        score.append(fitted.best_score_) ### append results
        params.append(fitted.best_params_) ### append results
        feature_combo.append(list(subset)) ### append results


### assemble your dataframe, sort and print out top feature combo and model params results
df = pd.DataFrame({'feature_combo':feature_combo,
                   'score':score,
                   'params':params})

df.sort_values(by='score', ascending=False,inplace=True)
df.head(1)
df.head(1).params.iloc[0]
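
Be aware of the cost: with 10 features this loop runs a full grid search for every non-empty feature subset, i.e. 2**10 - 1 = 1023 searches. A quick sanity check:

import math
print(sum(math.comb(10, L) for L in range(1, 11))) ### 1023 subsets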



PS
For interactions (I assume you mean creating new features by combining the original ones?) I would just create those interaction features up front and include them at your .fit(). Otherwise, how do you know you have found the best interaction features if you only create interactions AFTER selecting a subset? Why not create them from the start and let the grid search's feature selection step tell you what's best?
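
As a sketch of that idea (my example, not from the question; PolynomialFeatures with interaction_only=True is a standard way to generate all pairwise interaction terms, reusing the diabetes data from above):

from sklearn.preprocessing import PolynomialFeatures

### generate every pairwise interaction up front, then let the selector
### decide which original/interaction features survive
pipe = Pipeline(steps=[('interact', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
                       ('transform', StandardScaler()),
                       ('select', SelectKBest(f_regression, k=10)),
                       ('classifier', PoissonRegressor())])
pipe.fit(X, y)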
