
Scikit-Learn: how to perform best subset GLM Poisson regression?

Can someone show me how to perform best subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I do not know which scikit-learn function does best subset selection, or how to embed it into a Pipeline and GridSearchCV.

In addition, how do I include interaction terms among the features to be selected by the best subset algorithm? And how do I embed this in a Pipeline and GridSearchCV?

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# continuous_col and discrete_col are assumed to be lists of column names (not shown here)
continuous_transformer = Pipeline(steps=[('std_scaler',StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder',OneHotEncoder(drop='first'))])
preprocessor =  ColumnTransformer(transformers = [('continuous',continuous_transformer,continuous_col),
                                                  ('discrete',discrete_transformer,discrete_col)],remainder='passthrough')

pipeline = Pipeline(steps=[('preprocessor',preprocessor),
                           ('glm_model',PoissonRegressor(alpha=0, fit_intercept=True))])

param_grid = {  ??? different combinations of features ????}

gs_en_cv = GridSearchCV(pipeline, param_grid=param_grid, cv=KFold(n_splits=10,shuffle = True,random_state=123), scoring = 'neg_root_mean_squared_error', n_jobs=-1, return_train_score=True)

As far as I understand, sklearn currently has no "brute force" (exhaustive) feature search for the best subset. However, it does provide various feature selection classes, such as RFE, SelectFromModel, and SelectKBest, which come up in the examples in this answer.


Pipelining this can be tricky. When you stack steps in a pipeline and call .fit(), every step except the last must expose .transform(). The output of each step's .transform() is used as the input to the next step, and so on. The last step can be any valid model, but all previous steps must expose .transform() so that they can be chained one to another. So your code will differ depending on which feature selection approach you pick. See below.
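To make the chaining rule concrete, here is a minimal sketch. The synthetic count data, the step names, and the choice of `f_regression` as the scoring function are assumptions made for illustration, not part of the original question.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic count data (an assumption, purely for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.poisson(lam=np.exp(0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1]))

demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),                             # exposes .transform()
    ("select", SelectKBest(score_func=f_regression, k=2)),   # exposes .transform()
    ("model", PoissonRegressor(alpha=0.1)),                  # final step: only needs .fit()/.predict()
])
demo_pipe.fit(X_demo, y_demo)

# Every step before the final one can be applied as a chain of transformers
X_reduced = demo_pipe[:-1].transform(X_demo)
print(X_reduced.shape)               # two columns survive the selection step
print(demo_pipe.predict(X_demo[:3]))
```

Slicing the pipeline with `demo_pipe[:-1]` returns a sub-pipeline of the transformer steps only, which is a convenient way to inspect what the final model actually receives.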


Pablo Picasso is widely quoted as having said that "good artists borrow, great artists steal." So, following this great answer https://stackoverflow.com/a/42271829/4471672, let's borrow, fix, and expand a bit further.


Imports

### get imports
import itertools
from itertools import combinations
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop


from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor

### if working in Jupyter notebooks allows multiple prints per cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Data

X, y = load_diabetes(as_frame=True, return_X_y=True)

Supplemental function



### make parameter grid for your GridSearchCV 
### code borrowed and adjusted to work with Python 3.++ from answer mentioned above

def make_param_grids(steps, param_grids):

    final_params=[]

    # Itertools.product will do a permutation such that 
    # (pca OR svd) AND (svm OR rf) will become ->
    # (pca, svm) , (pca, rf) , (svd, svm) , (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # Step_name and estimator_name should correspond
        # i.e preprocessor must be from pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name]=[value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name+'__'+param]=value
        #Append this dictionary to final params            
        final_params.append(current_grid)

    return final_params
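To see what kind of structure the helper above produces, the following sketch repeats the same expansion logic in compact form on a toy input. The step name 'mm' and the string placeholders standing in for estimator objects are hypothetical and exist only to keep the example dependency-free.

```python
import itertools

# Hypothetical choices: two candidate transformers, one candidate model.
# Strings stand in for the real estimator objects.
steps = {"transform": ["ss", "mm"], "classifier": ["pr"]}
param_grids = {
    "ss": {"object": "StandardScaler()", "with_mean": [True, False]},
    "mm": {"object": "MinMaxScaler()"},
    "pr": {"object": "PoissonRegressor()", "alpha": [0.1, 1]},
}

grids = []
# One grid per combination of step choices: (ss, pr) and (mm, pr)
for names in itertools.product(*steps.values()):
    grid = {}
    for step_name, est_name in zip(steps, names):
        for param, value in param_grids[est_name].items():
            if param == "object":
                grid[step_name] = [value]               # swaps the estimator in that step
            else:
                grid[f"{step_name}__{param}"] = value   # tunes that estimator's parameter
    grids.append(grid)

for g in grids:
    print(g)
```

Each dictionary in the list is a separate grid, so GridSearchCV only tries parameters that belong to the estimator selected for that step.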

#1 Example using RFE feature selection class

(i.e., where the feature selection class is a wrapper around the model and is used as the final step of the pipeline)

### pipelines work from one step to another as long as previous step returns transform 

### adjust next steps to fit your problem space
### below in all_params_grid 

### RFE is a wrapper, that wraps your model, another similar feature selection algorithm that's a wrapper is 
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html 

pipeline_steps = {'transform':['ss'], ### if you wanted to try different steps here you could put them in the list ['ss', 'xx' etc] and would have to add 'xx' in your all_params_grid as well. or your pre processor mentioned in your question
                  'classifier':['rf']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(), ### here instead you could put your feature pre processing code, this is just as an example
                          'with_mean':[True,False]
                         }, 

                   'rf':{'object':RFE(estimator=PoissonRegressor(), 
                                        step=1,
                                        verbose=0),
                         'n_features_to_select':[1,2,3,4,5,6,7,8,9,10], ###change this parameter to  1  for example to see how it influences accuracy of your grid search
                         'estimator__fit_intercept':[True,False], ### tuning your models hyperparams
                         'estimator__alpha':[0.1,0.5,0.7,1] #### tuning your models hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


### put your pipe together and put xyz() classes as placeholders to initialize your pipeline in case you for example use StandardScaler() AND another transform from steps above
### at .fit() all parameters passed from param grid will be passed and evaluated
pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',RFE(estimator=PoissonRegressor()))])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

### print() also works outside Jupyter, where bare f-strings would display nothing
print("-" * 80)
print(f"best score is {gs_en_cv.best_score_}")
print("-" * 80)
print("best params are")
print(gs_en_cv.best_params_)
print("good luck")



#2 Example using KBest feature selection class

(an example where feature selection exposes .transform())


from sklearn.feature_selection import f_regression ### SelectKBest defaults to f_classif, which only suits classification targets

pipeline_steps = {'transform':['ss'],
                  'select':['kbest'], ### if you have another feature selector that exposes .transform() you could put it in the list and add to all_params_grid and that would produce a grid for all variations transform -> select[1] -> classifier and another transform -> select[2] -> classifier
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   
                   'kbest': {'object': SelectKBest(score_func=f_regression), ### regression scoring function for a numeric target
                             'k' : [1,2,3,4,5,6,7,8,9,10] ### change this parameter to 1 to see how it influences accuracy during grid search and to validate it influences your next step
                             },

                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your models hyperparams
                         'fit_intercept':[True,False], ### tuning your models hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('select', SelectKBest()), ### only a placeholder to initialize the pipeline; if your param grid lists several selectors for this step, they replace it during the search
                       ('classifier',PoissonRegressor())])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

### print() also works outside Jupyter, where bare f-strings would display nothing
print("-" * 80)
print(f"best score is {gs_en_cv.best_score_}")
print("-" * 80)
print("best params are")
print(gs_en_cv.best_params_)
print("good luck")



#3 Brute force / looping over all possible combinations with pipeline

pipeline_steps = {'transform':['ss'],
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your models hyperparams
                         'fit_intercept':[True,False], ### tuning your models hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',PoissonRegressor())])
pipe


feature_combo = []  ### record feature combination
score = [] ### record GridSearchCV best score
params = [] ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff)+1)): ### tqdm lets you see overall progress bar here
    for subset in itertools.combinations(stuff, L): ### create all possible combinations of features
        ### run it
        gs_en_cv = GridSearchCV(pipe, 
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle = True,
                                         random_state=123),
                               scoring = 'neg_root_mean_squared_error',
                               return_train_score=True,

                                ### change verbose to higher number for more print outs
                                ### about fitting info which can also verify that 
                                ### all parameters you specify are getting fit 
                               verbose = 0)

        fitted = gs_en_cv.fit(X[list(subset)],y)
    
        score.append(fitted.best_score_) ### append results
        params.append(fitted.best_params_) ### append results
        feature_combo.append(list(subset)) ### append results


### assemble your dataframe, sort and print out top feature combo and model params results
df = pd.DataFrame({'feature_combo':feature_combo,
                   'score':score,
                   'params':params})

df.sort_values(by='score', ascending=False,inplace=True)
df.head(1)
df.head(1).params.iloc[0]
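Before running the brute-force loop on a wider dataset, it may help to estimate how much work it implies. The sketch below counts the feature subsets and cross-validation fits for the setup in example #3 (10 features, 2 x 5 x 2 grid points, 3 folds); the variable names are introduced here for illustration.

```python
from itertools import combinations
from math import comb

n_features = 10              # load_diabetes has 10 feature columns
n_grid_points = 2 * 5 * 2    # with_mean x alpha x fit_intercept from example #3
n_folds = 3

# Number of non-empty feature subsets: C(10,1) + C(10,2) + ... + C(10,10)
n_subsets = sum(comb(n_features, k) for k in range(1, n_features + 1))

# Cross-check by actually enumerating the subsets, as the loop does
n_enumerated = sum(
    1
    for k in range(1, n_features + 1)
    for _ in combinations(range(n_features), k)
)

# Cross-validation fits only; each search also refits once on the full data by default
n_cv_fits = n_subsets * n_grid_points * n_folds
print(n_subsets, n_enumerated, n_cv_fits)  # 1023 1023 61380
```

The number of subsets grows as 2^n - 1, so each extra feature roughly doubles the run time. This is why the exhaustive approach is only practical for a small number of candidate features.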



PS
For interactions (I guess you mean creating new features by combining the original ones?), I would create the interaction features beforehand and include them in the data you pass to .fit(). Otherwise, if you build the interactions only AFTER selecting a subset of features, how do you know you got the best interaction features? Why not create the interactions from the start and let the feature selection part of the grid search tell you what's best?
