Scikit-Learn: how to perform best subset GLM Poisson Regression?
Can someone show me how to perform best subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I do not know which scikit-learn function does best subset selection, or how to embed it into a Pipeline and GridSearchCV.
In addition, how do I include interaction terms among the features, so that the best subset algorithm can select from them? And how do I embed this in the Pipeline and GridSearchCV?
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# continuous_col and discrete_col are assumed to be lists of column names defined elsewhere
continuous_transformer = Pipeline(steps=[('std_scaler', StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder', OneHotEncoder(drop='first'))])
preprocessor = ColumnTransformer(transformers=[('continuous', continuous_transformer, continuous_col),
                                               ('discrete', discrete_transformer, discrete_col)],
                                 remainder='passthrough')
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('glm_model', PoissonRegressor(alpha=0, fit_intercept=True))])
param_grid = { ??? different combinations of features ????}
gs_en_cv = GridSearchCV(pipeline, param_grid=param_grid,
                        cv=KFold(n_splits=10, shuffle=True, random_state=123),
                        scoring='neg_root_mean_squared_error', n_jobs=-1, return_train_score=True)
As far as I understand, sklearn currently has no "brute force" (exhaustive) feature search for the best subset. However, it does provide various feature selection classes, such as RFE, SelectFromModel and SelectKBest, which are used or referenced in the examples below.
Pipelining this can be tricky. When you stack estimators in a pipeline and call .fit(), every step except the last one has to expose .transform(). The output of each step's .transform() is used as the input to the next step, and so on. The last step can be any valid model, but all previous steps must expose .transform() in order to chain one to another. So your code will differ depending on which feature selection approach you pick. See below.
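To make the .transform() rule concrete, here is a minimal sketch (my own illustration, not part of the examples below). It checks which objects expose .transform() and shows that a pipeline with a model in the middle is rejected. The toy data and the names candidates, X_toy, y_toy and rejected are made up for the illustration, and I am assuming sklearn signals the problem with a TypeError.

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical candidates for pipeline steps
candidates = {
    "StandardScaler": StandardScaler(),
    "SelectKBest": SelectKBest(score_func=f_regression, k=2),
    "PoissonRegressor": PoissonRegressor(),
}

# True means the object can sit in the middle of a pipeline
exposes_transform = {name: hasattr(obj, "transform") for name, obj in candidates.items()}
print(exposes_transform)

# Made-up toy data, only used to trigger the validation
X_toy = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 2.5], [4.0, 3.0, 3.5]]
y_toy = [1, 2, 3, 5]

# A model without .transform() placed before the last step should be rejected
try:
    bad_pipe = Pipeline(steps=[("model", PoissonRegressor()), ("scale", StandardScaler())])
    bad_pipe.fit(X_toy, y_toy)
    rejected = False
except TypeError as err:
    rejected = True
    print("rejected:", err)
```

Depending on the sklearn version, the check may run when the pipeline is constructed or only when .fit() is called, which is why both are inside the try block.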
Pablo Picasso is widely quoted as having said that “good artists borrow, great artists steal.” So, following this great answer https://stackoverflow.com/a/42271829/4471672, let's borrow, fix and expand a bit further.
Imports
### get imports
import itertools
from itertools import combinations
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor
### if working in Jupyter notebooks allows multiple prints per cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Data
X, y = load_diabetes(as_frame=True, return_X_y=True)
Supplemental function
### make parameter grid for your GridSearchCV
### code borrowed from the answer mentioned above and adjusted to work with Python 3
def make_param_grids(steps, param_grids):

    final_params = []

    # itertools.product builds the Cartesian product of the options, so that
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name should correspond,
        # i.e. preprocessor must be from pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to final params
        final_params.append(current_grid)

    return final_params
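The heart of this helper is the itertools.product call. As a small standalone sketch, using the placeholder names from the comment in the function (plain strings here, not real estimators), this is how the step options expand into one combination per grid:

```python
import itertools

# Placeholder step options, taken from the comment in make_param_grids
steps = {"preprocessor": ["pca", "svd"], "classifier": ["svm", "rf"]}

# One tuple per combination of (preprocessor, classifier)
combos = list(itertools.product(*steps.values()))

for combo in combos:
    # Pair each step name with the estimator name chosen for it
    print(dict(zip(steps.keys(), combo)))
```

Each combination then becomes one dictionary in the list that GridSearchCV receives as param_grid.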
(aka where the feature selection class is not used through .transform(), but is a wrapper around the model)
### pipelines work from one step to another as long as the previous step exposes .transform()
### adjust the steps in all_param_grids below to fit your problem space
### RFE is a wrapper that wraps your model; another similar wrapper-type feature selection class is
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
pipeline_steps = {'transform': ['ss'],  ### to try different steps here, list them (['ss', 'xx'] etc.) and add 'xx' to all_param_grids as well; the preprocessor from your question could also go here
                  'classifier': ['rf']}

# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),  ### you could put your own feature preprocessing here instead, this is just an example
                          'with_mean': [True, False]
                          },
                   'rf': {'object': RFE(estimator=PoissonRegressor(),
                                        step=1,
                                        verbose=0),
                          'n_features_to_select': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  ### change this to [1], for example, to see how it influences the score of your grid search
                          'estimator__fit_intercept': [True, False],  ### tuning your model's hyperparams
                          'estimator__alpha': [0.1, 0.5, 0.7, 1]  ### tuning your model's hyperparams
                          }
                   }

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list

### put your pipe together; the classes here are only placeholders to initialize the pipeline,
### for example if you use StandardScaler() AND another transform from the steps above
### at .fit() all parameters from the param grid will be set and evaluated
pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('classifier', RFE(estimator=PoissonRegressor()))])
pipe

### run it
gs_en_cv = GridSearchCV(pipe,
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle=True,
                                 random_state=123),
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True,
                        ### change verbose to a higher number for more printouts
                        ### about fitting, which can also verify that
                        ### all parameters you specify are getting fit
                        verbose=1)

gs_en_cv.fit(X, y)

print(f"best score is {gs_en_cv.best_score_}")
print("best params are")
print(gs_en_cv.best_params_)
print("good luck")
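best_params_ tells you how many features RFE kept, but not which ones. A minimal sketch of one way to recover their names is below. It fits a single small pipeline instead of the whole grid search, and the fixed values alpha=0.1 and n_features_to_select=3 are arbitrary choices for the illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(as_frame=True, return_X_y=True)

# Same step names as in the example above, with arbitrary fixed settings
pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('classifier', RFE(estimator=PoissonRegressor(alpha=0.1),
                                          n_features_to_select=3))])
pipe.fit(X, y)

# support_ is a boolean mask over the input columns: True means "kept"
rfe = pipe.named_steps['classifier']
selected = list(X.columns[rfe.support_])
print(selected)
```

After the grid search above, the same mask should be available as gs_en_cv.best_estimator_.named_steps['classifier'].support_, since GridSearchCV refits the best pipeline by default.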
(an example where feature selection exposes .transform())
from sklearn.feature_selection import f_regression

pipeline_steps = {'transform': ['ss'],
                  'select': ['kbest'],  ### if you have another feature selector that exposes .transform(), add it to this list and to all_param_grids; that produces one grid for transform -> select[1] -> classifier and another for transform -> select[2] -> classifier
                  'classifier': ['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),
                          'with_mean': [True, False]
                          },
                   ### the default score_func of SelectKBest is f_classif, which is meant for
                   ### classification targets, so f_regression is passed explicitly here
                   'kbest': {'object': SelectKBest(score_func=f_regression),
                             'k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  ### change this to [1] to see how it influences the score during grid search and to validate that it influences the next step
                             },
                   'pr': {'object': PoissonRegressor(verbose=2),
                          'alpha': [0.1, 0.25, 0.5, 0.75, 1],  ### tuning your model's hyperparams
                          'fit_intercept': [True, False],  ### tuning your model's hyperparams
                          }
                   }

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list

pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('select', SelectKBest(score_func=f_regression)),  ### again, even if you list two selectors in your param grid, you only need one here as an initializer for the pipeline
                       ('classifier', PoissonRegressor())])
pipe

### run it
gs_en_cv = GridSearchCV(pipe,
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle=True,
                                 random_state=123),
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True,
                        ### change verbose to a higher number for more printouts
                        ### about fitting, which can also verify that
                        ### all parameters you specify are getting fit
                        verbose=1)

gs_en_cv.fit(X, y)

print(f"best score is {gs_en_cv.best_score_}")
print("best params are")
print(gs_en_cv.best_params_)
print("good luck")
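As with RFE, you will probably want to know which columns SelectKBest actually kept. A minimal sketch, fitting the selector on its own with an arbitrary k=3 and with f_regression as the scoring function (my choice for a regression target, not something the grid above requires):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(as_frame=True, return_X_y=True)

# Keep the 3 columns with the highest univariate F-scores
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)

# get_support() returns a boolean mask over the input columns
kept = list(X.columns[selector.get_support()])
print(kept)
```

Inside the fitted grid search, the equivalent call would be gs_en_cv.best_estimator_.named_steps['select'].get_support().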
### brute force: run a grid search on every possible combination of features
pipeline_steps = {'transform': ['ss'],
                  'classifier': ['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),
                          'with_mean': [True, False]
                          },
                   'pr': {'object': PoissonRegressor(verbose=2),
                          'alpha': [0.1, 0.25, 0.5, 0.75, 1],  ### tuning your model's hyperparams
                          'fit_intercept': [True, False],  ### tuning your model's hyperparams
                          }
                   }

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list

pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('classifier', PoissonRegressor())])
pipe

feature_combo = []  ### record feature combination
score = []  ### record GridSearchCV best score
params = []  ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff) + 1)):  ### tqdm shows an overall progress bar here
    for subset in itertools.combinations(stuff, L):  ### create all possible combinations of L features
        ### run it
        gs_en_cv = GridSearchCV(pipe,
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle=True,
                                         random_state=123),
                                scoring='neg_root_mean_squared_error',
                                return_train_score=True,
                                ### change verbose to a higher number for more printouts
                                ### about fitting, which can also verify that
                                ### all parameters you specify are getting fit
                                verbose=0)

        fitted = gs_en_cv.fit(X[list(subset)], y)

        score.append(fitted.best_score_)  ### append results
        params.append(fitted.best_params_)  ### append results
        feature_combo.append(list(subset))  ### append results

### assemble your dataframe, sort it, and print out the top feature combo and model params
df = pd.DataFrame({'feature_combo': feature_combo,
                   'score': score,
                   'params': params})

df.sort_values(by='score', ascending=False, inplace=True)
df.head(1)
df.head(1).params.iloc[0]
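Be aware of how quickly this exhaustive search grows. A back-of-the-envelope sketch for the setup above (10 features in the diabetes data, the grid of with_mean x alpha x fit_intercept, and 3 folds); the variable names are only for illustration:

```python
from math import comb

n_features = 10  # columns in the diabetes data set

# Every non-empty subset of the features: 2**10 - 1 = 1023
n_subsets = sum(comb(n_features, k) for k in range(1, n_features + 1))

n_param_combos = 2 * 5 * 2  # with_mean x alpha x fit_intercept = 20
n_folds = 3

# Cross-validation fits, not counting the final refit of each grid search
total_cv_fits = n_subsets * n_param_combos * n_folds  # 61380
print(n_subsets, total_cv_fits)
```

Each extra feature roughly doubles the number of subsets, so this approach is only practical for a small number of columns.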
PS
For interactions (I guess you mean creating new features by combining the original ones?), I would just create those interaction features beforehand and include them in the data you pass to .fit(). Otherwise, if you build the interactions AFTER selecting a subset of features, how do you know you got the best interaction features? Why not create the interactions from the start and let the feature selection portion of the grid search tell you what's best?
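One possible way to create the interactions up front is PolynomialFeatures with interaction_only=True. This is a minimal sketch on made-up toy data; the column names a, b and c are hypothetical, and the hand-built name list assumes the output columns are the originals followed by the pairwise products in itertools.combinations order.

```python
import itertools

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up toy data with three hypothetical columns a, b, c
X_toy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
names = ["a", "b", "c"]

# degree=2 with interaction_only=True adds the pairwise products a*b, a*c, b*c
# and skips the squares; include_bias=False drops the constant column
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X_toy)

# Assumed column order: original columns first, then the pairwise products
inter_names = names + [f"{p}*{q}" for p, q in itertools.combinations(names, 2)]

print(inter_names)
print(X_inter)
```

The expanded matrix (wrapped in a DataFrame with these names, if you want to use the brute force loop) can then be passed to any of the approaches above in place of X.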