Trouble changing imputer strategy in scikit-learn pipeline
I am trying to use GridSearchCV to select the best imputer strategy, but I am having trouble doing so.
First, I have a data preparation pipeline for numerical and categorical columns:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
# num_cols and cat_cols are lists of column names (see the pipeline printout further down)
num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))
preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
Next, I created a pipeline to train a support vector machine model with feature selection:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVR
model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])
Now I am trying to use GridSearchCV to see which imputer strategy works best for the missing values in the numerical columns:
from sklearn.model_selection import GridSearchCV
grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy":
        ['mean', 'median', 'most_frequent']}
grid_search = GridSearchCV(model, param_grid=grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
This is where I get the error. The full pipeline looks like this:
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['longitude', 'latitude',
                                                   'housing_median_age',
                                                   'total_rooms',
                                                   'total_bedrooms',
                                                   'population', 'households',
                                                   'median_income']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='NA',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['ocean_proximity'])])),
                ('feature_select',
                 SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
                ('regressor', SVR(C=30000.0, gamma=0.3))])
Can anyone tell me what I need to change in the grid search to make it work?
You specify the parameter with a dictionary. Each key combines the name of the estimator/transformer with the name of the parameter you want to change, and each value is the list of settings to try. For a pipeline, or a pipeline of pipelines, the key is the names of all the parent steps followed by the parameter name, joined by double underscores. So in your case it looks like this:
grid = {
    "preprocess__num__simpleimputer__strategy": ['median']
}
simpleimputer is simply the name that make_pipeline assigned automatically.
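If you are not sure what the double-underscore name is, model.get_params() lists every valid key. Below is a minimal sketch of that lookup followed by the grid search. It assumes a small synthetic DataFrame (the columns a, b, c and the random data are made up for illustration), drops the feature-selection step to keep it short, and leaves OneHotEncoder's sparse argument at its default so that it does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Hypothetical toy data standing in for the housing set in the question.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "a": rng.normal(size=60),
    "b": rng.normal(size=60),
    "c": rng.choice(["x", "y", "z"], size=60),
})
X.loc[::7, "a"] = np.nan  # inject some missing numeric values
y = rng.normal(size=60)

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy="constant", fill_value="NA"),
                         OneHotEncoder(handle_unknown="ignore"))
preprocessing = ColumnTransformer([
    ("num", num_pipe, ["a", "b"]),
    ("cat", cat_pipe, ["c"]),
])
model = Pipeline([("preprocess", preprocessing), ("regressor", SVR())])

# Every key returned by get_params() is a valid param_grid key.
strategy_keys = sorted(k for k in model.get_params()
                       if k.endswith("simpleimputer__strategy"))
print(strategy_keys)

# Search over the numeric imputer's strategy only.
grid = {"preprocess__num__simpleimputer__strategy":
        ["mean", "median", "most_frequent"]}
grid_search = GridSearchCV(model, param_grid=grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)
print(grid_search.best_params_)
```

The printed key list contains one entry for the num branch and one for the cat branch, so you can tune either imputer by picking the matching key.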
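However, I think there are other issues in your code. For example, fill_value='NA' looks like a misunderstanding: fill_value is not the marker that identifies missing entries (that is missing_values) but the value written in their place when strategy='constant'. It is also not strictly needed, because for string columns the constant strategy falls back to the placeholder "missing_value" when fill_value is omitted.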
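To make the difference between fill_value and missing_values concrete, here is a small sketch on a made-up one-column array (the category strings are only illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One categorical column with a genuinely missing entry (np.nan).
col = np.array([["INLAND"], [np.nan], ["NEAR BAY"]], dtype=object)

# fill_value is the replacement written where a value is missing.
filled = SimpleImputer(strategy="constant", fill_value="NA").fit_transform(col)
print(filled.ravel().tolist())

# Without fill_value, string data gets the default placeholder.
default = SimpleImputer(strategy="constant").fit_transform(col)
print(default.ravel().tolist())

# missing_values says which marker counts as missing in the first place.
# Here the string "NA" is treated as missing and replaced by the mode.
marked = np.array([["INLAND"], ["NA"], ["INLAND"]], dtype=object)
recoded = SimpleImputer(missing_values="NA",
                        strategy="most_frequent").fit_transform(marked)
print(recoded.ravel().tolist())
```

So if the raw data already uses the string 'NA' to mean "missing", missing_values is the parameter to set. If the entries are NaN and you just want a placeholder category, fill_value does that.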