
Trouble changing imputer strategy in scikit-learn pipeline

I am trying to use GridSearchCV to select the best imputer strategy, but I am having trouble doing so.

First, I have a data preparation pipeline for numerical and categorical columns:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline

num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'), 
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

Next, I created a pipeline to train a support vector machine model with feature selection.

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVR

model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])

Now, I am trying to find out which imputer strategy works best for filling in missing values in the numerical columns, using GridSearchCV:

grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy": 
        ['mean','median','most_frequent']}
grid_search = GridSearchCV(model, param_grid = grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

This is where I am getting the error. The full pipeline looks like this:

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['longitude', 'latitude',
                                                   'housing_median_age',
                                                   'total_rooms',
                                                   'total_bedrooms',
                                                   'population', 'households',
                                                   'median_income']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='NA',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['ocean_proximity'])])),
                ('feature_select',
                 SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
                ('regressor', SVR(C=30000.0, gamma=0.3))])

Can anyone tell me what I need to change in the grid search to make it work?

You specify the parameter via a dictionary that maps a key, built from the name of the estimator/transformer and the name of the parameter you want to change, to the list of values you want to try. If you have a pipeline or a pipeline of pipelines, the key is the names of all the parent steps followed by the parameter name, joined with double underscores. So for your case, it looks like

grid = {
    "preprocess__num__simpleimputer__strategy":['median']
}

simpleimputer is simply the name that make_pipeline assigned automatically (the lowercased class name).
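If you are unsure of the exact key, one way to find it is to list the pipeline's parameters with get_params() and look for the entry you need. The sketch below is only an illustration: it uses a small synthetic dataset and a reduced pipeline (no categorical branch, no feature selection, default SVR settings), and the column names a, b and c are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical; not the housing data from the question).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
y = X["a"] * 2.0 + rng.normal(scale=0.1, size=60)
X.iloc[::7, 1] = np.nan  # inject some missing values into column "b"

# Reduced version of the pipeline from the question.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
preprocessing = ColumnTransformer([("num", num_pipe, ["a", "b", "c"])])
model = Pipeline([("preprocess", preprocessing), ("regressor", SVR())])

# Every key returned by get_params() is a valid param_grid key.
strategy_keys = [k for k in model.get_params() if k.endswith("strategy")]
print(strategy_keys)

# Try several imputation strategies through the double-underscore path.
grid = {
    "preprocess__num__simpleimputer__strategy": ["mean", "median", "most_frequent"]
}
grid_search = GridSearchCV(model, param_grid=grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)
print(grid_search.best_params_)
```

The same key should work unchanged with the full pipeline from the question, since the step names preprocess, num and simpleimputer are the same there.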

However, I think there may be other issues in your code. For example, fill_value='NA' may not be doing what you intend and is not strictly needed: it is not the value to be replaced but the value used to fill in the missing entries.
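To make the difference concrete, here is a minimal sketch of how SimpleImputer treats the two arguments. The toy column is invented for illustration; missing_values is left at its default of np.nan.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column with one missing entry (illustrative data).
X = np.array([["INLAND"], [np.nan], ["NEAR BAY"]], dtype=object)

# missing_values (default np.nan) says which entries count as missing;
# fill_value says what to write in their place when strategy="constant".
imputer = SimpleImputer(strategy="constant", fill_value="NA")
filled = imputer.fit_transform(X)
print(filled.ravel().tolist())
```

If the data instead used a string such as 'NA' to mark missing entries, that marker would go in missing_values rather than fill_value.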
