How to build a pipeline finding the best preprocessing per column in a fine-grained fashion?
In sklearn, we can use a column transformer within a pipeline to apply a preprocessing choice to specific columns, like this:
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, ...
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
# this is my x_data
x_data = pd.DataFrame(..., columns=['Variable1', 'Variable2', 'Variable3'])
pipeline = Pipeline(steps=[('preprocessing1', make_column_transformer((StandardScaler(), ['Variable1']),
                                                                      remainder='passthrough')),
                           ('preprocessing2', make_column_transformer((MaxAbsScaler(), ['Variable2']),
                                                                      remainder='passthrough')),
                           ('preprocessing3', make_column_transformer((MinMaxScaler(), ['Variable3']),
                                                                      remainder='passthrough')),
                           ('clf', MLPClassifier(...))
                           ])
We would then run GridSearchCV along these lines:
params = [{'preprocessing1': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()],  # <<<<<<<<<<<<< How???
           'preprocessing2': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()],  # <<<<<<<<<<<<< How???
           'preprocessing3': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()],  # <<<<<<<<<<<<< How???
           'clf__hidden_layer_sizes': [(100,), (200,)],
           'clf__solver': ['adam', 'lbfgs', 'sgd'],
           ...
           }]
cv = GridSearchCV(pipeline, params, cv=10, verbose=1, n_jobs=-1, refit=True)
What I want to do is find the best preprocessing for each predictor individually, since applying a single preprocessing to all predictors is usually not optimal.
The naming convention in pipelines is to use a double underscore __ to separate a step from its parameters. You can call pipeline.get_params() to list the pipeline's parameters and their current values.
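For instance (a minimal sketch using only the first step of your pipeline), get_params() exposes the scaler nested inside the column transformer under the name that make_column_transformer auto-generated for it:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

pipeline = Pipeline(steps=[
    ("preprocessing1", make_column_transformer((StandardScaler(), [0]),
                                               remainder="passthrough")),
    ("clf", MLPClassifier()),
])

# make_column_transformer names each transformer after its lowercased class,
# so the scaler is reachable as 'preprocessing1__standardscaler'
print([k for k in pipeline.get_params() if k.endswith("standardscaler")])
# ['preprocessing1__standardscaler']
```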
In your case, the parameter preprocessing1__standardscaler refers to the scaler defined in the first step of the pipeline, and this is the parameter that should be set during the GridSearchCV. The example below shows how to do this:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
X, y = make_classification(
    n_features=3, n_informative=3, n_redundant=0, random_state=42
)
pipeline = Pipeline(
    steps=[
        ("preprocessing1", make_column_transformer((StandardScaler(), [0]), remainder="passthrough")),
        ("preprocessing2", make_column_transformer((StandardScaler(), [1]), remainder="passthrough")),
        ("preprocessing3", make_column_transformer((StandardScaler(), [2]), remainder="passthrough")),
        ("clf", MLPClassifier()),
    ]
)
param_grid = {
    "preprocessing1__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "preprocessing2__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "preprocessing3__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=10, verbose=1, n_jobs=-1)
grid_search.fit(X, y)
grid_search.best_params_
This returns the following output:
{'preprocessing1__standardscaler': MinMaxScaler(),
'preprocessing2__standardscaler': StandardScaler(),
'preprocessing3__standardscaler': MaxAbsScaler()}
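As a side note (my own variant, not part of the answer above): the three chained column-transformer steps can be collapsed into a single ColumnTransformer holding one named sub-transformer per column. The step and transformer names here (prep, scale0, scale1, scale2) are my own choices; each prep__scaleN key in the grid then swaps the scaler for exactly one column, and the columns keep their relative order in the transformed output:

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(
    n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# One ColumnTransformer with an explicitly named scaler per column.
preprocess = ColumnTransformer([
    ("scale0", StandardScaler(), [0]),
    ("scale1", StandardScaler(), [1]),
    ("scale2", StandardScaler(), [2]),
])
pipeline = Pipeline([("prep", preprocess), ("clf", MLPClassifier())])

# Each key targets exactly one column's scaler.
param_grid = {
    "prep__scale0": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "prep__scale1": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "prep__scale2": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
```

This avoids stacking three transformers each with remainder='passthrough', which silently reorders the columns at every step (the transformed column is moved to the front of the output array).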