How can I make GridSearchCV work with a custom transformer in my pipeline?

Question

If I exclude my custom transformer, GridSearchCV runs fine, but with it included, it errors. Here is a fake dataset:

import pandas
import numpy
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
import sklearn_pandas
from sklearn.preprocessing import MinMaxScaler

df = pandas.DataFrame({"Letter":["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"],
                       "Number":[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4], 
                       "Label":["G","G","B","B","G","G","B","B","G","G","B","B","G","G","B","B"]})

class MyTransformer(TransformerMixin):

    def transform(self, x, **transform_args):
        x["Number"] = x["Number"].apply(lambda row: row*2)
        return x

    def fit(self, x, y=None, **fit_args):
        return self

x_train = df
y_train = x_train.pop("Label")    

mapper = DataFrameMapper([
    ("Number", MinMaxScaler()),
    ("Letter", LabelBinarizer()),
    ])

pipe = Pipeline([
    ("custom", MyTransformer()),
    ("mapper", mapper),
    ("classifier", RandomForestClassifier()),
    ])


param_grid = {"classifier__min_samples_split":[10,20], "classifier__n_estimators":[2,3,4]}

model_grid = sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=2, scoring="accuracy")

model_grid.fit(x_train, y_train)

and the error is

list indices must be integers, not str

How can I make GridSearchCV work while there is a custom transformer in my pipeline?

Answer 1

Short version: pandas and scikit-learn's cross-validation methods didn't work well together this way (in my version, 0.15); this may be fixed simply by updating scikit-learn to 0.16/stable or 0.17/dev.

The GridSearchCV class validates the data and converts it to an array (so that it can perform CV splits correctly). So you don't get to use pandas DataFrame features inside the built-in cross-validation loops.

If you want to do this kind of thing, you will have to write your own cross-validation routines that skip that validation.

EDIT: This is my experience with scikit-learn's cross-validation routines, and it is why sklearn-pandas provides cross_val_score. However, so far as I can tell, GridSearchCV is not specialized by sklearn-pandas; your import of it accidentally picks up the default sklearn version. Therefore, you may have to implement your own grid search using ParameterGrid and sklearn-pandas's cross_val_score.
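
A hand-rolled grid search of the kind suggested here could look roughly like the following. This is a minimal sketch and not the answerer's code: it assumes a recent scikit-learn, where ParameterGrid and StratifiedKFold live in sklearn.model_selection, and it slices the folds with `.iloc` rather than calling sklearn-pandas's cross_val_score, so every fold reaches the pipeline as a DataFrame. The names `DoubleNumber` and `manual_grid_search` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, StratifiedKFold
from sklearn.pipeline import Pipeline


class DoubleNumber(BaseEstimator, TransformerMixin):
    """Hypothetical DataFrame-only transformer, standing in for MyTransformer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Selecting a column by name only works while X is still a DataFrame.
        return (X[["Number"]] * 2).to_numpy()


def manual_grid_search(estimator, param_grid, X, y, n_splits=2, seed=0):
    """Hypothetical helper: score every parameter combination on
    cross-validation folds that are sliced as DataFrames via .iloc."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = []
    for params in ParameterGrid(param_grid):
        fold_scores = []
        for train_idx, test_idx in cv.split(X, y):
            # A fresh, unfitted copy of the pipeline for each fold.
            model = clone(estimator).set_params(**params)
            model.fit(X.iloc[train_idx], y.iloc[train_idx])
            fold_scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))
        results.append((float(np.mean(fold_scores)), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_score, best_params, results


df = pd.DataFrame({
    "Letter": list("abcd") * 4,
    "Number": [1, 2, 3, 4] * 4,
    "Label": ["G", "G", "B", "B"] * 4,
})
y = df.pop("Label")
pipe = Pipeline([
    ("custom", DoubleNumber()),
    ("classifier", RandomForestClassifier(random_state=0)),
])
param_grid = {"classifier__n_estimators": [2, 3, 4]}
best_score, best_params, results = manual_grid_search(pipe, param_grid, df, y)
```

`results` holds one `(mean accuracy, params)` pair per combination, so the full grid can be inspected and not only the winner.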

Answer 2

I know this answer comes rather late, but I've encountered the same behavior with sklearn and classes derived from BaseSearchCV. The problem actually seems to stem from the _PartitionIterator class in the sklearn cross_validation module: it assumes that everything emitted by every TransformerMixin class in the pipeline is array-like, so it generates index arrays that are used to index the incoming X in an array-like manner. Here's the __iter__ method:

def __iter__(self):
    ind = np.arange(self.n)
    for test_index in self._iter_test_masks():
        train_index = np.logical_not(test_index)
        train_index = ind[train_index]
        test_index = ind[test_index]
        yield train_index, test_index

And the BaseSearchCV grid search base class calls cross_validation's _fit_and_score, which uses a method called safe_split. Here's the relevant line:

X_subset = [X[idx] for idx in indices]

This will absolutely produce unexpected results if X is a pandas DataFrame, which is what you're emitting from your transform function.
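
To see why, here is a small sketch with made-up data (the frame and the mask are assumptions for illustration, not taken from sklearn). It builds integer indices from a boolean mask the way __iter__ does, then applies the `X[idx]` pattern to both an array and a DataFrame.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"Letter": ["a", "b", "c", "d"], "Number": [1, 2, 3, 4]})

# Turn a boolean test mask into integer train/test indices, as in __iter__.
ind = np.arange(len(X))
test_mask = np.array([True, False, False, True])
train_index = ind[np.logical_not(test_mask)]
test_index = ind[test_mask]

# On a NumPy array, X[idx] selects a row, so the list comprehension works.
array_subset = [X.to_numpy()[idx] for idx in train_index]

# On a DataFrame, X[idx] is a column lookup by label, and this frame has
# no column named 1 or 2, so the same expression raises KeyError.
try:
    frame_subset = [X[idx] for idx in train_index]
    failed = False
except KeyError:
    failed = True

# Selecting rows by position on a DataFrame has to go through .iloc.
train_frame = X.iloc[train_index]
```

With integer column labels the lookup would succeed and silently return columns instead of rows, which is arguably worse than an exception.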

There are two ways I've found to fix this:

Make sure to return an array from your transformer:
```
 return x.as_matrix() 
```
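
In context, a transformer following this first fix might look like the sketch below. `ArrayReturningTransformer` is a hypothetical name, and the code assumes pandas 0.24 or later for `to_numpy()`; `as_matrix()` was removed in pandas 1.0, and `.values` is the spelling that works across versions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ArrayReturningTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant of MyTransformer that emits a NumPy array."""

    def fit(self, x, y=None, **fit_args):
        return self

    def transform(self, x, **transform_args):
        x = x.copy()  # leave the caller's DataFrame untouched
        x["Number"] = x["Number"] * 2
        # Only the numeric column is kept, so the result is a numeric array.
        return x[["Number"]].to_numpy()


frame = pd.DataFrame({"Letter": ["a", "b"], "Number": [1, 2]})
out = ArrayReturningTransformer().fit_transform(frame)
```

The trade-off is that later steps can no longer select columns by name, so a name-based step such as the DataFrameMapper in the question would presumably have to come before this transformer rather than after it.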
This is a hack. If the pipeline of transformers requires the input to the next transformer to be a DataFrame, as was my case, you can write a utilities script that is essentially the same as the sklearn grid_search module, but includes some validation methods that are called in the _fit method of the BaseSearchCV class:
```
import numpy as np
import pandas as pd


def _validate_X(X):
    """Returns X if X isn't a pandas frame, otherwise
    the underlying matrix in the frame. """
    # as_matrix() was removed in pandas 1.0; X.values works on old and new versions
    return X if not isinstance(X, pd.DataFrame) else X.as_matrix()


def _validate_y(y):
    """Returns y if y isn't a series, otherwise the array"""
    if y is None:
        return y

    # if it's a series
    elif isinstance(y, pd.Series):
        return np.array(y.tolist())

    # if it's a dataframe:
    elif isinstance(y, pd.DataFrame):
        # check it's X dims
        if y.shape[1] > 1:
            raise ValueError('matrix provided as y')
        return y[y.columns[0]].tolist()

    # bail and let the sklearn function handle validation
    return y
```

As an example, here's my "custom grid_search module".


Answer 1
0 2015-07-02 18:50:26

Answer 2
0 Accepted 2016-05-24 19:09:10
