How do I make GridSearchCV work with a custom transformer in my pipeline?
If I exclude my custom transformer, GridSearchCV runs fine, but with it included, it errors. Here is a fake dataset:
import pandas
import numpy
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
import sklearn_pandas
from sklearn.preprocessing import MinMaxScaler
df = pandas.DataFrame({"Letter":["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"],
                       "Number":[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4],
                       "Label":["G","G","B","B","G","G","B","B","G","G","B","B","G","G","B","B"]})
class MyTransformer(TransformerMixin):
    def transform(self, x, **transform_args):
        x["Number"] = x["Number"].apply(lambda row: row*2)
        return x
    def fit(self, x, y=None, **fit_args):
        return self
x_train = df
y_train = x_train.pop("Label")
mapper = DataFrameMapper([
    ("Number", MinMaxScaler()),
    ("Letter", LabelBinarizer()),
])
pipe = Pipeline([
    ("custom", MyTransformer()),
    ("mapper", mapper),
    ("classifier", RandomForestClassifier()),
])
param_grid = {"classifier__min_samples_split":[10,20], "classifier__n_estimators":[2,3,4]}
model_grid = sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=2, scoring="accuracy")
model_grid.fit(x_train, y_train)
and the error is:
list indices must be integers, not str
How can I make GridSearchCV work while there is a custom transformer in my pipeline?
Short version: pandas and scikit-learn's cross-validation methods did not work well together in my version (0.15); this may be fixed simply by updating scikit-learn to 0.16/stable or 0.17/dev.
The GridSearchCV class validates the data and converts it to an array (so that it can perform CV splits correctly). So you don't get to use pandas DataFrame features inside the built-in cross-validation loops. You will have to write your own cross-validation routines that skip this validation if you want to do this kind of thing.
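To illustrate why the conversion matters, here is a minimal sketch, not the actual validation code: `double_number` is a hypothetical stand-in for the column-based logic in `MyTransformer.transform`, and the conversion to plain rows only imitates what happens once the frame is no longer a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"Letter": ["a", "b", "c", "d"], "Number": [1, 2, 3, 4]})


def double_number(x):
    # Same column-based lookup as MyTransformer.transform, without mutating x.
    return x["Number"] * 2


# Works while x is still a DataFrame.
doubled = double_number(df).tolist()

# Once the frame has been turned into plain rows, the column label is gone.
rows = df.values.tolist()
try:
    double_number(rows)
    error_type = None
except TypeError as exc:
    # Indexing a list with a string raises TypeError,
    # the same kind of error reported in the question.
    error_type = type(exc).__name__
```

Any transformer that relies on column names will fail in the same way as soon as it receives something other than a DataFrame.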
EDIT: This is my experience with scikit-learn's cross-validation routines. It is why sklearn-pandas provides cross_val_score. However, so far as I can tell, GridSearchCV is not specialized by sklearn-pandas; your import of it accidentally imports the default sklearn version. Therefore, you may have to implement your own grid search using ParameterGrid and sklearn-pandas's cross_val_score.
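A hand-rolled grid search along those lines might look like the sketch below. It assumes a recent scikit-learn, where `ParameterGrid` and `cross_val_score` live in `sklearn.model_selection`; scikit-learn's own `cross_val_score` is used here as a stand-in for the sklearn-pandas one mentioned above. The helper name `manual_grid_search` and the synthetic data are illustrative only.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score


def manual_grid_search(estimator, param_grid, X, y, cv=3, scoring="accuracy"):
    """Return (best_params, best_score) over every combination in param_grid."""
    best_params, best_score = None, float("-inf")
    for params in ParameterGrid(param_grid):
        # Fresh, unfitted copy of the estimator for each candidate.
        candidate = clone(estimator).set_params(**params)
        score = cross_val_score(candidate, X, y, cv=cv, scoring=scoring).mean()
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic data standing in for the question's dataset.
X, y = make_classification(n_samples=60, n_features=4, random_state=0)
param_grid = {"min_samples_split": [10, 20], "n_estimators": [2, 3, 4]}
best_params, best_score = manual_grid_search(
    RandomForestClassifier(random_state=0), param_grid, X, y
)
```

With a pipeline, the same loop applies; only the parameter names change to the `step__param` form used in the question.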
I know this answer comes rather late, but I've encountered the same behavior with sklearn and BaseSearchCV derivative classes. The problem actually seems to stem from the _PartitionIterator class in the sklearn cross_validation module, as it assumes that everything emitted from every TransformerMixin class in the pipeline is going to be array-like, and thus it generates arrays of indices that are used to index the incoming X args in an array-like manner. Here's the __iter__ method:
def __iter__(self):
    ind = np.arange(self.n)
    for test_index in self._iter_test_masks():
        train_index = np.logical_not(test_index)
        train_index = ind[train_index]
        test_index = ind[test_index]
        yield train_index, test_index
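To see what that method produces, here is a standalone sketch of the same mask-to-index conversion. The functions `iter_test_masks` and `iter_splits` are hypothetical reimplementations for illustration (contiguous folds, no shuffling), not the library's code.

```python
import numpy as np


def iter_test_masks(n, n_folds):
    """Yield one boolean test mask per fold (contiguous folds, no shuffling)."""
    fold_sizes = np.full(n_folds, n // n_folds)
    fold_sizes[: n % n_folds] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        mask = np.zeros(n, dtype=bool)
        mask[start:start + size] = True
        yield mask
        start += size


def iter_splits(n, n_folds):
    """Mirror the quoted __iter__: turn each mask into integer positions."""
    ind = np.arange(n)
    for test_mask in iter_test_masks(n, n_folds):
        yield ind[np.logical_not(test_mask)], ind[test_mask]


splits = [(train.tolist(), test.tolist()) for train, test in iter_splits(6, 3)]
```

Each split is a pair of integer position arrays, which is why the downstream code expects X to support positional, array-style indexing.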
And the BaseSearchCV grid search base class calls cross_validation's _fit_and_score, which uses a method called safe_split. Here's the relevant line:
X_subset = [X[idx] for idx in indices]
This will absolutely produce unexpected results if X is a pandas DataFrame, which is what you're emitting from your transform function.
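A small sketch of the mismatch, using a toy frame with the question's column names: `X[idx]` on a DataFrame is a column-label lookup, so integer positions do not select rows. Positional selection with `iloc` is shown for contrast; it is not what the quoted library line does.

```python
import pandas as pd

X = pd.DataFrame({"Letter": ["a", "b", "c", "d"], "Number": [1, 2, 3, 4]})
indices = [0, 2]

# X[idx] on a DataFrame looks up a column label, not a row position.
try:
    X_subset = [X[idx] for idx in indices]
    failure = None
except KeyError as exc:
    failure = type(exc).__name__

# Positional row selection that keeps the DataFrame intact.
X_rows = X.iloc[indices]
```

Here the lookup fails with a KeyError because no column is labeled 0; with integer column labels it would silently return columns instead of rows.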
There are two ways I've found to fix this:
Make sure to return an array from your transformer:
return x.as_matrix()
This is a hack. If the pipe of transformers demands the input to the next transformer be a DataFrame, as was my case, you can write a utilities script that is essentially the same as the sklearn grid_search module, but includes some clever validation methods that are called in the _fit method of the BaseSearchCV class:
import numpy as np
import pandas as pd

def _validate_X(X):
    """Returns X if X isn't a pandas frame, otherwise
    the underlying matrix in the frame. """
    # note: as_matrix() was removed in pandas 1.0; use X.values there
    return X if not isinstance(X, pd.DataFrame) else X.as_matrix()

def _validate_y(y):
    """Returns y if y isn't a series, otherwise the array"""
    if y is None:
        return y
    # if it's a series
    elif isinstance(y, pd.Series):
        return np.array(y.tolist())
    # if it's a dataframe:
    elif isinstance(y, pd.DataFrame):
        # check it's X dims
        if y.shape[1] > 1:
            raise ValueError('matrix provided as y')
        return y[y.columns[0]].tolist()
    # bail and let the sklearn function handle validation
    return y
As an example, here's my "custom grid_search module".
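For the first fix, a transformer that hands back an array could look like the sketch below. `ArrayReturningTransformer` is a hypothetical variant of the question's `MyTransformer`; it uses `.values` rather than `as_matrix()` so that it also runs on current pandas, and it copies the input instead of modifying it in place. Note that a following step that needs column names, such as the `DataFrameMapper` in the question, would not accept this output.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ArrayReturningTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant of MyTransformer that returns a NumPy array."""

    def fit(self, x, y=None, **fit_args):
        # Nothing to learn; present only to satisfy the transformer interface.
        return self

    def transform(self, x, **transform_args):
        out = x.copy()  # avoid mutating the caller's frame
        out["Number"] = out["Number"] * 2
        return out.values  # plain array, as the first fix suggests


frame = pd.DataFrame({"Letter": ["a", "b"], "Number": [1, 2]})
result = ArrayReturningTransformer().fit_transform(frame)
```

Copying first also means repeated calls during cross-validation do not keep doubling the same column.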