
scikit-learn pipeline


Each sample in my (iid) dataset looks like this:
x = [a_1, a_2, ..., a_N, b_1, b_2, ..., b_M]

I also have the label of each sample (this is supervised learning).

The a features are very sparse (a bag-of-words representation), while the b features are dense (integers; there are about 45 of them).

I am using scikit-learn, and I want to use GridSearchCV with a pipeline.

The question: is it possible to use one CountVectorizer on the type a features and another CountVectorizer on the type b features?

What I want can be thought of as:

pipeline = Pipeline([
    ('vect1', CountVectorizer()), #will work only on features [0,(N-1)]
    ('vect2', CountVectorizer()), #will work only on features [N,(N+M-1)]
    ('clf', SGDClassifier()), #will use all features to classify
])

parameters = {
    'vect1__max_df': (0.5, 0.75, 1.0),       # type a features only
    'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features only
    'vect2__max_df': (0.5, 0.75, 1.0),       # type b features only
    'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features only
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),  # called n_iter in old scikit-learn releases
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(X, y)

Is that possible?

A nice idea was presented by @Andreas Mueller. However, I want to keep the original non-chosen features as well, so I cannot know the column indices for each stage of the pipeline upfront (before the pipeline runs).

For example, if I set CountVectorizer(max_df=0.75), it may drop some terms, and the original column indices will change.
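To make the index shift concrete, here is a small sketch on a made-up four-document corpus (the documents are invented for illustration and are not from my data). The term "apple" appears in every document, so max_df=0.75 removes it and the remaining terms move to new column positions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = ["apple banana", "apple cherry", "apple banana date", "apple"]

# Vocabulary with no pruning versus pruning of terms found in >75% of documents.
full = CountVectorizer().fit(docs)
pruned = CountVectorizer(max_df=0.75).fit(docs)

# "apple" occurs in 4 of 4 documents, so the pruned vectorizer drops it.
print(sorted(full.vocabulary_))    # ['apple', 'banana', 'cherry', 'date']
print(sorted(pruned.vocabulary_))  # ['banana', 'cherry', 'date']

# The same term now maps to a different column index.
print(full.vocabulary_["banana"], pruned.vocabulary_["banana"])  # 1 0
```

So any column index computed before fitting can point at a different term afterwards.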

Thanks

Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate the two kinds of features, and the transformer in each branch needs to select its features and transform them. One way to do that is to build a pipeline from a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features as different values in a dictionary, but you don't need to do that. Also have a look at the related issue for selecting columns, which contains code for the transformer that you need.
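A minimal sketch of such a column-selecting transformer might look like the following. The class name FeatureSelector and its columns parameter are hypothetical (this is not a scikit-learn class), and the sketch assumes the input is array-like with one row per sample. Passing a single integer returns a 1-D array, which is the shape CountVectorizer expects for raw documents; passing a list returns a 2-D array.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the requested column(s) of X."""

    def __init__(self, columns):
        # An int selects one column (1-D result); a list selects several (2-D).
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the selection is fixed by the constructor argument.
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.columns]


# Toy data, invented for illustration: two text-like columns per sample.
X_demo = np.array([["a b", "1 2"], ["c d", "3 4"]], dtype=object)
first_column = FeatureSelector(0).fit_transform(X_demo)
print(first_column.tolist())
```

Because it subclasses BaseEstimator, the selector can be cloned by GridSearchCV like any other pipeline step.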

It would look something like this with the current code:

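from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline, make_union

# FeatureSelector is the column-selecting transformer you write yourself
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())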
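To connect this back to the grid search in the question, here is a rough end-to-end sketch. It is not the exact code from the linked issue. It swaps the hand-written selector for scikit-learn's FunctionTransformer wrapping a hypothetical take_column function, names the steps explicitly so the parameter keys are readable, and assumes each feature group is stored as one text column, since CountVectorizer consumes raw strings. The toy data is invented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def take_column(X, column):
    """Hypothetical helper: return one column of X as a 1-D array."""
    return np.asarray(X)[:, column]


def text_branch(column):
    """Select one column, then vectorize it with its own CountVectorizer."""
    selector = FunctionTransformer(
        take_column, kw_args={"column": column}, validate=False)
    return Pipeline([("select", selector), ("vect", CountVectorizer())])


pipeline = Pipeline([
    ("union", FeatureUnion([("a", text_branch(0)), ("b", text_branch(1))])),
    ("clf", SGDClassifier(random_state=0)),
])

# Nested names follow <step>__<branch>__<step>__<parameter>.
parameters = {
    "union__a__vect__max_df": (0.5, 1.0),             # type a features only
    "union__b__vect__ngram_range": ((1, 1), (1, 2)),  # type b features only
    "clf__alpha": (1e-5, 1e-6),
}

# Toy data, invented for illustration: column 0 is text, column 1 holds codes.
X = np.array([
    ["cheap pills online", "c1 c2"], ["meeting agenda today", "c3 c4"],
    ["cheap watches online", "c1 c5"], ["project meeting notes", "c3 c6"],
    ["buy cheap pills", "c2 c5"], ["agenda for project", "c4 c6"],
    ["online pills offer", "c1 c2"], ["notes from today", "c3 c4"],
], dtype=object)
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Each branch fits its own vocabulary on its own column, so pruning terms in one vectorizer cannot shift the column indices that the other branch selects from.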

