scikit-learn pipeline

Question

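Each sample in my (iid) dataset looks like this:
x = [a_1, a_2 ... a_N, b_1, b_2 ... b_M]

I also have a label for each sample (this is supervised learning).

The a features are very sparse (a bag-of-words representation), while the b features are dense (integers, about 45 of them).

I am using scikit-learn, and I want to use GridSearchCV with a pipeline.

The question: is it possible to use one CountVectorizer on the type a features and another CountVectorizer on the type b features?

What I want can be thought of as: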

pipeline = Pipeline([
    ('vect1', CountVectorizer()), #will work only on features [0,(N-1)]
    ('vect2', CountVectorizer()), #will work only on features [N,(N+M-1)]
    ('clf', SGDClassifier()), #will use all features to classify
])

parameters = {
    'vect1__max_df': (0.5, 0.75, 1.0),       # type a features only
    'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features only
    'vect2__max_df': (0.5, 0.75, 1.0),       # type b features only
    'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features only
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),  # this parameter was called n_iter in older scikit-learn releases
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(X, y)

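Is that possible?

@Andreas Mueller suggested a good idea. However, I also want to keep the original features that were not selected, so I cannot tell upfront (before the pipeline starts) the column indices for each stage of the pipeline.

For example, if I set CountVectorizer(max_df=0.75), it may drop some terms, and the original column indices will change.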
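To make the index shift concrete, here is a minimal sketch on a made-up four-document corpus (the documents are purely illustrative, not data from the question). The term "apple" appears in every document, so max_df=0.75 prunes it, and the remaining terms move to new column positions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration): "apple" occurs in all documents.
docs = ["apple banana", "apple cherry", "apple date", "apple egg"]

full = CountVectorizer().fit(docs)
pruned = CountVectorizer(max_df=0.75).fit(docs)  # drops terms found in more than 75% of documents

n_full = len(full.vocabulary_)
n_pruned = len(pruned.vocabulary_)

# vocabulary_ maps each term to its output column index.
print(n_full, n_pruned)
print(full.vocabulary_["banana"], pruned.vocabulary_["banana"])
```

The column that holds "banana" differs between the two vectorizers, which is why hard-coded column indices downstream of a vectorizer are fragile.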

Thanks

Answer 1

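Unfortunately, this is not as nice as it could be at the moment. You need to use FeatureUnion to concatenate the features, and the transformer in each branch needs to select its features and transform them. One way to do that is to build a pipeline from a transformer that selects the columns (you need to write that one yourself) and the CountVectorizer. There is an example that does something similar. That example actually separates the features into different values in a dictionary, but you don't need to do that. Also have a look at the related issue on selecting columns, which contains code for the transformer you need.

With the current code it would look something like this: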

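from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# FeatureSelector is the column-selecting transformer you have to write yourself
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())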
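For completeness, here is a minimal runnable sketch of that idea. Everything specific in it is an assumption made for illustration: FieldSelector is a hypothetical stand-in for the selector you would write yourself, branch is a small hypothetical helper, the step names ("union", "a", "b", "select", "vect", "clf") are arbitrary, and the toy data assumes each sample is a pair of raw text fields, because CountVectorizer expects raw text as input.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pick one field out of every sample."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [sample[self.field] for sample in X]


def branch(field):
    # One branch of the union: select a field, then vectorize it.
    return Pipeline([("select", FieldSelector(field)),
                     ("vect", CountVectorizer())])


pipeline = Pipeline([
    ("union", FeatureUnion([("a", branch(0)), ("b", branch(1))])),
    ("clf", SGDClassifier(random_state=0)),
])

# Nested parameters are addressed as <step>__<branch>__<step>__<param>.
parameters = {
    "union__a__vect__ngram_range": [(1, 1), (1, 2)],
    "union__b__vect__binary": [False, True],
    "clf__alpha": [1e-5, 1e-4],
}

# Made-up toy data: (type a text, type b text) per sample.
X = [
    ("cheap pills online", "promo sale"),
    ("win money now", "promo offer"),
    ("free prize claim", "sale offer"),
    ("cheap offer today", "promo deal"),
    ("meeting agenda attached", "work notes"),
    ("project status report", "work update"),
    ("lunch with team", "office notes"),
    ("quarterly budget review", "office update"),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(X, y)
predictions = grid_search.best_estimator_.predict(X)
print(grid_search.best_params_)
```

Because each branch selects its field from the raw input, no stage needs to know the column indices produced by another vectorizer, and each vectorizer can be tuned separately through its own parameter prefix.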

Solution 1
5 Accepted 2015-06-01 13:33:12
