
scikit-learn FeatureUnion gridsearch over subsets of features

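How do I use FeatureUnion in scikit-learn so that a grid search can treat its parts as optional?

The code below works and sets up a FeatureUnion that contains a TfidfVectorizer for words and a vectorizer for characters.

When running the grid search, in addition to testing the parameter space defined so far, I would also like to test only 'vect__wordvect' with its ngram_range parameter (without the character vectorizer), and only 'vect__lettervect' with the lowercase parameter set to True and False, with the other vectorizer disabled.

EDIT: complete code example based on maxymoo's suggestion.

How can this be done?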

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.datasets import fetch_20newsgroups

# set up the FeatureUnion: one vectorizer for words, one for characters
wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
featureunionvect = FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])

# set up the pipeline
classifier = LogisticRegression(class_weight='balanced')
pipeline = Pipeline([('vect', featureunionvect), ('classifier', classifier)])

# grid search parameters
parameters = {
            'vect__wordvect__ngram_range': [(1, 1), (1, 2)],  # commenting out these two lines
            'vect__lettervect__lowercase': [True, False],     # runs, but there is no parameterization anymore
            'vect__transformer_list': [[('wordvect', wordvect)],
                                        [('lettervect', lettervect)],
                                        [('wordvect', wordvect), ('lettervect', lettervect)]]}

# data
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])

# grid search with cross-validation
gs_clf = GridSearchCV(pipeline, parameters)
gs_clf = gs_clf.fit(newsgroups_train.data, newsgroups_train.target)

# grid_scores_ was removed along with sklearn.grid_search; cv_results_ holds the same information
for params, mean_score in zip(gs_clf.cv_results_['params'], gs_clf.cv_results_['mean_test_score']):
    print("gridsearch scores: ", params, mean_score)
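
The reason the two nested-parameter lines clash with 'vect__transformer_list' can be reproduced without any data. The following is a minimal sketch, assuming a recent scikit-learn release; try_set_params is a hypothetical helper introduced only for this illustration. Once the transformer list no longer contains a step, a nested parameter that addresses that step by name cannot be resolved.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
union = FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])


def try_set_params(estimator, **params):
    """Hypothetical helper: return None on success, else the error message."""
    try:
        estimator.set_params(**params)
        return None
    except ValueError as exc:
        return str(exc)


# Both steps are present, so the nested parameter resolves.
ok = try_set_params(union, wordvect__ngram_range=(1, 2))

# The list is replaced by one that only holds 'lettervect', and the same
# call still addresses 'wordvect': set_params rejects the combination.
error = try_set_params(
    union,
    transformer_list=[("lettervect", lettervect)],
    wordvect__ngram_range=(1, 2),
)

print("first call error:", ok)
print("second call error:", error)
```

A grid that crosses every transformer_list value with every nested parameter necessarily contains such combinations, which would explain why the search only runs with those two lines commented out.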

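FeatureUnion has a parameter called transformer_weights that you can grid search over. A weight of 0 multiplies that transformer's output by zero, which effectively disables it while its step stays in the union, so the nested parameters remain valid. In your case the grid search parameters would become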

parameters = {'vect__wordvect__ngram_range': [(1, 1), (1, 2)],
              'vect__lettervect__lowercase': [True, False],
              'vect__transformer_weights': [{"lettervect":1,"wordvect":0}, 
                                            {"lettervect":0,"wordvect":1}, 
                                            {"lettervect":1,"wordvect":1}]}
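
The grid above crosses every weight setting with every nested parameter, so it also varies ngram_range while the word vectorizer is weighted to zero. A possible refinement, sketched here under assumptions, is to pass GridSearchCV a list of grids so each weight setting is only combined with the parameters that matter for it. The toy corpus, its labels, and cv=2 are made up for illustration and stand in for the 20 newsgroups data; note that a zero-weighted transformer is still fitted, it just contributes all-zero features.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
pipeline = Pipeline([
    ('vect', FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])),
    ('classifier', LogisticRegression(class_weight='balanced')),
])

ngram_ranges = [(1, 1), (1, 2)]
lowercase_options = [True, False]

# One sub-grid per subset of features; each only varies the relevant parameters.
param_grid = [
    # words only
    {'vect__transformer_weights': [{"lettervect": 0, "wordvect": 1}],
     'vect__wordvect__ngram_range': ngram_ranges},
    # characters only
    {'vect__transformer_weights': [{"lettervect": 1, "wordvect": 0}],
     'vect__lettervect__lowercase': lowercase_options},
    # both
    {'vect__transformer_weights': [{"lettervect": 1, "wordvect": 1}],
     'vect__wordvect__ngram_range': ngram_ranges,
     'vect__lettervect__lowercase': lowercase_options},
]

# Made-up toy corpus (assumption), four documents per class.
docs = [
    "The rocket reached orbit today", "NASA launched a new satellite",
    "The shuttle docked with the station", "Astronauts repaired the telescope",
    "He argued about belief and reason", "The debate covered faith and doubt",
    "They discussed religion and morality", "She questioned the existence of gods",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

gs_clf = GridSearchCV(pipeline, param_grid, cv=2)
gs_clf.fit(docs, labels)

# 2 + 2 + 4 = 8 candidates, versus 2 * 2 * 3 = 12 for the single crossed grid.
n_candidates = len(gs_clf.cv_results_['params'])
print(n_candidates)
```

Whether the smaller grid is worth the extra verbosity depends on how expensive each fit is; with the single dictionary the redundant candidates simply produce duplicate scores.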

