How to do parameter tuning/cross-validation with Sklearn's pipeline?

I have just discovered Sklearn's pipeline feature which I think will be useful for sentiment analysis. I have defined my pipeline in the following way:

Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(),
                         stop_words='english',
                         strip_accents='unicode',
                         max_df=0.5)),
          ('clf', MultinomialNB())])

However, defining it this way leaves no room for parameter tuning. Let's say I want to try max_df values of [0.3, 0.4, 0.5, 0.6, 0.7] and n-gram ranges of [(1,1), (1,2), (2,2)], and use cross-validation to find the best combination. Is there a way to specify this in or outside the pipeline so that it considers all possible combinations? If so, how would this be done?

Thank you so much for your guidance and help!

You can set the parameters of individual steps in a pipeline with the set_params method, passing each key as <stepname>__<paramname> (joined by a double underscore).
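As a minimal sketch of that naming scheme, the snippet below reuses the step names 'vect' and 'clf' from the question. It leaves out the custom LemmaTokenizer so that it runs on its own, and the values 0.7 and (1, 2) are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same step names as in the question; the asker's LemmaTokenizer is omitted
# here so the snippet is self-contained.
pipe = Pipeline([
    ("vect", CountVectorizer(stop_words="english", max_df=0.5)),
    ("clf", MultinomialNB()),
])

# Address a step's parameter as <stepname>__<paramname>.
pipe.set_params(vect__max_df=0.7, vect__ngram_range=(1, 2))

# get_params() lists every key that can be set this way; the same keys
# are valid in a GridSearchCV param_grid.
vect_keys = sorted(k for k in pipe.get_params() if k.startswith("vect__"))
print(vect_keys)
```

Calling pipe.get_params() is a quick way to check a key's spelling before putting it into a grid, since a misspelled parameter name is rejected with an error.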

This can be combined with GridSearchCV to find the combination of parameters, from the given values, that maximizes the score function:

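from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# LemmaTokenizer is the asker's own tokenizer class; X and y are the documents and labels.
p = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(),
                         stop_words='english',
                         strip_accents='unicode',
                         max_df=0.5)),
          ('clf', MultinomialNB())])
# Grid keys use <stepname>__<paramname>; every combination is cross-validated.
g = GridSearchCV(p,
        param_grid={
              'vect__max_df': [0.3, 0.4, 0.5, 0.6, 0.7],
              'vect__ngram_range': [(1, 1), (1, 2), (2, 2)]})
g.fit(X, y)
g.best_estimator_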
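To try the approach end to end without the original data, here is a self-contained sketch. The toy corpus and labels are made up for illustration, the custom tokenizer is left out, and cv=2 is used only because the corpus is tiny. On real data the default number of folds (or more) would be more sensible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (1 = positive, 0 = negative), for illustration only.
texts = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful story", "great fun loved it",
    "terrible movie hated it", "awful acting terrible plot",
    "hated the awful story", "terrible boring hated it",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# 3 max_df values x 3 n-gram ranges = 9 candidate settings.
param_grid = {
    "vect__max_df": [0.5, 0.75, 1.0],
    "vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
}

# Each candidate is scored with 2-fold cross-validation, then the best
# one is refit on all of the data.
search = GridSearchCV(pipe, param_grid=param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)  # winning parameter combination
print(search.best_score_)   # its mean cross-validated accuracy
n_candidates = len(search.cv_results_["params"])
```

After fitting, search.best_estimator_ is the pipeline refit with the winning parameters, and search.cv_results_ holds the per-combination scores if you want to compare all candidates rather than only the best one.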
