How to use GridSearchCV for comparing multiple models along with pipeline and hyper-parameter tuning in Python

I am using two estimators, a random forest and an SVM:

random_forest_pipeline=Pipeline([   
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('random_forest',RandomForestClassifier())
])
svm_pipeline=Pipeline([
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('svm',LinearSVC())
])

I want to first vectorize the data and then apply the estimator; I was going through this online tutorial. Then I define the hyper-parameters as follows:

parameters=[
    {
        'vectorizer__max_features':[500,1000,1500],
        'random_forest__min_samples_split':[50,100,250,500]
    },
    {
        'vectorizer__max_features':[500,1000,1500],
        'svm__C':[1,3,5]
    }
]

and pass them to GridSearchCV:

pipelines=[random_forest_pipeline,svm_pipeline]
grid_search=GridSearchCV(pipelines,param_grid=parameters,cv=3,n_jobs=-1)
grid_search.fit(x_train,y_train)

But when I run the code I get an error:

TypeError: estimator should be an estimator implementing 'fit' method

I don't know why I am getting this error.

It is quite possible to do this in a single Pipeline / GridSearchCV, based on an example here.

You just have to specify the scoring method explicitly, since the pipeline does not declare its final estimator up front.

Example:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC


my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', 'passthrough')
])


parameters = [
    {
        'vectorizer__max_features': [500, 1000],
        'clf':[RandomForestClassifier()],
        'clf__min_samples_split':[50, 100,]
    },
    {
        'vectorizer__max_features': [500, 1000],
        'clf':[LinearSVC()],
        'clf__C':[1, 3]
    }
]

grid_search = GridSearchCV(my_pipeline, param_grid=parameters, cv=3, n_jobs=-1, scoring='accuracy')
grid_search.fit(X, y)

grid_search.best_params_

> # {'clf': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
#                         criterion='gini', max_depth=None, max_features='auto',
#                         max_leaf_nodes=None, max_samples=None,
#                         min_impurity_decrease=0.0, min_impurity_split=None,
#                         min_samples_leaf=1, min_samples_split=100,
#                         min_weight_fraction_leaf=0.0, n_estimators=100,
#                         n_jobs=None, oob_score=False, random_state=None,
#                         verbose=0, warm_start=False),
#  'clf__min_samples_split': 100,
#  'vectorizer__max_features': 1000}




pd.DataFrame(grid_search.cv_results_)[['param_vectorizer__max_features',
                                       'param_clf__min_samples_split',
                                       'param_clf__C','mean_test_score',
                                       'rank_test_score']]
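
To compare the two classifiers side by side, the per-candidate scores can be grouped by the class of the estimator that filled the 'clf' step. The following is a minimal sketch on a made-up toy corpus; the documents, the small grid values, `n_estimators=25`, and the `model` column name are all assumptions for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: two topics, interleaved so every CV fold sees both.
sports = ["team wins the match", "player scores a goal",
          "coach trains the team", "fans cheer the player"]
cooking = ["chef bakes fresh bread", "recipe needs extra salt",
           "oven roasts the chicken", "cook stirs the soup"]
X = [doc for pair in zip(sports * 5, cooking * 5) for doc in pair]
y = [0, 1] * 20

my_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("clf", "passthrough"),  # placeholder, replaced by each grid below
])

# Grid values are shrunk to suit the toy corpus (assumption).
parameters = [
    {"vectorizer__max_features": [5, 10],
     "clf": [RandomForestClassifier(n_estimators=25, random_state=0)],
     "clf__min_samples_split": [2, 5]},
    {"vectorizer__max_features": [5, 10],
     "clf": [LinearSVC()],
     "clf__C": [1, 3]},
]

grid_search = GridSearchCV(my_pipeline, param_grid=parameters,
                           cv=3, scoring="accuracy")
grid_search.fit(X, y)

# One row per candidate, labelled with the class name of its classifier.
results = pd.DataFrame({
    "model": [type(p["clf"]).__name__ for p in grid_search.cv_results_["params"]],
    "mean_test_score": grid_search.cv_results_["mean_test_score"],
})

# Best mean cross-validated accuracy reached by each model family.
best_per_model = results.groupby("model")["mean_test_score"].max()
print(best_per_model)
```

Reading the model name from `cv_results_["params"]` keeps the summary independent of how the `param_clf` column happens to be rendered.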


The problem is that pipelines=[random_forest_pipeline,svm_pipeline] is a list, and a list has no fit method.
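
This can be checked without running a grid search at all. The sketch below rebuilds the two pipelines from the question and tests for a fit attribute; the variable names `list_has_fit` and `each_has_fit` are introduced here for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

random_forest_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("random_forest", RandomForestClassifier()),
])
svm_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

pipelines = [random_forest_pipeline, svm_pipeline]

# Each pipeline is an estimator, but the list that holds them is not.
list_has_fit = hasattr(pipelines, "fit")
each_has_fit = [hasattr(p, "fit") for p in pipelines]
print(list_has_fit, each_has_fit)
```

GridSearchCV expects a single object with a fit method as its first argument, which is why passing the list triggers the TypeError.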

Even if you could make it work this way, at some point 'random_forest__min_samples_split':[50,100,250,500] would be passed to the svm_pipeline, and this would raise an error such as:

ValueError: Invalid parameter svm for estimator Pipeline

You cannot mix two pipelines this way, because at some point you ask for the svm_pipeline to be evaluated with the values of random_forest__min_samples_split, and that is invalid.
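
The same failure can be reproduced directly with set_params, which is how a grid search applies each candidate to the estimator. This is a small sketch; `error_message` is a name introduced here, and the exact wording of the message differs between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

svm_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

# A parameter that belongs to a step of this pipeline is accepted.
svm_pipeline.set_params(svm__C=3)

# A parameter addressed to a step the pipeline does not have is rejected.
try:
    svm_pipeline.set_params(random_forest__min_samples_split=50)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```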


Solution: Fit one GridSearchCV object for the random forest model and another GridSearchCV object for the SVC model.

pipelines=[random_forest_pipeline,svm_pipeline]

grid_search_1=GridSearchCV(pipelines[0],param_grid=parameters[0],cv=3,n_jobs=-1)
grid_search_1.fit(X,y)

grid_search_2=GridSearchCV(pipelines[1],param_grid=parameters[1],cv=3,n_jobs=-1)
grid_search_2.fit(X,y)

Full code:

random_forest_pipeline=Pipeline([   
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('random_forest',RandomForestClassifier())
])
svm_pipeline=Pipeline([
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('svm',LinearSVC())
])

parameters=[
    {
        'vectorizer__max_features':[500,1000,1500],
        'random_forest__min_samples_split':[50,100,250,500]
    },
    {
        'vectorizer__max_features':[500,1000,1500],
        'svm__C':[1,3,5]
    }
]

pipelines=[random_forest_pipeline,svm_pipeline]

# gridsearch only for the Random Forest model
grid_search_1 =GridSearchCV(pipelines[0],param_grid=parameters[0],cv=3,n_jobs=-1)
grid_search_1.fit(X,y)

# gridsearch only for the SVC model
grid_search_2 =GridSearchCV(pipelines[1],param_grid=parameters[1],cv=3,n_jobs=-1)
grid_search_2.fit(X,y)
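
Once both searches are fitted, the models can be compared through best_score_, provided both searches use the same data, folds, and scoring. A minimal sketch on a made-up toy corpus follows; the documents, the reduced grid values, `n_estimators=25`, and the `searches` dictionary are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: two topics with interleaved labels.
sports = ["team wins the match", "player scores a goal",
          "coach trains the team", "fans cheer the player"]
cooking = ["chef bakes fresh bread", "recipe needs extra salt",
           "oven roasts the chicken", "cook stirs the soup"]
X = [doc for pair in zip(sports * 5, cooking * 5) for doc in pair]
y = [0, 1] * 20

pipelines = [
    Pipeline([("vectorizer", CountVectorizer(stop_words="english")),
              ("random_forest", RandomForestClassifier(n_estimators=25, random_state=0))]),
    Pipeline([("vectorizer", CountVectorizer(stop_words="english")),
              ("svm", LinearSVC())]),
]

# Grid values are shrunk to suit the toy corpus (assumption).
parameters = [
    {"vectorizer__max_features": [5, 10],
     "random_forest__min_samples_split": [2, 5]},
    {"vectorizer__max_features": [5, 10],
     "svm__C": [1, 3]},
]

# One search per pipeline, each with its own grid.
searches = {}
for name, pipeline, grid in zip(["random_forest", "svm"], pipelines, parameters):
    search = GridSearchCV(pipeline, param_grid=grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    searches[name] = search

# Pick the model whose best candidate has the highest mean CV accuracy.
best_name = max(searches, key=lambda n: searches[n].best_score_)
print(best_name, searches[best_name].best_params_)
```

Passing the same scoring and cv to both searches is what makes the two best_score_ values comparable.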

EDIT

If you explicitly put the models into the param_grid list, then a single search is possible, as the documentation shows.

Link: https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html?highlight=pipeline%20gridsearch

Code from the docs:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2

print(__doc__)

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', 'passthrough'),
    ('classify', LinearSVC(dual=False, max_iter=10000))
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
X, y = load_digits(return_X_y=True)
grid.fit(X, y)
