How to correctly implement StratifiedKFold with RandomizedSearchCV

I am trying to implement a Random Forest classifier using both StratifiedKFold and RandomizedSearchCV. I can see that the "cv" parameter of RandomizedSearchCV is used to do the cross-validation, but I do not understand how that can work in my case. I need the X_train, X_test, y_train and y_test data sets, and if I implement my code the way I have seen it done, I never get those four sets. I have seen things like the following:

cross_val = StratifiedKFold(n_splits=split_number)
clf = RandomForestClassifier()
n_iter_search = 45
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                               n_iter=n_iter_search,
                               scoring=Fscorer, cv=cross_val,
                               n_jobs=-1)
random_search.fit(X, y) 

But I need to fit on the X_train and y_train data sets and then predict on both X_train and X_test, so that I can compare the results on the training data and on the testing data and check for possible overfitting. This is a piece of my code. I know that I am doing the work twice, but I don't know how to use StratifiedKFold and RandomizedSearchCV together correctly:

...
cross_val = StratifiedKFold(n_splits=split_number)
index_iterator = cross_val.split(features_dataframe, classes_dataframe)
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
clf_random = RandomizedSearchCV(estimator = clf, param_distributions = random_grid, n_iter = 100, cv = cross_val,
                                verbose=2, random_state=42, n_jobs = -1)
for train_index, test_index in index_iterator:
    X_train, X_test = np.array(features_dataframe)[train_index], np.array(features_dataframe)[test_index]
    y_train, y_test = np.array(classes_dataframe)[train_index], np.array(classes_dataframe)[test_index]
    clf_random.fit(X_train, y_train)
    clf_list.append(clf_random)
    y_train_pred = clf_random.predict(X_train)
    train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel())*100
    train_accuracy_list.append(train_accuracy)
    y_test_pred = clf_random.predict(X_test)
    test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel())*100

    confusion_matrix = pd.crosstab(y_test.ravel(), y_test_pred.ravel(), rownames=['Actual Cultives'],
                                   colnames=['Predicted Cultives'])
...

As you can see, I am doing the work of the stratified K-fold twice (or that is what I think I am doing...) only to get the four data sets that I need to evaluate my system. Thank you in advance for your help.

RandomizedSearchCV is used to find the best hyperparameters for a classifier. It samples a random combination of parameters and fits your model with it. It then has to evaluate that model, and the cv parameter is where you choose the evaluation strategy. Then it repeats the process with another sampled combination. You don't need to do the splitting yourself a second time. You can just write:

cross_val = StratifiedKFold(n_splits=split_number)
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
# No manual cross_val.split(...) loop is needed: the search calls it internally.
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid, n_iter=100, cv=cross_val,
                                verbose=2, random_state=42, n_jobs=-1)
# X, y are the full feature matrix and labels
clf_random.fit(X, y)

And everything will be done automatically. After that, look at attributes such as cv_results_ or best_estimator_. If you don't want to search for the best parameters for the classifier, you shouldn't use RandomizedSearchCV. That is all it is for.
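To make "done automatically" a bit more concrete, here is a rough sketch of what the search does under the hood, written out by hand. It is only an approximation of the idea, not the library's actual implementation. The toy data from make_classification and the small param_space are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterSampler, StratifiedKFold, cross_val_score

# Toy data standing in for the real features and labels (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space, for illustration only.
param_space = {"n_estimators": [10, 20, 30], "max_depth": [2, 4, 6]}

cv = StratifiedKFold(n_splits=3)
results = []
# Draw a few random combinations, as the search does with n_iter.
for params in ParameterSampler(param_space, n_iter=4, random_state=42):
    model = RandomForestClassifier(random_state=0, **params)
    # One score per stratified fold: fit on k-1 folds, score on the held-out fold.
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results.append((fold_scores.mean(), params))

# Keep the combination with the best mean validation score.
best_score, best_params = max(results, key=lambda r: r[0])
# Refit the winning combination on all the data, similar to what refit=True does.
best_model = RandomForestClassifier(random_state=0, **best_params).fit(X, y)
print(best_params, round(best_score, 3))
```

So the train/validation splitting already happens inside the search for every sampled combination, which is why wrapping it in a second StratifiedKFold loop repeats the work.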

And here is a good example.

UPD: Try to do this:

clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
# The keyword is "scoring", not "score"; return_train_score=True adds the train scores to cv_results_.
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                scoring='accuracy', n_iter=100,
                                cv=StratifiedKFold(n_splits=split_number),
                                return_train_score=True,
                                verbose=2, random_state=42, n_jobs=-1)
clf_random.fit(X, y)
print(clf_random.cv_results_)

Is this what you want?

cv_results_ shows you the test accuracy for every split and every iteration. The matching train accuracy is included as well when return_train_score=True is passed to the search (it is off by default).
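As a small sketch of how the train and test scores can be read side by side, assuming toy data and a made-up param_space (neither is from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Toy data and a hypothetical search space, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_space = {"n_estimators": [10, 20, 30], "max_depth": [2, 4, 6]}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=4,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3),
    return_train_score=True,  # without this, only the test columns are present
    random_state=42,
)
search.fit(X, y)

results = search.cv_results_
# Per-candidate averages over the folds; per-fold values live in
# split0_train_score, split0_test_score, and so on.
for i, params in enumerate(results["params"]):
    train = results["mean_train_score"][i]
    test = results["mean_test_score"][i]
    print(params, f"train={train:.3f}", f"test={test:.3f}", f"gap={train - test:.3f}")
```

A large gap between the train and test columns for a candidate is the overfitting signal the question is after. Note that "test" here means the validation fold inside the cross-validation, not a separate held-out set.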

params = {
    'n_estimators': [200, 500],
    # 'auto' was removed in newer scikit-learn releases; for a classifier it meant 'sqrt'
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
cross_val = StratifiedKFold(n_splits=5)
clf = RandomForestClassifier()
# The grid holds 40 combinations, so n_iter is kept below that; a larger value just tries them all.
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=params, n_iter=20, cv=cross_val,
                                verbose=2, random_state=42, n_jobs=-1, scoring='roc_auc')
# Search on the training portion only, so X_test/y_test stay unseen for the final check.
clf_random.fit(X_train, y_train)
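If you still want the four sets from the question, one common pattern is to hold out a stratified test set first and run the search only on the training part. The following is a minimal sketch of that idea, assuming toy data and a made-up parameter space. It is one way to do it, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split

# Toy data standing in for the real features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratified hold-out split: this is where the four sets come from.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    # Hypothetical search space, for illustration only.
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [2, 4, 6]},
    n_iter=4,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3),  # folds are cut from X_train only
    random_state=42,
)
search.fit(X_train, y_train)

# With the default refit=True, predict() uses the best estimator refit on all of X_train.
train_accuracy = accuracy_score(y_train, search.predict(X_train))
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"train={train_accuracy:.3f} test={test_accuracy:.3f}")
```

The test set is never seen during the search, so comparing train_accuracy with test_accuracy gives the overfitting check without a second manual K-fold loop.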


 