
scikit-learn random forest: severe overfitting?

I am attempting to apply kNN, logistic regression, decision tree, and random forest to predict a binary response variable.

The former three produce seemingly reasonable accuracy rates, but running the random forest algorithm produces an accuracy rate of over 99% (1127/1128 correct).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# mean 10-fold CV accuracy for forests of 1 to 100 trees
vote_lst = list(range(1, 101))
rf_cv_scores = []
for tree_count in vote_lst:
    maple = RandomForestClassifier(n_estimators = tree_count, random_state = 1618)
    scores = cross_val_score(maple, x, y, cv = 10, scoring = 'accuracy') # 10-fold CV
    rf_cv_scores.append(scores.mean())

# find minimum error's index (i.e. optimal num. of estimators)
rf_error = [1 - score for score in rf_cv_scores]
rf_min_index = 0
min_error = rf_error[0]
for i in range(len(rf_error)):
    if rf_error[i] < min_error:
        rf_min_index = i
        min_error = rf_error[i]
print(rf_min_index + 1) # error minimized w/ 66 estimators

I tuned the random forest hyperparameter n_estimators using the code above. Then, I fit the model on my data:

# fit random forest classifier
forest_classifier = RandomForestClassifier(n_estimators = rf_min_index + 1, random_state = 1618)
forest_classifier.fit(x, y)

# predict test set
y_pred_forest = forest_classifier.predict(x)
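The 1127/1128 figure was presumably computed by scoring these predictions against the same labels the model was fit on, along these lines (a sketch; accuracy_score is an assumption, it does not appear in the original code):

from sklearn.metrics import accuracy_score

# y_pred_forest comes from the exact rows the forest was fit on,
# so this is training accuracy, not an estimate of generalization
print(accuracy_score(y, y_pred_forest))  # ~1127/1128 ≈ 0.999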

I'm concerned that some drastic overfitting occurred here: any ideas?


You're making predictions on the same dataset you trained on above:

y_pred_forest = forest_classifier.predict(x)
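A minimal sketch of evaluating on held-out data instead, assuming the same x and y and scikit-learn's train_test_split (the split fraction and variable names are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# hold out a test split the forest never sees during fitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1618)

forest_classifier = RandomForestClassifier(n_estimators=rf_min_index + 1, random_state=1618)
forest_classifier.fit(x_train, y_train)

# score on the held-out split; compare this number against the other models
y_pred_test = forest_classifier.predict(x_test)
print(accuracy_score(y_test, y_pred_test))

Note that the 10-fold cross-validation scores you already computed (rf_cv_scores) are held-out estimates, so their mean is a far more realistic picture of the forest's performance than the near-perfect training accuracy.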
