Confusion matrix for 10-fold cross-validation: how to do it with a pandas DataFrame df
I'm trying to get a confusion matrix for each of 10 cross-validation folds, for any model (random forest, decision tree, naive Bayes, etc.). I can get a single confusion matrix without problems when I fit a model on an ordinary train/test split, as shown below:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# implementing train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=66)
# random forest model creation
rfc = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
rfc.fit(X_train,y_train)
# predictions
rfc_predict = rfc.predict(X_test)
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
Out[1]:
=== Confusion Matrix ===
[[16243  1011]
 [  827 16457]]

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.95      0.94      0.95     17254
           1       0.94      0.95      0.95     17284

    accuracy                           0.95     34538
   macro avg       0.95      0.95      0.95     34538
weighted avg       0.95      0.95      0.95     34538
But now I want a confusion matrix for each fold of a 10-fold cross-validation. How should I approach this? I tried the following, but it does not work:
# from sklearn import cross_validation
from sklearn.model_selection import KFold, cross_validate
kfold = KFold(n_splits=10)
conf_matrix_list_of_arrays = []
kf = cross_validate(rfc, X, y, cv=kfold)
print(kf)
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    rfc.fit(X_train, y_train)
    conf_matrix = confusion_matrix(y_test, rfc.predict(X_test))
    conf_matrix_list_of_arrays.append(conf_matrix)
The dataset consists of this dataframe dp:
Temperature  Series  Parallel  Shading  Number of cells  Voltage(V)  Current(I)     I/V  Solar Panel  Cell  Shade Percentage  IsShade
         30      10         1        2               10        1.11        2.19    1.97         1985     1              20.0        1
         27       5         2       10               10        2.33        4.16    1.79         1517     3             100.0        1
         30       5         2        7               10        2.01        4.34    2.16         3532     1              70.0        1
         40       2         4        3                8        1.13      -20.87  -18.47         6180     1              37.5        1
         45       5         2        4               10        1.13        6.52    5.77         8812     3              40.0        1
According to its help page, cross_validate does not return the indices used for cross-validation. You need to get the indices from the (Stratified)KFold object yourself. Using an example dataset:
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
data = datasets.load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=66)
skf = StratifiedKFold(n_splits=10, random_state=111, shuffle=True)
rfc = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
We apply cross_val_predict to get an out-of-fold prediction for every training sample:
y_pred = cross_val_predict(rfc, X_train, y_train, cv=skf)
Then use the fold indices to split y_pred and build one confusion matrix per fold:
mats = []
# skf has a fixed random_state, so split() reproduces the folds used by cross_val_predict
for train_index, test_index in skf.split(X_train, y_train):
    mats.append(confusion_matrix(y_train[test_index], y_pred[test_index]))
The first three matrices look like this:
mats[:3]
[array([[13,  2],
        [ 0, 23]]),
 array([[14,  1],
        [ 1, 22]]),
 array([[14,  1],
        [ 0, 23]])]
Check that the sum of the per-fold matrices equals the confusion matrix computed over all predictions at once:
np.add.reduce(mats)
array([[130,  14],
       [  6, 225]])
confusion_matrix(y_train, y_pred)
array([[130,  14],
       [  6, 225]])
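The steps above can be wrapped into a small helper. This is only a sketch: per_fold_confusion_matrices is a hypothetical name, not a scikit-learn function, and it assumes y is a NumPy array and that the splitter has a fixed random_state, so that a second call to split() reproduces the folds cross_val_predict used. Passing labels explicitly is an addition that keeps every matrix the same shape even if a class happens to be missing from one fold. The forest uses fewer trees than above just to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def per_fold_confusion_matrices(model, X, y, cv):
    """Return (list of per-fold confusion matrices, out-of-fold predictions).

    Hypothetical helper for illustration; assumes cv.split() is reproducible.
    """
    labels = np.unique(y)  # fixed label order, so all matrices share one shape
    y_pred = cross_val_predict(model, X, y, cv=cv)
    mats = [
        confusion_matrix(y[test_idx], y_pred[test_idx], labels=labels)
        for _, test_idx in cv.split(X, y)
    ]
    return mats, y_pred


X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=111)
rfc = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=4)

mats, y_pred = per_fold_confusion_matrices(rfc, X, y, skf)
total = np.sum(mats, axis=0)  # element-wise sum over the ten folds
print(total)
```

Because every sample lands in exactly one test fold, total should match confusion_matrix(y, y_pred) and its entries should add up to the number of samples.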
For me, the problem here lies in the incorrect unpacking of kf. Indeed, cross_validate() returns a dictionary of arrays, by default holding the test scores and the fit/score times, so iterating over it does not yield (train, test) index pairs.
Instead, you can use the split() method of your KFold instance, which generates the indices that split the data into training and test (validation) sets. Therefore, by changing the loop header to
for train_index, test_index in kfold.split(X, y):
you should get what you are looking for. The split is taken over X and y because those are the objects the loop body indexes.
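Since the question mentions a pandas dataframe, here is a minimal sketch of the corrected loop for that case. The asker's data is not available, so the frame below is synthetic and its column names are assumptions made for illustration. With a DataFrame, X[train_index] would be read as a column lookup, so rows are selected by position with .iloc. Using clone() to get a fresh, unfitted copy of the model for each fold is a design choice of this sketch, not something the original loop did.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the asker's dataframe (assumed binary target "IsShade").
features, target = make_classification(n_samples=300, n_features=6, random_state=0)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
y = pd.Series(target, name="IsShade")

rfc = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=4)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# cross_validate returns a dict of score/timing arrays, not index pairs.
scores = cross_validate(rfc, X, y, cv=kfold)
print(sorted(scores))

conf_matrix_list_of_arrays = []
for train_index, test_index in kfold.split(X, y):
    # .iloc selects rows by integer position.
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = clone(rfc).fit(X_train, y_train)  # fresh copy for each fold
    conf_matrix_list_of_arrays.append(
        confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
    )

print(len(conf_matrix_list_of_arrays))
```

The list ends up with one 2x2 matrix per fold, and the matrices can be summed as in the other answer to get the overall confusion matrix.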