Scikit-learn RandomForestClassifier output of predict_proba

Original post: 2015-02-02 16:54:35 | Tags: python, scikit-learn, random-forest

I have a dataset that I split in two for training and testing a random forest classifier with scikit-learn.

I have 87 classes and 344 samples. The output of predict_proba is, most of the time, a 3-dimensional array (87, 344, 2) (it's actually a list of 87 numpy.ndarrays of shape (344, 2)).

Sometimes, when I pick a different subset of samples for training and testing, I only get a 2-dimensional array (87, 344) (though I can't work out in which cases).

My two questions are:

1. What do these dimensions represent? I worked out that to get a ROC AUC score, I have to take one half of the output (that is, (87, 344, 2)[:,:,1]), transpose it, and then compare it with my ground truth (essentially roc_auc_score(ground_truth, output_of_predict_proba[:,:,1].T)). But I don't understand what it really means.
2. Why does the output change with different subsets of the data? I can't understand in which cases it returns a 3D array and in which cases a 2D one.

2 answers

classifier.predict_proba() returns the class probabilities. The size of the class dimension of the array will vary depending on how many classes there are in the subset you train on.

Are you sure the arrays you're using to fit the RF have the right shape? That is, (n_samples, n_features) for the data and (n_samples,) for the target classes. You should get an array Y_pred of shape (n_samples, n_classes), so (344, 87) in your case, where item i of row r is the predicted probability of class i for the sample X[r, :]. Note that sum(Y_pred[r, :]) = 1.
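
To make the expected shapes concrete, here is a minimal sketch on made-up data. The sizes (120 samples, 5 features, 4 classes), the random inputs and the variable names are assumptions for illustration only, not the asker's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 120, 5, 4  # illustrative sizes only

X = rng.normal(size=(n_samples, n_features))
# 1-D target of integer labels; tiling guarantees every class occurs.
y = np.tile(np.arange(n_classes), n_samples // n_classes)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
Y_pred = clf.predict_proba(X)

print(Y_pred.shape)                          # (120, 4), i.e. (n_samples, n_classes)
print(np.allclose(Y_pred.sum(axis=1), 1.0))  # True: each row sums to 1
```

With a 1-D target like this, predict_proba returns a single 2-D array, with one row per sample and one column per class seen during fit.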

However, I think that if your target array Y has shape (n_samples, n_classes), where each row is all zeros except for a one in the column corresponding to the class of the sample, then sklearn takes it as a multi-output prediction problem (it considers each class independently), but I don't think that's what you'd like to do. In that case, for each class and each sample, you would predict the probability of belonging to this class or not.
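
A small sketch of that multi-output situation, again on made-up data (the sizes and variable names are hypothetical). It fits the forest on a one-hot target, shows that predict_proba then returns a list with one (n_samples, 2) array per class column, and shows one possible way to go back to a 1-D label vector with argmax.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 120, 5, 4  # illustrative sizes only

X = rng.normal(size=(n_samples, n_features))
labels = np.tile(np.arange(n_classes), n_samples // n_classes)
Y_onehot = np.eye(n_classes, dtype=int)[labels]  # shape (n_samples, n_classes)

# A 2-D indicator target is treated as one binary problem per column.
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, Y_onehot)
proba = clf.predict_proba(X)
print(type(proba).__name__, len(proba), proba[0].shape)  # list 4 (120, 2)

# Column 1 of each array is P(sample belongs to that class), so stacking
# and slicing gives one "belongs to class" score per sample and class.
pos = np.stack(proba)[:, :, 1].T
print(pos.shape)  # (120, 4)

# Collapsing the one-hot rows to labels restores the single-output case.
y_1d = Y_onehot.argmax(axis=1)
clf_1d = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_1d)
proba_1d = clf_1d.predict_proba(X)
print(proba_1d.shape)  # (120, 4)
```

This is also what the [:,:,1] slice followed by a transpose in the question amounts to: it keeps, for every class column, the probability of the "belongs to this class" outcome.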

Finally, the output does depend on the training set, because it depends on the number of classes in the training set. You can get that number from the attribute n_classes_ (you may also be able to force the number of classes by setting it manually), and you can get the class values from the attribute classes_. See the documentation.
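
As a rough illustration of that dependence, the sketch below fits one forest on all of a toy dataset and another on a subset that happens to miss one class. The data, the sizes and the names full, subset and mask are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(90, 5))
y = np.tile(np.arange(3), 30)  # labels 0, 1, 2, each appearing 30 times

full = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(full.classes_, full.n_classes_)  # [0 1 2] 3

# Hypothetical training subset that contains no sample of class 2.
mask = y != 2
subset = RandomForestClassifier(n_estimators=10, random_state=0).fit(X[mask], y[mask])
print(subset.classes_, subset.n_classes_)  # [0 1] 2

# The forest only knows the classes it saw, so it outputs two columns.
subset_proba = subset.predict_proba(X)
print(subset_proba.shape)  # (90, 2)
```

The columns of predict_proba follow the order of classes_, so checking that attribute after each fit is a simple way to see which classes a given training split actually contained.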