Scikit-learn RandomForestClassifier output of predict_proba
I have a dataset that I split in two for training and testing a random forest classifier with scikit-learn. I have 87 classes and 344 samples. The output of predict_proba is, most of the time, a 3-dimensional array of shape (87, 344, 2) (it's actually a list of 87 numpy.ndarrays, each of shape (344, 2)).
Sometimes, when I pick a different subset of samples for training and testing, I only get a 2-dimensional array of shape (87, 344) (though I can't work out in which cases).
My two questions are:
[...] (87, 344, 2)[:,:,1], transpose it, and then compare it with my ground truth (essentially roc_auc_score(ground_truth, output_of_predict_proba[:,:,1].T)). But I don't understand what it really means.

classifier.predict_proba() returns the class probabilities. The class dimension of the array varies with how many classes are present in the subset you train on.
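To illustrate that last point, here is a minimal sketch on made-up data (the six-class toy dataset and the helper proba_shape are assumptions for illustration, not part of the original post). It fits the same classifier on two training subsets and compares the shape of predict_proba:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): 60 samples, 4 features, six class labels 0..5.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = np.arange(60) % 6


def proba_shape(train_idx):
    """Fit on the given training rows and return the predict_proba shape."""
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    return clf.predict_proba(X).shape


all_idx = np.arange(60)
shape_full = proba_shape(all_idx)             # all six classes seen in training
shape_partial = proba_shape(all_idx[y < 4])   # classes 4 and 5 never seen

print(shape_full, shape_partial)
```

The second shape has fewer columns because the forest only knows about the classes it saw during fit.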
Are you sure the arrays you're using to fit the RF have the right shape? It should be (n_samples, n_features) for the data and (n_samples,) for the target classes. You should get an array Y_pred of shape (n_samples, n_classes), so (344, 87) in your case, where item i of row r is the predicted probability of class i for the sample X[r,:]. Note that sum(Y_pred[r,:]) = 1.
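As a quick check of the shapes described above, the following sketch uses synthetic data (the sizes 120, 5 and 4 are arbitrary assumptions) with a 1-D target and confirms that each row of the output sums to 1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical sizes): X is (n_samples, n_features),
# y is a 1-D array of class labels with shape (n_samples,).
rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 120, 5, 4
X = rng.rand(n_samples, n_features)
y = np.arange(n_samples) % n_classes

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# One row per sample, one column per class seen during fit.
Y_pred = clf.predict_proba(X)
row_sums = Y_pred.sum(axis=1)

print(Y_pred.shape)
print(np.allclose(row_sums, 1.0))
```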
However, I think that if your target array Y has shape (n_samples, n_classes), where each row is all zeros except for a one in the column corresponding to the sample's class, then sklearn treats it as a multi-output prediction problem (it considers each class independently), but I don't think that's what you'd like to do. In that case, for each class and each sample, you would predict the probability of belonging to this class or not.
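The difference between the two target layouts can be sketched as follows. The data is synthetic and the variable names (Y_onehot, proba_list, scores) are assumptions for illustration. With a one-hot target, predict_proba returns one (n_samples, 2) array per class column, and taking [:, :, 1] selects the probability that each indicator equals 1, which is what the transposed slice in the question feeds to roc_auc_score:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n_samples, n_classes = 120, 4
X = rng.rand(n_samples, 5)
y = np.arange(n_samples) % n_classes

# One-hot (indicator) target of shape (n_samples, n_classes).
Y_onehot = np.eye(n_classes, dtype=int)[y]

# Fitting on the 2-D target is treated as a multi-output problem:
# predict_proba returns a list with one array per output column.
multi = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, Y_onehot)
proba_list = multi.predict_proba(X)

# Column 1 of each array is P(indicator == 1); stack and transpose to
# obtain one score per (sample, class), shape (n_samples, n_classes).
scores = np.array(proba_list)[:, :, 1].T
auc = roc_auc_score(Y_onehot, scores)

# Collapsing the one-hot rows to 1-D labels gives the usual multiclass
# output: a single (n_samples, n_classes) array.
y_1d = Y_onehot.argmax(axis=1)
single = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_1d)
proba_2d = single.predict_proba(X)

print(len(proba_list), proba_list[0].shape, scores.shape, proba_2d.shape)
```

Note that the AUC here is computed on the training data, so it only demonstrates the shapes involved and says nothing about generalization.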
Finally, the output does depend on the training set, because it depends on the number of classes present in the training set. You can get that number from the attribute n_classes_ (and you may also be able to force the number of classes by setting it manually), and you can get the class values from the attribute classes_. See the documentation.
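As a sketch of how these attributes can be used (the toy data and the all_labels array are hypothetical), one option that avoids modifying the fitted model is to read classes_ and copy the probabilities into a zero-filled array covering every label you expect:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): six labels overall, but class 2 is absent
# from the training subset.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = np.arange(60) % 6
all_labels = np.arange(6)  # assumed to be sorted and known in advance

train = y != 2
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X[train], y[train])

print(clf.n_classes_)  # number of classes seen during fit
print(clf.classes_)    # their label values, in column order

# Place each predicted column at the position of its label; labels the
# model never saw keep probability 0.
proba = clf.predict_proba(X)
full = np.zeros((X.shape[0], all_labels.size))
cols = np.searchsorted(all_labels, clf.classes_)
full[:, cols] = proba

print(full.shape)
```

This keeps the output shape fixed across different train/test splits, at the cost of assuming the full label set is known beforehand.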
Hope it helps!