P-Value for Random Forest

I'm new to R, so sorry if this question is trivial. I am trying to calculate a p-value for my Random Forest classification by shuffling the class labels. Here is an example using the iris data set with my code so far:

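     library(randomForest)
     library(caret)  # provides confusionMatrix()

     # Bootstrap sample of row names (with replacement), 80% of the data size
     rows <- sample(rownames(iris), replace = TRUE, size = length(rownames(iris))*0.8)
     train <- iris[rows,]
     # Rows never drawn into the bootstrap sample form the validation set
     validation <- iris[-as.numeric(unique(rows)),]

     fit <- randomForest::randomForest(Species ~ .,
                               data=train,
                               importance=TRUE,
                               ntree=1000)
     Prediction <- predict(fit, validation)
     confmatrix <- table(validation[,"Species"], Prediction)
     confusionMatrix(confmatrix)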

I read about a package called rfPermute. After reading the help page I came up with the following code:

     rfPermute(Species ~ ., data = iris, ntree = 100, na.action = na.omit, nrep = 50)$pval

Here is my problem: I don't understand the output (scaled and unscaled); sorry, I'm not a statistician, and after reading I still don't get the difference. Is it possible to obtain a single p-value out of those many, e.g. by calculating the median of all p-values? The question I want to address is whether the result of my Random Forest occurred by chance or is significant. I'm not interested in one particular feature or one specific class.

Thanks for your help!

There is a difference between scaling a variable and not scaling it. When you scale the variables of your dataset, you aim for all of them to have the same variance (usually 1). This makes variables with many outliers, extreme values, etc. comparable with the other variables. Thus the two arrays show the results of the algorithm once with scaled variables and once with unscaled variables.
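To make the idea of scaling to unit variance concrete, here is a minimal sketch using base R's `scale()` on the numeric iris columns. It only illustrates the general concept described above and is not taken from rfPermute. Note also that `importance()` in randomForest has its own `scale` argument, which divides permutation-based importance by its standard error, so it is worth checking the rfPermute help page to see which kind of scaling its output refers to.

```r
# Numeric predictors of iris (columns 1 to 4)
num <- iris[, 1:4]

# Variances of the raw variables differ from one another
raw_var <- apply(num, 2, var)

# scale() centers each column to mean 0 and divides by its standard deviation
scaled <- scale(num)

# After scaling, every column has variance 1
scaled_var <- apply(scaled, 2, var)

print(raw_var)
print(scaled_var)
```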

The next thing you need to be clear about is what the algorithm you run actually does. Blindly running an algorithm you don't understand will do your research more harm than good. You can find plenty of explanations online with a quick search.

The output you are interested in can't be summarized in one p-value. However, the output gives you p-values for each Species, each of which has its own grown tree. There you can see which tree is statistically significant. The whole output is important, because it shows for which species you can draw statistically significant conclusions.
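If a single overall p-value for the classifier is still wanted, one possible approach is the label-shuffling idea mentioned in the question, applied to the whole model rather than to rfPermute's per-variable output. The following is only a sketch under stated assumptions: out-of-bag (OOB) accuracy is used as the test statistic, and the helper `oob_accuracy`, the number of permutations `n_perm`, the tree count and the seed are all illustrative choices, not anything prescribed by randomForest or rfPermute.

```r
library(randomForest)

set.seed(42)  # assumption: fixed seed, only for reproducibility

# Hypothetical helper: OOB accuracy of a forest fitted to the given labels.
# For classification, fit$predicted holds the out-of-bag predictions.
oob_accuracy <- function(labels, predictors, ntree = 200) {
  fit <- randomForest(x = predictors, y = labels, ntree = ntree)
  mean(fit$predicted == labels, na.rm = TRUE)
}

predictors <- iris[, names(iris) != "Species"]

# Accuracy with the true class labels
observed <- oob_accuracy(iris$Species, predictors)

# Null distribution: refit after shuffling the class labels each time
n_perm <- 100  # illustrative; more permutations give a finer p-value
null_acc <- replicate(n_perm, oob_accuracy(sample(iris$Species), predictors))

# Share of shuffled fits at least as accurate as the real one;
# adding 1 to both counts keeps the p-value from being exactly 0
p_value <- (sum(null_acc >= observed) + 1) / (n_perm + 1)
print(p_value)
```

The smallest p-value this can report is `1 / (n_perm + 1)`, so the number of permutations limits the resolution of the test.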

Hope I answered your question.
