
P-Value for Random Forest

I'm new to R, so sorry if this question is trivial. I am trying to calculate a p-value for my Random Forest classification by shuffling the class labels. Here is an example using the iris data set with my code so far:

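     library(randomForest)
     library(caret)  # provides confusionMatrix()

     # Bootstrap sample (with replacement) for training;
     # rows that were never drawn form the validation set
     rows <- sample(rownames(iris), replace = TRUE, size = 0.8 * nrow(iris))
     train <- iris[rows, ]
     validation <- iris[!(rownames(iris) %in% rows), ]

     fit <- randomForest(Species ~ .,
                         data = train,
                         importance = TRUE,
                         ntree = 1000)
     Prediction <- predict(fit, validation)
     # caret expects predictions in rows and reference classes in columns
     confmatrix <- table(Prediction, validation[, "Species"])
     confusionMatrix(confmatrix)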

I read about a package called rfPermute. After reading the help page I came up with the following code:

     rfPermute(Species ~ ., data = iris, ntree = 100, na.action = na.omit, nrep = 50)$pval

Here is my problem: I don't understand the output (scaled and unscaled). Sorry, I'm not a statistician, and after reading up on it I still don't get the difference. Is it possible to obtain a single p-value out of those many, e.g. by calculating the median of all p-values? The question I want to address is whether the result of my Random Forest occurred by chance or is significant. I'm not interested in one particular feature or one specific class.

Thanks for your help!

There is a difference between the scaled and the unscaled results. In randomForest, scaling applies to the importance measures rather than to the variables of your dataset: the scaled values are the permutation importances divided by their standard errors, while the unscaled values are the raw importances. Scaling is meant to make the importances of different variables easier to compare with one another. Thus the two arrays hold the results of the algorithm once for the scaled importances and once for the unscaled ones.
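To see what scaling does to an importance measure, the two variants can be computed side by side with randomForest alone. This is a minimal sketch: the seed and `ntree` value are arbitrary choices, and it is an assumption on my part that the scaled and unscaled slices in the rfPermute output correspond to the `scale` argument of `importance()` shown here.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only for reproducibility
fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# Permutation importance (mean decrease in accuracy), raw values
unscaled <- importance(fit, type = 1, scale = FALSE)

# Same measure, divided by its standard error
scaled <- importance(fit, type = 1, scale = TRUE)

# One row per predictor, both variants next to each other
cbind(unscaled = unscaled[, 1], scaled = scaled[, 1])
```

The ranking of the predictors can differ between the two columns, which is why a package may report p-values for both.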

The next thing you need to clarify for yourself is what the algorithm you are running actually does. Blindly running an algorithm you don't understand will do your research more harm than good. You can find plenty of explanations online if you search for them.

The output you are interested in can't be summarized in one p-value. Instead, the output gives you a p-value for the importance of each predictor, both for each species separately and for the forest as a whole. There you can see which predictors are statistically significant. The whole output is important, because it shows you for which species you can make statistically significant statements.
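The rfPermute p-values are per predictor, so they do not directly answer the whole-model question. The label-shuffling idea from the question can be tested on its own by comparing the out-of-bag (OOB) error on the real labels with the OOB error on shuffled labels. This is a minimal sketch, not part of rfPermute: `oob_error` is a hypothetical helper, and `n_perm`, `ntree` and the seed are arbitrary choices.

```r
library(randomForest)

set.seed(1)    # arbitrary seed, only for reproducibility
n_perm <- 100  # number of label shuffles; more gives a finer-grained p-value

# Hypothetical helper: OOB error rate of a forest grown on labels y
oob_error <- function(y) {
  fit <- randomForest(x = iris[, 1:4], y = y, ntree = 200)
  # fit$predicted holds the out-of-bag predictions for a classification forest
  mean(fit$predicted != y, na.rm = TRUE)
}

# Error with the real class labels
observed <- oob_error(iris$Species)

# Null distribution: same predictors, class labels shuffled
null_errors <- replicate(n_perm, oob_error(sample(iris$Species)))

# Share of shuffles that do at least as well as the real labels,
# with the usual +1 correction so the p-value is never exactly 0
p_value <- (1 + sum(null_errors <= observed)) / (1 + n_perm)
p_value
```

The smallest p-value this can return is 1 / (n_perm + 1), so the number of shuffles limits how small a p-value you can report. Using the OOB error avoids a separate train/validation split, but the same scheme should work with a held-out set.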

Hope I answered your question.
