
NAs in classCenter function of the R randomForest package

I am trying to retrieve class prototypes for a two-class classification problem using the randomForest package (version 4.6-7) in R 2.13.1. To do this, I call the classCenter function. The problem is that it sometimes returns an invalid result, i.e., one or both of the returned class prototypes consist entirely of NA values. When this happens, I get the following in the R console:

There were 50 or more warnings (use warnings() to see the first 50)

Typing warnings() gives the following 50 times:

1: In mean.default(sort(x, partial = half + 0L:1L)[half +  ... :
  argument is not numeric or logical: returning NA

IMPORTANT: I have noticed that the function gives different output for different random forest models trained on the same data with the same settings, i.e., it may return both class prototypes for one model but none for another. So at least sometimes I get valid results.

I use this code in the R console:

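library(randomForest)
# Column 1 holds the class labels, column 2 is ignored, the rest are features.
mydata <- read.csv("mydata.csv", header=TRUE)
# proximity=TRUE is required so that the proximity matrix is stored in the model.
myrf <- randomForest(x=mydata[,-1:-2], y=mydata[,1], ntree=1000, mtry=33, importance=TRUE, proximity=TRUE)
# Compute the class prototypes from the features, labels and proximity matrix.
mycc <- classCenter(mydata[,-1:-2], mydata[,1], myrf$proximity)
print(mycc)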

The first column of the CSV file contains the class labels; the second one is ignored. There are 5,000 positive and 5,000 negative examples, all with 135 features/variables and no missing values (see below).

I have searched Stack Overflow and Google for a solution to this problem, but to no avail. The documentation for the randomForest package doesn't mention the all-NA return value. I should say that I'm not familiar with R and put this code together from the documentation and intuition.

EDIT: mydata[!complete.cases(mydata),] is empty, i.e., there are no missing (NA) values in the input data. The output of summary(mydata) and mydata[1:10,] can be found here (you might want to view the file in a text editor without word wrap, as the text is formatted wide). The first 10 rows are, of course, not enough to reproduce the error, but I'm not allowed to post the entire dataset.
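
The missing-value check from the edit can be written out next to a column-type check. The following is only a sketch on a small made-up data frame (`toy` and its columns are hypothetical, not the actual dataset). The type check is included because the warning text comes from mean() being handed a non-numeric argument; whether a non-numeric column is really the cause here is an assumption, not something the post confirms.

```r
# Hypothetical toy data standing in for mydata (not the original dataset).
toy <- data.frame(
  label = factor(c("pos", "neg", "pos", "neg")),
  id    = 1:4,
  f1    = c(0.1, 0.2, 0.3, 0.4),
  f2    = c("1.5", "2.5", "3.5", "4.5"),  # numbers stored as text
  stringsAsFactors = FALSE
)

# Number of rows with at least one NA; zero means no missing values.
n_incomplete <- sum(!complete.cases(toy))

# Which feature columns (everything except the first two) are numeric?
features <- toy[, -1:-2]
numeric_cols <- sapply(features, is.numeric)

# as.matrix() on a data frame with one text column yields a character
# matrix, so taking a mean over any of its columns warns and returns NA.
m <- as.matrix(features)
mean_of_text <- suppressWarnings(mean(m[, "f1"]))

print(n_incomplete)
print(numeric_cols)
print(is.character(m))
print(mean_of_text)
```

In this toy case complete.cases() reports no missing values, yet the mean is still NA, because f2 is stored as text and drags the whole matrix to character. Running sapply(mydata[,-1:-2], is.numeric) on the real data would show whether anything similar applies there.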

I was facing the same problem and got rid of it today. You can do one of the following: (1) drop the columns that contain null or NA values, (2) remove the rows that contain null or NA values, or (3) replace the null or NA values in each column with an appropriate substitute such as the mean, median, or mode (if feasible).
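
The three options in this answer could look roughly like the following in base R. This is a minimal sketch on a made-up data frame (`toy` is hypothetical), and it only matters if the data actually contain missing values. Option (3) uses the column median as one possible replacement.

```r
# Hypothetical toy data with missing values (not the original dataset).
toy <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(10, 20, 30, 40),
  c = c(NA, 5, 6, 7)
)

# (1) Drop every column that contains at least one NA.
drop_cols <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

# (2) Drop every row that contains at least one NA.
drop_rows <- toy[complete.cases(toy), , drop = FALSE]

# (3) Replace NAs in each numeric column with that column's median.
imputed <- toy
for (col in names(imputed)) {
  missing <- is.na(imputed[[col]])
  imputed[[col]][missing] <- median(imputed[[col]], na.rm = TRUE)
}

print(names(drop_cols))
print(nrow(drop_rows))
print(imputed)
```

The randomForest package also provides na.roughfix(), which imputes missing values with column medians (or the most frequent level for factors) and may be a shorter route to option (3).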
