
Does randomForest [R] not accept a logical variable as the response, but accept it as a predictor?

Hi, I'm using randomForest in R and it doesn't accept a logical variable as the response (Y), but it seems to accept one as a predictor (X). I'm a little surprised because I thought a logical was essentially a 2-class factor...

My question is: is it true that randomForest accepts logicals as predictors but not as the response? Why is it like this? Do other common models (glmnet, svm, ...) accept logical variables?

Any explanation/discussion is appreciated. Thanks

library(randomForest)

N = 100

data1 = data.frame(age = sample(1:80, N, replace = TRUE),
                   sex = sample(c('M', 'F'), N, replace = TRUE),
                   veteran = sample(c(TRUE, FALSE), N, replace = TRUE),
                   exercise = sample(c(TRUE, FALSE), N, replace = TRUE),
                   stringsAsFactors = TRUE) # needed in R >= 4.0 so sex becomes a factor

sapply(data1, class)
#       age       sex   veteran  exercise 
# "integer"  "factor" "logical" "logical" 

# this doesn't work as intended because exercise is logical, so regression is attempted
rf = randomForest(exercise ~ ., data = data1, importance = T)
# Warning message:
#         In randomForest.default(m, y, ...) :
#         The response has five or fewer unique values.  Are you sure you want to do regression?

# this works, and veteran and exercise (logical) work as predictors
rf = randomForest(sex ~ ., data = data1, importance = T)
importance(rf)
#                   F         M MeanDecreaseAccuracy MeanDecreaseGini
# age      -2.0214486 -7.584637            -6.242150         6.956147
# veteran   4.6509542  3.168551             4.605862         1.846428
# exercise -0.1205806 -6.226174            -3.924871         1.013030

# convert it to factor and it works
rf = randomForest(as.factor(exercise) ~ ., data = data1, importance = T)
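
One caveat with the factor workaround (a side note, not from the original post): predict() will then return a factor with levels "FALSE"/"TRUE" rather than a logical vector, so you may want to convert back. The round trip in base R looks like this (the pred vector below is a stand-in for the output of predict(rf, newdata)):

```r
# stand-in for what predict() returns after the as.factor() conversion
pred <- factor(c("TRUE", "FALSE", "TRUE"))

# going via as.character() is the safe route; as.logical() on the factor's
# underlying integer codes would give wrong results
as.logical(as.character(pred))
# TRUE FALSE TRUE
```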

The reason for this behaviour is that randomForest can also do regression (in addition to classification). You can see this in the warning message you got:

The response has five or fewer unique values. Are you sure you want to do regression?

The function decides between regression and classification based on the type of the given response vector: if it is a factor, classification is done; otherwise regression is done (which makes sense, as a regression response vector will never be a factor / categorical variable).
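This dispatch can be illustrated with a minimal sketch (a simplification for illustration, not the actual randomForest source code):

```r
# simplified sketch of how randomForest.default picks its mode:
# a factor response means classification, anything else means regression
choose_mode <- function(y) {
  if (is.factor(y)) "classification" else "regression"
}

choose_mode(factor(c(TRUE, FALSE)))  # "classification"
choose_mode(c(TRUE, FALSE))          # "regression" -- a logical is not a factor
```

This is why the logical response silently falls through to the regression branch, which then emits the "five or fewer unique values" warning.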

Regarding your question: it is no problem to use logical variables in your input dataset (as predictors); randomForest handles them exactly as you would expect.

training_data <- data.frame(x = rep(c(T,F), times = 1000)) # training data with logical
response <- as.factor(rep(c(F,T), times = 1000)) # inverse of training data
randomForest(response ~ ., data = training_data) # returns 100% accurate classifier

EDIT:

Why don't they include this coercion (logical to factor) in the source code?

It's speculation, but it is probably for consistency and simplicity. They would have to change the documentation from

If a factor, classification is assumed, otherwise regression is assumed

to

If a factor or a logical vector, classification is assumed, otherwise regression is assumed

And then people might show up asking for character support, too. You can also run into issues if your logical response vector contains only TRUE or only FALSE values: coercing such a vector to a factor yields a factor with a single level. (Although it does not really make sense to train a model on a dataset where the outcome is always FALSE.)

But if the authors included such a more "intelligent" coercion, they would have to deal with those questions, define the behaviour in these border cases, and document it.
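The single-level border case is easy to reproduce in base R:

```r
y <- rep(TRUE, 10)   # logical response where every observation is TRUE
f <- factor(y)
levels(f)            # only one level: "TRUE"
nlevels(f)           # 1

# a classifier cannot be trained on a single class; randomForest would
# refuse such a response rather than silently fit a degenerate model
```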
