
Fill missing values with missForest

I want to impute missing values using missForest.

Some of my variables have missing values, but not all of them.

When I run the following, it is really slow (it never finishes):

mf_1 <- missForest(dtrain)

But when I do the following, it works fine:

mf_1 <- missForest(dtrain[c(10,11,9,3)])

In the second case, does missForest use all variables to predict the missing values, or only columns 10, 11, 9 and 3?

Subsetting the data frame will only pass those columns to the missForest() function, and as a result, it will only use these variables to impute the data. Here is an example:

library(missForest)
data(iris)

## Introduce missing values (30% of cells) into the first 3 columns only
iris_with_NA <- missForest::prodNA(iris[c(1,2,3)], 0.3)
## Add the last two columns back, without any missing values
iris_with_NA$Petal.Width <- iris$Petal.Width
iris_with_NA$Species <- iris$Species
head(iris_with_NA)

## Uses all 5 variables to impute the missing values
iris_imputed1 <- missForest::missForest(iris_with_NA)$ximp
## Uses only variables 1, 2 and 3 to impute the missing values
iris_imputed2 <- missForest::missForest(iris_with_NA[c(1,2,3)])$ximp

As you can see, the second imputed data set has only 3 columns in total, because those 3 columns are all the information missForest was given.
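To make that concrete, here is a minimal sketch of imputing from a subset and then writing the imputed columns back into the full data frame. The object names (iris_na, imp_subset, iris_filled) and the seed are made up for illustration; they are not part of the original example.

```r
library(missForest)

set.seed(42)  # prodNA() and the random forests are both random
data(iris)

# Missing values in the first three columns only
iris_na <- iris
iris_na[1:3] <- prodNA(iris[1:3], noNA = 0.3)

# Impute from the subset only: Petal.Width and Species are never seen
imp_subset <- missForest(iris_na[1:3])$ximp
names(imp_subset)  # only the columns that were passed in

# Write the imputed columns back into a copy of the full data frame
iris_filled <- iris_na
iris_filled[names(imp_subset)] <- imp_subset
anyNA(iris_filled)
```

The untouched columns are kept in the final data frame, but they played no part in predicting the missing values.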

As for missForest being slow, you could reduce the number of variables you pass in, or lower the ntree argument (default 100) so that fewer trees are grown per forest. Lowering maxiter (default 10) also caps the number of imputation iterations. Any of these options may adversely affect the quality of your results.
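A rough sketch of that trade-off on a small data set is below. The values ntree = 20 and maxiter = 3 are arbitrary choices for illustration, not recommendations, and the timings on iris will be far smaller than on a large dtrain.

```r
library(missForest)

set.seed(42)
data(iris)
iris_na <- prodNA(iris, noNA = 0.2)

# Fewer trees and fewer iterations than the defaults (ntree = 100, maxiter = 10)
t_fast <- system.time(
  fit_fast <- missForest(iris_na, ntree = 20, maxiter = 3)
)
t_default <- system.time(
  fit_default <- missForest(iris_na)
)

# Compare elapsed time
t_fast["elapsed"]
t_default["elapsed"]

# Out-of-bag estimate of the imputation error for each fit
fit_fast$OOBerror
fit_default$OOBerror
```

Comparing OOBerror between the two fits gives a rough idea of how much imputation quality is given up for the speed gain.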
