R use aggregate() output for imputing NA

I have a data set that I'd like to impute missing values for. Instead of using column medians, I'd like to use a category median. I can create the aggregation, but I'm wondering what the best way to integrate the two pieces is. Here's a toy dataset.

df1 <- iris

set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Sepal.Length'] <- NA

set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Sepal.Width'] <- NA

set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Petal.Length'] <- NA

set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Petal.Width'] <- NA

agg1 <- aggregate(. ~ Species, data = df1, FUN = median)

I know I can use a bunch of ifelse()'s and loops to do this, but I assume there's a more elegant way. Any suggestions would be appreciated.

EDIT: Here's what I came up with on my own:

for(i in names(df1)[sapply(df1, is.numeric)]){  # i = "Sepal.Length"

    for(k in agg1$Species){
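        # fill only the NAs in rows of species k; without the Species check,
        # every NA would receive the first species' median
        df1[,i] <- ifelse(is.na(df1[,i]) & df1$Species == k, agg1[which(agg1$Species == k),i], df1[,i])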
    }

}

There are a couple of ways to vectorize this operation.

If the order of rows is unimportant (i.e., you're happy to append all the imputed rows at the end), then the following is an option:

df2 <- rbind(na.omit(df1),
             agg1[match(df1[!complete.cases(df1), 'Species'], agg1$Species), ])

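Alternatively, merge can be used to fill the rows in place, which keeps df1's row order (this is probably preferable). Note that merge sorts its result by Species, so the positional assignment relies on the incomplete rows already being ordered by Species, as they are in iris:

df1[!complete.cases(df1), -5] <- 
  merge(agg1, df1[!complete.cases(df1), 'Species', drop=FALSE], 
        by='Species')[, names(df1)[-5]]  # select the four numeric columns by name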
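
A variant that does not depend on row order at all can be sketched with match(). This is only an illustration: the names na_rows, num_cols and idx are invented for the example. It looks up each row's species in agg1 and fills only the missing cells, so it would also work for rows where just some columns are NA (keeping in mind that aggregate() with a formula drops every row containing an NA before taking the medians).

```r
# Rebuild the toy data: the same 30 rows are NA in all four numeric columns
df1 <- iris
set.seed(456)
na_rows <- sample(nrow(df1), 30)
num_cols <- names(df1)[sapply(df1, is.numeric)]
df1[na_rows, num_cols] <- NA

# Per-species medians, computed from the complete rows only
agg1 <- aggregate(. ~ Species, data = df1, FUN = median)

# For every row of df1, the index of its species in agg1
idx <- match(df1$Species, agg1$Species)

# Fill each missing cell with the median of that row's species
for (col in num_cols) {
  miss <- is.na(df1[[col]])
  df1[[col]][miss] <- agg1[[col]][idx[miss]]
}
```

Because the lookup is done per row, the result is the same whether or not df1 is sorted by Species.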

You could also use dplyr together with tidyr:

library(dplyr)
library(tidyr)

Get the median values

dfMed <- df1 %>%
  gather(Var, Val, Sepal.Length:Petal.Width) %>%
  group_by(Species, Var) %>%
  summarize(Val = median(Val, na.rm = TRUE)) %>%
  spread(Var, Val)


 dfMed
# Source: local data frame [3 x 5]

#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          5.0         3.4         1.45         0.2
# 2 versicolor          5.9         2.9         4.40         1.3
# 3  virginica          6.4         3.0         5.50         2.0

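Then inner_join the Species of the incomplete rows of df1 with the medians, and select the columns by name so that they line up with df1 regardless of the column order in dfMed:

dfJoin <- inner_join(df1 %>% filter(!complete.cases(.)) %>% select(Species),
                     dfMed, by = "Species")[, names(df1)]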

Finally, replace the rows that have missing values with dfJoin:

df1[!complete.cases(df1), ] <- dfJoin

Using data.table:

First, convert your data to a data.table (setDT does this by reference):

library(data.table)
setDT(df1)

Then we compute agg1:

agg1 = df1[, lapply(.SD, median, na.rm=TRUE), by=Species]
setcolorder(agg1, chmatch(names(df1), names(agg1)))

Now we replace the NAs with these values by reference (no copy is made). The lookup in agg1 is a binary-search-based subset (much faster than a vector scan), done once, and only for those rows where all four values are NA:

cols = names(df1)
setkey(agg1, Species)
df1[is.na(Sepal.Length) & is.na(Sepal.Width) & is.na(Petal.Length) & 
    is.na(Petal.Width), (cols) := agg1[J(Species)]]

The condition in i is spelled out completely because using complete.cases would also select rows that have NA in only one or some of the columns, which, as I understand it, shouldn't be replaced.
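
The difference between the two row filters can be seen on a tiny made-up data frame (x, any_na and all_na are names chosen for this illustration only, and plain base R is used rather than data.table). !complete.cases() flags a row as soon as one value is missing, whereas the spelled-out condition corresponds to rows where every value is missing:

```r
# Toy data: row 2 is partly missing, row 3 is entirely missing
x <- data.frame(a = c(1, NA, NA), b = c(2, 5, NA))

# Rows with at least one NA
any_na <- !complete.cases(x)

# Rows where every column is NA
all_na <- rowSums(is.na(x)) == ncol(x)

print(any_na)  # FALSE  TRUE  TRUE
print(all_na)  # FALSE FALSE  TRUE
```

Row 2 is the case the answer is guarding against: it is incomplete, but its observed value of b should not be overwritten.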

Here's what I ended up using:

imputeMed <- function(x){
    medX <- median(x, na.rm = T)
    x <- ifelse(is.na(x), medX, x)
    return(x)
}

vtu1 <- names(df1)[sapply(df1, is.numeric)]
specLev <- unique(as.character(df1$Species))

for(i in specLev){  # i = specLev[1]

    df1[df1$Species == i, vtu1] <- as.data.frame(lapply(df1[df1$Species == i, vtu1], imputeMed))

}
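
For comparison, the same per-species imputation can be sketched without an explicit loop using base R's ave(), which applies a function within each group and returns the values in their original positions. This is an alternative sketch, not the code used above; fill_med, num_cols and filled are hypothetical names. The toy data here draws different rows for each column to show that partially missing rows are handled too.

```r
# Toy data with NAs in different rows for each numeric column
df1 <- iris
num_cols <- names(df1)[sapply(df1, is.numeric)]
set.seed(456)
for (col in num_cols) df1[sample(nrow(df1), 30), col] <- NA

# Replace the NAs of a vector with the median of its observed values
fill_med <- function(v) replace(v, is.na(v), median(v, na.rm = TRUE))

# Apply fill_med within each species, column by column
filled <- df1
filled[num_cols] <- lapply(df1[num_cols],
                           function(x) ave(x, df1$Species, FUN = fill_med))
```

Note that FUN must be passed by name to ave(), since any unnamed arguments after x are treated as grouping variables.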
