I have a data set with missing values that I'd like to impute. Instead of using column medians, I'd like to use a category median. I can create an aggregation, but I'm wondering what the best way is to integrate the two pieces. Here's a toy dataset.
df1 <- iris
set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Sepal.Length'] <- NA
set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Sepal.Width'] <- NA
set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Petal.Length'] <- NA
set.seed(456)
df1[sample(nrow(df1), 30, replace = F), 'Petal.Width'] <- NA
agg1 <- aggregate(. ~ Species, data = df1, FUN = median)
I know I can use a bunch of ifelse()'s and loops to do this, but I assume there's a more elegant way. Any suggestions would be appreciated.
EDIT: Here's what I came up with on my own:
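for(i in names(df1)[sapply(df1, is.numeric)]){ # i = "Sepal.Length"
  for(k in agg1$Species){
    # restrict the replacement to species k; otherwise the first pass fills every NA with the first species' median
    df1[,i] <- ifelse(is.na(df1[,i]) & df1$Species == k, agg1[which(agg1$Species == k),i], df1[,i])
  }
}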
There are a couple of ways to vectorize this operation.
If the order of rows is unimportant (i.e., you're happy to append all the imputed rows at the end), then the following is an option:
df2 <- rbind(na.omit(df1),
agg1[match(df1[!complete.cases(df1), 'Species'], agg1$Species), ])
Alternatively, merge can be used to retain the row order (this is probably preferable):
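# merge() returns Species plus the four medians, sorted by Species; this lines up
# with the NA rows only because df1 (iris) is already ordered by Species
df1[!complete.cases(df1), -5] <-
  merge(agg1, df1[!complete.cases(df1), 'Species', drop=FALSE],
        by='Species')[, names(df1)[-5]]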
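Both options above rely on every incomplete row missing all four measurements, which holds for the toy data. If some rows were only partly missing, a per-column fill would be needed instead. Here is a minimal base R sketch of that idea using ave(); impute_median, num_cols, na_rows and df2 are names made up for this example, and the toy data is rebuilt so the snippet runs on its own.

```r
# Toy data as in the question: the same 30 rows are blanked in every numeric column.
df1 <- iris
num_cols <- names(df1)[sapply(df1, is.numeric)]
set.seed(456)
na_rows <- sample(nrow(df1), 30)
df1[na_rows, num_cols] <- NA

# Hypothetical helper (not part of base R): fill NAs with the median
# of the observed values in the same vector.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# ave() applies the helper within each Species group and returns the
# values in the original row order, so no merge or re-sorting is needed.
df2 <- df1
df2[num_cols] <- lapply(df1[num_cols], function(col) {
  ave(col, df1$Species, FUN = impute_median)
})

anyNA(df2)
```

Because each column is handled separately, observed values in partly missing rows are left alone. A species with no observed values in a column would still end up with NA there.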
You could also use dplyr and tidyr:
library(dplyr)
library(tidyr)
Get the median values
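dfMed <- df1 %>%
  # factor_key = TRUE keeps the original column order when spreading back,
  # which the positional assignment further down depends on
  gather(Var, Val, Sepal.Length:Petal.Width, factor_key = TRUE) %>%
  group_by(Species, Var) %>%
  summarize(Val = median(Val, na.rm = TRUE)) %>%
  spread(Var, Val)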
dfMed
# Source: local data frame [3 x 5]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 setosa 5.0 3.4 1.45 0.2
# 2 versicolor 5.9 2.9 4.40 1.3
# 3 virginica 6.4 3.0 5.50 2.0
inner_join the result with the NA rows of df1
dfJoin <- inner_join(dfMed, df1%>%
do(filter(., !complete.cases(.))), by="Species")[,c(2:5,1)]
Replace the rows that have missing values with dfJoin
df1[!df1%>% complete.cases(),] <- dfJoin
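For comparison, the same imputation can probably be written as a single grouped mutate, without reshaping or joining. This is only a sketch and assumes dplyr 1.0.0 or later for across() and where(); na_rows and df_imputed are names invented for the example, and the toy data is rebuilt so the snippet is self-contained.

```r
library(dplyr)

# Toy data as in the question: the same 30 rows are blanked in all four measurements.
df1 <- iris
set.seed(456)
na_rows <- sample(nrow(df1), 30)
df1[na_rows, 1:4] <- NA

# Within each Species, replace the NAs of every numeric column with that
# column's median over the observed values. Row order is unchanged.
df_imputed <- df1 %>%
  group_by(Species) %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE)))) %>%
  ungroup()

anyNA(df_imputed)
```

Unlike the join above, this does not depend on column positions, and it also fills rows where only some of the measurements are missing.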
Using data.table:
First we convert your data to a data.table:
library(data.table)
setDT(df1)
Then we get agg1:
agg1 = df1[, lapply(.SD, median, na.rm=TRUE), by=Species]
setcolorder(agg1, chmatch(names(df1), names(agg1)))
Now we replace the NAs with these values by reference (no copy is made), using a binary-search-based subset on agg1 (much faster than a vector scan), once, only on those rows where every value is NA:
cols = names(df1)
setkey(agg1, Species)
df1[is.na(Sepal.Length) & is.na(Sepal.Width) & is.na(Petal.Length) &
is.na(Petal.Width), (cols) := agg1[J(Species)]]
The condition in i is spelt out completely because using complete.cases could also pick up rows that have NA in just one or some column(s) in your data set, which, as I understand it, shouldn't be replaced.
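If, on the other hand, rows with only some values missing should be imputed as well (a different requirement from the one assumed above), a grouped := is one possible way to do it. The sketch below is only an illustration: dt, num_cols and na_rows are names chosen for this example, and the toy data is rebuilt so it runs on its own.

```r
library(data.table)

# Toy data as in the question, built directly as a data.table.
dt <- as.data.table(iris)
num_cols <- names(dt)[sapply(dt, is.numeric)]
set.seed(456)
na_rows <- sample(nrow(dt), 30)
for (col in num_cols) set(dt, i = na_rows, j = col, value = NA_real_)

# For each Species, overwrite the NAs in every numeric column with that
# column's median over the observed values. The update is by reference.
dt[, (num_cols) := lapply(.SD, function(x) {
       replace(x, is.na(x), median(x, na.rm = TRUE))
     }),
   by = Species, .SDcols = num_cols]

anyNA(dt)
```

All four columns of iris are doubles, so the medians fit without type coercion; integer columns would need extra care because a median can be fractional.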
Here's what I ended up using:
imputeMed <- function(x){
  medX <- median(x, na.rm = TRUE)
  x <- ifelse(is.na(x), medX, x)
  return(x)
}
vtu1 <- names(df1)[sapply(df1, is.numeric)]
specLev <- unique(as.character(df1$Species))
for(i in specLev){ # i = specLev[1]
  df1[df1$Species == i,vtu1] <- as.data.frame(lapply(df1[df1$Species == i,vtu1], imputeMed))
}