
Replacing NAs in a data frame with values from a different column

I would like to replace NAs in my data frame with values from another column. For example:

a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
df <- as.data.frame(cbind(a1, b1, c1, a2, b2, c2))
df
> df
  a1 b1 c1 a2 b2 c2
1  1  3 NA  2  1  3
2  2 NA  3  3  2  3
3  4  4  3  5  4  2
4 NA  4  4  5  5  3
5  2  4  2  3  6  4
6 NA  3  3  4  3  3

I would like to replace the NAs in df$a1 with the values from the corresponding rows of df$a2, the NAs in df$b1 with the values from the corresponding rows of df$b2, and the NAs in df$c1 with the values from the corresponding rows of df$c2, so that the new data frame looks like:

> df
  a1 b1 c1
1  1  3  3
2  2  2  3
3  4  4  3
4  5  4  4
5  2  4  2
6  4  3  3

How can I do this? I have a large data frame with many columns, so an efficient way to do this would be great (I've already seen "Replace missing values with a value from another column"). Thank you!

An extensible option:

df2 <- df[c('a1','b1','c1')]
df2[] <- mapply(function(x,y) ifelse(is.na(x), y, x),
                df[c('a1','b1','c1')], df[c('a2','b2','c2')],
                SIMPLIFY=FALSE)
df2
#   a1 b1 c1
# 1  1  3  3
# 2  2  2  3
# 3  4  4  3
# 4  5  4  4
# 5  2  4  2
# 6  4  3  3

It's easy enough to extend this to arbitrary column pairs: the first column of the first subset ( df[c('a1','b1','c1')] ) is paired with the first column of the second subset, the second column with the second column, and so on. It can even be generalized with df[grepl('1$',colnames(df))] and df[grepl('2$',colnames(df))] , provided both subsets contain the same number of columns in the same order so that the pairs don't mismatch.
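As a minimal sketch of that generalization (base R only), the partner column names can be derived from the target names instead of matching two separate grepl() results, which guards against mismatched pairs. The variable names targets and fallbacks are illustrative and not part of the original answer, and the sketch assumes every column ending in 1 has a partner ending in 2.

```r
# Rebuild the example data from the question.
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA), b1 = c(3, NA, 4, 4, 4, 3), c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),   b2 = c(1, 2, 4, 5, 6, 3),  c2 = c(3, 3, 2, 3, 4, 3)
)

# Columns to fill: every column whose name ends in "1".
targets <- grep("1$", colnames(df), value = TRUE)

# Derive each partner name from its target so the pairs cannot drift apart.
fallbacks <- sub("1$", "2", targets)

# Fail early if any target has no partner column.
stopifnot(all(fallbacks %in% colnames(df)))

# Pairwise replacement: keep x where present, otherwise take y.
df2 <- df[targets]
df2[] <- Map(function(x, y) ifelse(is.na(x), y, x), df[targets], df[fallbacks])
df2
```

Deriving fallbacks with sub() means a missing partner column raises an error rather than silently pairing the wrong columns.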

coalesce in dplyr is meant to do exactly this (replace the NAs in a first vector with the non-NA elements of a later one), e.g.

library(dplyr)
coalesce(df$a1, df$a2)
[1] 1 2 4 5 2 4

It can be used with sapply to process the whole dataset in an efficient and easily extensible manner (note that the result is a matrix, not a data frame):

sapply(c("a","b","c"),function(x) coalesce(df[,paste0(x,1)],df[,paste0(x,2)]))
     a b c
[1,] 1 3 3
[2,] 2 2 3
[3,] 4 4 3
[4,] 5 4 4
[5,] 2 4 2
[6,] 4 3 3
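Because sapply returns a matrix here, one possible variation is to collect the columns with lapply and convert the result back to a data frame. This is only a sketch: it assumes dplyr is installed, and the names prefixes, filled and df_new are illustrative rather than taken from the answer.

```r
# Rebuild the example data from the question.
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA), b1 = c(3, NA, 4, 4, 4, 3), c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),   b2 = c(1, 2, 4, 5, 6, 3),  c2 = c(3, 3, 2, 3, 4, 3)
)

prefixes <- c("a", "b", "c")

# One coalesced vector per prefix, kept as a list so each column keeps its type.
filled <- lapply(prefixes, function(p) {
  dplyr::coalesce(df[[paste0(p, 1)]], df[[paste0(p, 2)]])
})

# Restore the original target column names and convert to a data frame.
names(filled) <- paste0(prefixes, 1)
df_new <- as.data.frame(filled)
df_new
```

Keeping the result as a list until the final conversion also avoids the type coercion a matrix would force if the column pairs had different types.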

dfnew <- ifelse(is.na(df$a1), df$a2, df$a1)

as.data.frame(dfnew)

This handles only the a1 column; you will have to run it for each of a, b and c and cbind the results. If there are many columns, a loop is probably the best option, in my opinion.
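The loop suggested above might look like the following sketch in base R. It assumes the question's naming scheme (prefix plus 1 for the target column, prefix plus 2 for the fallback); the names df_new, target, fallback and missing are illustrative.

```r
# Rebuild the example data from the question.
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA), b1 = c(3, NA, 4, 4, 4, 3), c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),   b2 = c(1, 2, 4, 5, 6, 3),  c2 = c(3, 3, 2, 3, 4, 3)
)

# Start from the target columns and fill them in place.
df_new <- df[c("a1", "b1", "c1")]

for (p in c("a", "b", "c")) {
  target   <- paste0(p, 1)
  fallback <- paste0(p, 2)

  # Overwrite only the rows where the target is missing.
  missing <- is.na(df_new[[target]])
  df_new[[target]][missing] <- df[[fallback]][missing]
}
df_new
```

Assigning only into the missing positions avoids rebuilding each column with ifelse and removes the need for a final cbind.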

You can use hutils::coalesce. It should be slightly faster, especially when it can 'cheat': if a column has no NAs and so doesn't need to change, coalesce will skip it:

a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)

s <- function(x) {
  sample(x, size = 1e6, replace = TRUE)
}
df <- as.data.frame(cbind(a1 = s(a1), b1 = s(b1), c1 = s(c1),
                          a2 = s(a2), b2 = s(b2), c2 = s(c2)))

library(microbenchmark)
library(hutils)
library(data.table)

dt <- as.data.table(df)
old <- paste0(letters[1:3], "1") # columns to fill; you will need to specify these
new <- paste0(letters[1:3], "2") # columns supplying the replacement values

dplyr_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- dplyr::coalesce(ans[[o]], df[[n]])
  }
  ans
}

hutils_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- hutils::coalesce(ans[[o]], df[[n]])
  }
  ans
}

microbenchmark(dplyr = dplyr_coalesce(df),
               hutils = hutils_coalesce(df))
#> Unit: milliseconds
#>    expr      min       lq     mean   median       uq       max neval cld
#>   dplyr 45.78123 61.76857 95.10870 69.21561 87.84774 1452.0800   100   b
#>  hutils 36.48602 46.76336 63.46643 52.95736 64.53066  252.5608   100  a

Created on 2018-03-29 by the reprex package (v0.2.0).
