Replace NA's in set of columns by values from another set of columns using dplyr

I found some related questions that helped somewhat, but all were different in a key part, so here it goes.

I have a data frame with some NA's:

type <- LETTERS[1:5]
a_pc <- c(3, NA, NA , 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)

df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)

> df
  type a_pc b_pc a_pc_mean b_pc_mean
1    A    3   NA         4       4.5
2    B   NA    2         4       4.5
3    C   NA    7         4       4.5
4    D    4    4         4       4.5
5    E    5    5         4       4.5

I want to replace the NA's in the columns a_pc and b_pc with the values in their respective mean columns. I thought a clean way to do it was to use dplyr. My code so far is:

library(dplyr)

df2 <- df %>%
  mutate_at(.vars = vars(ends_with("_pc")),
            .funs = funs(replace(., is.na(.), ???)))

Where I put the question marks I need to reference the columns with the means, but I cannot figure out how. My understanding of dplyr is that the . refers to the columns selected by vars(ends_with("_pc")), so I tried to paste0 together . and "_mean", but that didn't work. This question came close to mine, but it asked about replacing with a fixed value, not a value from another column.

My actual dataset has more than two columns in which I want to replace NA's, so I'd prefer not to reference them explicitly.

EDIT

My original question above didn't fully illustrate what I want to do, so to clarify, here is a sample of my data:

 > crime_pop
   subregion                 iso    year assault kidnapping      pop assault_pc kidnapping_pc
   <fct>                     <chr> <dbl>   <dbl>      <dbl>    <dbl>      <dbl>         <dbl>
 1 Caribbean                 ABW    2008      NA         NA   101353 NA           NA         
 2 Southern Asia             AFG    2008      NA         NA 27294031 NA           NA         
 3 Middle Africa             AGO    2008      NA         NA 21759420 NA           NA         
 4 Southern Europe           ALB    2008     363         10  2947314  0.000123     0.00000339
 5 Southern Europe           AND    2008     105          0    83861  0.00125      0         
 6 Western Asia              ARE    2008     631        672  6894278  0.0000915    0.0000975 
 7 South America             ARG    2008  145240         NA 40382389  0.00360     NA         
 8 Western Asia              ARM    2008     201         27  2908220  0.0000691    0.00000928
 9 Caribbean                 ATG    2008      NA         NA    92478 NA           NA         
10 Australia and New Zealand AUS    2008   68019        611 21249200  0.00320      0.0000288 

My idea was to interpolate the NA's in assault and kidnapping (and the other variables in the actual dataset) by calculating the per capita crime rates of the countries without missing data, taking the sub-region averages of these and applying these to the countries with the missing data.

To calculate the per capita crime rates I used:

crime_pop <- crime_pop %>%
  mutate_at(.vars = vars(assault:kidnapping),
            .funs = funs(pc = . / pop))

The sub-region means can then be calculated using @Psidom's answer:

crime_pop2 <- crime_pop %>%
  group_by(year, subregion) %>%
  mutate_at(vars(ends_with("_pc")),
            funs(replace(., is.na(.), mean(., na.rm = TRUE))))

Now the NA's in assault and kidnapping need to be replaced by the product of pop and assault_pc, and of pop and kidnapping_pc, respectively, which brings me back to my original question of referencing other columns in the replace function when it is used in mutate_at. Maybe there is an easier way to do all this in one go; I'm open to suggestions. Thanks!

Simply use mean(., na.rm=TRUE) as the replacement:

df %>% mutate_at(vars(ends_with('_pc')), funs(replace(., is.na(.), mean(., na.rm=TRUE))))

#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5

Or you can use coalesce, which does the same thing, i.e. wherever a value in . is NA, it is replaced with the mean:

df %>% mutate_at(vars(ends_with('_pc')), funs(coalesce(., mean(., na.rm=TRUE))))

#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5
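
Both calls above recompute each column's mean instead of reading it from a_pc_mean and b_pc_mean. If the replacement really has to come from the paired column, as the question is phrased, one possible sketch is to pair the columns by name and apply coalesce to each pair. It assumes every *_pc column has a matching *_pc_mean column of the same type; pc_cols, mean_cols and df_filled are names made up for this illustration.

```r
library(dplyr)

# Rebuild the example data from the question
type <- LETTERS[1:5]
a_pc <- c(3, NA, NA, 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
df <- data.frame(
  type, a_pc, b_pc,
  a_pc_mean = mean(a_pc, na.rm = TRUE),  # scalar is recycled to 5 rows
  b_pc_mean = mean(b_pc, na.rm = TRUE)
)

# Columns to fill, and the columns holding their replacement values
# (assumption: the replacement column is always named "<column>_mean")
pc_cols   <- grep("_pc$", names(df), value = TRUE)
mean_cols <- paste0(pc_cols, "_mean")
stopifnot(all(mean_cols %in% names(df)))

# coalesce(x, y) keeps x where it is not NA and takes y otherwise;
# Map() applies it to each (column, mean column) pair in turn
df_filled <- df
df_filled[pc_cols] <- Map(coalesce, df[pc_cols], df[mean_cols])
df_filled
```

Because the pairing is built from the column names, this scales to any number of *_pc columns without listing them explicitly.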

Here's a solution that uses 'dplyr::select' to extract the named variables and pass them to 'impute' from the 'Hmisc' package.

bar <- df %>%
  dplyr::select(ends_with('_pc')) %>%
  sapply(Hmisc::impute, fun = mean)
df[, colnames(bar)] <- bar
df
#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5
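
For the EDIT in the question (filling the missing counts with pop times the sub-region's mean per capita rate), the intermediate *_pc columns may not be needed at all. The following is only a sketch under some assumptions: it requires dplyr 1.0.0 or later for across(), and crime_toy with its values is invented for illustration and is not the asker's data.

```r
library(dplyr)

# Hypothetical toy data with the same column layout as crime_pop
crime_toy <- data.frame(
  subregion  = c("X", "X", "X", "Y", "Y"),
  iso        = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  year       = 2008,
  assault    = c(100, 300, NA, 50, NA),
  kidnapping = c(10, NA, NA, 5, 20),
  pop        = c(1000, 1000, 2000, 500, 1000)
)

crime_filled <- crime_toy %>%
  group_by(year, subregion) %>%
  mutate(across(
    c(assault, kidnapping),
    # .x / pop is the per capita rate; its group mean times each
    # country's pop is the estimated count used where .x is NA
    ~ coalesce(.x, mean(.x / pop, na.rm = TRUE) * pop)
  )) %>%
  ungroup()

crime_filled
```

In this toy data, CCC's missing assault count becomes the mean rate of sub-region X (0.2) times its population of 2000. A sub-region in which every country lacks a given crime figure has no rate to average, so those entries stay missing.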
