Replace NAs in a set of columns with values from another set of columns using dplyr

I found some related questions that helped somewhat, but each differed from mine in a key part, so here goes.

I have a data frame with some NAs:

type <- LETTERS[1:5]
a_pc <- c(3, NA, NA , 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)

df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)

> df
  type a_pc b_pc a_pc_mean b_pc_mean
1    A    3   NA         4       4.5
2    B   NA    2         4       4.5
3    C   NA    7         4       4.5
4    D    4    4         4       4.5
5    E    5    5         4       4.5

I want to replace the NAs in the columns a_pc and b_pc with the values in their respective mean columns. I thought a clean way to do it was to use dplyr. My code so far is:

library(dplyr)

df2 <- df %>%
  mutate_at(.vars = vars(ends_with("_pc")),
            .funs = funs(replace(., is.na(.), ???)))

Where I put the question marks I need to reference the columns with the means, but I cannot figure out how. My understanding of dplyr is that the . references the columns in vars(ends_with("_pc")), so I tried to paste0 together . and "_mean", but that didn't work. This question came close to mine, but it asked to replace NAs with a fixed value, not a value from another column.

My actual dataset has more than two columns in which I want to replace NAs, so I'd prefer not to reference them explicitly.

EDIT

My original question above didn't illustrate what I wanted to do, so to clarify, here is a sample of my data:

 > crime_pop
   subregion                 iso    year assault kidnapping      pop assault_pc kidnapping_pc
   <fct>                     <chr> <dbl>   <dbl>      <dbl>    <dbl>      <dbl>         <dbl>
 1 Caribbean                 ABW    2008      NA         NA   101353 NA           NA         
 2 Southern Asia             AFG    2008      NA         NA 27294031 NA           NA         
 3 Middle Africa             AGO    2008      NA         NA 21759420 NA           NA         
 4 Southern Europe           ALB    2008     363         10  2947314  0.000123     0.00000339
 5 Southern Europe           AND    2008     105          0    83861  0.00125      0         
 6 Western Asia              ARE    2008     631        672  6894278  0.0000915    0.0000975 
 7 South America             ARG    2008  145240         NA 40382389  0.00360     NA         
 8 Western Asia              ARM    2008     201         27  2908220  0.0000691    0.00000928
 9 Caribbean                 ATG    2008      NA         NA    92478 NA           NA         
10 Australia and New Zealand AUS    2008   68019        611 21249200  0.00320      0.0000288 

My idea was to impute the NAs in assault and kidnapping (and the other variables in the actual dataset) by calculating the per capita crime rates of the countries without missing data, taking the sub-region averages of these rates, and applying those averages to the countries with missing data.

To calculate the per capita crime rates I used:

crime_pop <- crime_pop %>%
  mutate_at(.vars = vars(assault:kidnapping),
            .funs = funs(pc = . / pop))

The sub-region means can then be calculated using @Psidom's answer:

crime_pop2 <- crime_pop %>%
  group_by(year, subregion) %>%
  mutate_at(vars(ends_with("_pc")),
            funs(replace(., is.na(.), mean(., na.rm = TRUE))))

Now the NAs in assault and kidnapping need to be replaced by the product of pop and assault_pc, and of pop and kidnapping_pc, respectively, which brings me back to my original question of how to reference other columns in the replace function when it is used in mutate_at. Maybe there is an easier way to do all this in one go; I'm open to suggestions. Thanks!

Simply use mean(., na.rm=TRUE) as the replacement:

df %>% mutate_at(vars(ends_with('_pc')), funs(replace(., is.na(.), mean(., na.rm=TRUE))))

#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5
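
If the replacement has to come from the existing a_pc_mean and b_pc_mean columns rather than being recomputed, the matching column can be looked up by name. A minimal sketch, assuming dplyr 1.0.0 or later (for across() and cur_column()); the name df_filled is only illustrative:

```r
library(dplyr)

# Rebuild the example data frame from the question.
df <- data.frame(
  type = LETTERS[1:5],
  a_pc = c(3, NA, NA, 4, 5),
  b_pc = c(NA, 2, 7, 4, 5)
)
df$a_pc_mean <- mean(df$a_pc, na.rm = TRUE)
df$b_pc_mean <- mean(df$b_pc, na.rm = TRUE)

# For each *_pc column, fill NAs from the column named "<column>_mean".
# cur_column() gives the name of the column across() is currently processing,
# and .data[[...]] looks up another column of the data frame by that name.
df_filled <- df %>%
  mutate(across(
    ends_with("_pc"),
    ~ coalesce(.x, .data[[paste0(cur_column(), "_mean")]])
  ))

df_filled
```

Because ends_with("_pc") does not match a_pc_mean or b_pc_mean, the mean columns themselves are left untouched.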

Or you can use coalesce, which does the same thing, i.e. if a value in . is NA, it is replaced with the mean:

df %>% mutate_at(vars(ends_with('_pc')), funs(coalesce(., mean(., na.rm=TRUE))))

#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5
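
Note that funs() has been deprecated since dplyr 0.8.0 and mutate_at() is superseded by across() as of dplyr 1.0.0. On a current dplyr, the mutate_at() calls above could be written roughly as follows. This is a sketch assuming dplyr 1.0.0 or later; df_replace and df_coalesce are illustrative names:

```r
library(dplyr)

# Example data from the question (the *_mean columns are not needed here).
df <- data.frame(
  type = LETTERS[1:5],
  a_pc = c(3, NA, NA, 4, 5),
  b_pc = c(NA, 2, 7, 4, 5)
)

# replace() version: overwrite the NA positions with the column mean.
df_replace <- df %>%
  mutate(across(ends_with("_pc"),
                ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))

# coalesce() version: keep .x where present, otherwise use the column mean.
df_coalesce <- df %>%
  mutate(across(ends_with("_pc"),
                ~ coalesce(.x, mean(.x, na.rm = TRUE))))
```

Both versions should give the same filled columns; the lambda argument .x plays the role that . played inside funs().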

Here's a solution that uses dplyr::select to extract the matching variables and passes them to impute from the Hmisc package.

bar <- df %>%
  dplyr::select(ends_with('_pc')) %>%
  sapply(., Hmisc::impute, fun = mean)
df[, colnames(bar)] <- bar
df
#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5
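
For the per capita case described in the EDIT, the rate, the sub-region mean, and the fill-in can be combined in one grouped mutate(), so the _pc columns never need to be referenced by name. A sketch assuming dplyr 1.0.0 or later; the small crime_pop data frame below is made-up illustrative data (not the asker's dataset), and crime_filled is an illustrative name:

```r
library(dplyr)

# Made-up illustrative data using the same column names as the question.
crime_pop <- data.frame(
  subregion  = c("Region X", "Region X", "Region X", "Region Y", "Region Y"),
  iso        = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  year       = 2008,
  assault    = c(100, 300, NA, 50, NA),
  kidnapping = c(10, NA, NA, NA, NA),
  pop        = c(1000, 1000, 2000, 500, 1000)
)

crime_filled <- crime_pop %>%
  group_by(year, subregion) %>%
  mutate(across(
    c(assault, kidnapping),
    # .x / pop is the per capita rate; its group mean times pop is the estimate
    # used wherever the original count is NA.
    ~ coalesce(.x, pop * mean(.x / pop, na.rm = TRUE))
  )) %>%
  ungroup()

crime_filled
```

A sub-region with no observed values for a variable (kidnapping in "Region Y" here) has no rate to average, so its entries stay missing and would need a different fallback.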
