Replace NA's in a set of columns by values from another set of columns using dplyr
I found some related questions that helped somewhat, but each differed from mine in a key respect, so here goes.
I have a data frame with some NA's:
type <- LETTERS[1:5]
a_pc <- c(3, NA, NA , 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)
df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)
> df
type a_pc b_pc a_pc_mean b_pc_mean
1 A 3 NA 4 4.5
2 B NA 2 4 4.5
3 C NA 7 4 4.5
4 D 4 4 4 4.5
5 E 5 5 4 4.5
I want to replace the NA's in the columns a_pc and b_pc with the values in their respective mean columns. I thought a clean way to do it was to use dplyr. My code so far is:
library(dplyr)
df2 <- df %>%
  mutate_at(.vars = vars(ends_with("_pc")),
            .funs = funs(replace(., is.na(.), ???)))
Where I put the question marks I need to reference the columns with the means, but I cannot figure out how. My understanding of dplyr is that the . references the columns in vars(ends_with("_pc")), so I tried to paste0 together . and "_mean", but that didn't work. This question came close to mine, but it asked to replace by a fixed value, not a value from another column.
My actual dataset has more than two columns in which I want to replace NA's, so I'd prefer not to reference them explicitly.
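A minimal sketch of that column lookup, assuming dplyr 1.0.0 or later (where across() and cur_column() are available; the mutate_at()/funs() code above predates them). Inside across(), cur_column() returns the name of the column currently being processed, so paste0(cur_column(), "_mean") names the matching mean column and get() fetches it from the data. This relies on the "_pc" / "_pc_mean" naming pattern from the example holding for every column.

```r
library(dplyr)

# Example data from the question
type <- LETTERS[1:5]
a_pc <- c(3, NA, NA, 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)
df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)

# For every column ending in "_pc", fill NA's from the column
# named "<column>_mean" (assumes that column exists)
df2 <- df %>%
  mutate(across(ends_with("_pc"),
                ~ coalesce(.x, get(paste0(cur_column(), "_mean")))))

df2
```

No column is named explicitly, so the same call should carry over to a data frame with more "_pc" columns, provided each has a "_mean" counterpart.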
EDIT
My original question above didn't illustrate what I wanted to do, so to clarify, here is a sample of my data:
> crime_pop
subregion iso year assault kidnapping pop assault_pc kidnapping_pc
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Caribbean ABW 2008 NA NA 101353 NA NA
2 Southern Asia AFG 2008 NA NA 27294031 NA NA
3 Middle Africa AGO 2008 NA NA 21759420 NA NA
4 Southern Europe ALB 2008 363 10 2947314 0.000123 0.00000339
5 Southern Europe AND 2008 105 0 83861 0.00125 0
6 Western Asia ARE 2008 631 672 6894278 0.0000915 0.0000975
7 South America ARG 2008 145240 NA 40382389 0.00360 NA
8 Western Asia ARM 2008 201 27 2908220 0.0000691 0.00000928
9 Caribbean ATG 2008 NA NA 92478 NA NA
10 Australia and New Zealand AUS 2008 68019 611 21249200 0.00320 0.0000288
My idea was to fill in the NA's in assault and kidnapping (and the other variables in the actual dataset) by calculating the per capita crime rates of the countries without missing data, taking the sub-region averages of these rates, and applying those averages to the countries with missing data.
To calculate the per capita crime rates I used:
crime_pop <- crime_pop %>%
mutate_at(.vars = vars(assault:kidnapping),
.funs = funs(pc = . / pop))
The sub-region means can then be calculated using @Psidom's answer:
crime_pop2 <- crime_pop %>%
group_by(year, subregion) %>%
mutate_at(vars(ends_with("_pc")),
funs(replace(., is.na(.), mean(., na.rm = TRUE))))
Now the NA's in assault and kidnapping need to be replaced by the product of pop and assault_pc, and pop and kidnapping_pc respectively, which brings me back to my original question of referencing other columns in the replace function when used in mutate_at. Maybe there is an easier way to do all this in one go; I'm open to suggestions. Thanks!
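One possible way to do the whole thing in a single step, sketched under some assumptions: dplyr 1.0.0 or later, and a small toy data frame with made-up values (subregions "X" and "Y" are hypothetical, not rows from the sample above). Within each year and sub-region, every NA count is replaced by the country's population times the mean per capita rate of the countries that do have data, so the intermediate "_pc" columns are never needed.

```r
library(dplyr)

# Toy data with made-up values, for illustration only
crime_pop <- data.frame(
  subregion  = c("X", "X", "Y", "Y"),
  year       = 2008,
  pop        = c(1000, 2000, 500, 1500),
  assault    = c(10, NA, 5, NA),
  kidnapping = c(NA, 4, 1, 3)
)

crime_filled <- crime_pop %>%
  group_by(year, subregion) %>%
  # .x / pop is the per capita rate; its group mean (ignoring NA's)
  # times pop gives the imputed count for rows where .x is NA
  mutate(across(assault:kidnapping,
                ~ coalesce(.x, pop * mean(.x / pop, na.rm = TRUE)))) %>%
  ungroup()

crime_filled
```

If a sub-region has no observed values at all for a variable, the group mean is NaN, so those rows would remain unfilled and need separate handling.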
Simply use mean(., na.rm=TRUE) as the replacement:
df %>% mutate_at(vars(ends_with('_pc')), funs(replace(., is.na(.), mean(., na.rm=TRUE))))
# type a_pc b_pc a_pc_mean b_pc_mean
#1 A 3 4.5 4 4.5
#2 B 4 2.0 4 4.5
#3 C 4 7.0 4 4.5
#4 D 4 4.0 4 4.5
#5 E 5 5.0 4 4.5
Or you can use coalesce, which does the same thing, i.e. if a value in . is NA, replace it with the mean:
df %>% mutate_at(vars(ends_with('_pc')), funs(coalesce(., mean(., na.rm=TRUE))))
# type a_pc b_pc a_pc_mean b_pc_mean
#1 A 3 4.5 4 4.5
#2 B 4 2.0 4 4.5
#3 C 4 7.0 4 4.5
#4 D 4 4.0 4 4.5
#5 E 5 5.0 4 4.5
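Note that funs() has been deprecated since dplyr 0.8.0 and mutate_at() was superseded by across() in dplyr 1.0.0. Assuming a recent dplyr version, the same replacement could be sketched as follows (the result name df_filled is just an illustrative choice):

```r
library(dplyr)

# Example data from the question
type <- LETTERS[1:5]
a_pc <- c(3, NA, NA, 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)
df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)

# Replace NA's in each "_pc" column with that column's own mean
df_filled <- df %>%
  mutate(across(ends_with("_pc"),
                ~ coalesce(.x, mean(.x, na.rm = TRUE))))

df_filled
```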
Here's a solution that uses 'dplyr::select' to extract the named variables and pass them to 'impute' from the 'Hmisc' package.
bar <- df %>% dplyr::select(ends_with('_pc')) %>%
sapply(., Hmisc::impute,fun= mean)
df[, colnames(bar)] <- bar
df
# type a_pc b_pc a_pc_mean b_pc_mean
#1 A 3 4.5 4 4.5
#2 B 4 2.0 4 4.5
#3 C 4 7.0 4 4.5
#4 D 4 4.0 4 4.5
#5 E 5 5.0 4 4.5
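If installing Hmisc is not an option, roughly the same column-wise mean imputation can be sketched in base R. This is an alternative to the answer above, not what Hmisc::impute does internally; pc_cols is an illustrative helper name.

```r
# Example data from the question
type <- LETTERS[1:5]
a_pc <- c(3, NA, NA, 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)
df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)

# Names of the columns ending in "_pc"
pc_cols <- grep("_pc$", names(df), value = TRUE)

# Replace NA's in each selected column with that column's mean
df[pc_cols] <- lapply(df[pc_cols], function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
})

df
```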