dplyr: how to ignore NA in grouping variable
Using dplyr, I'm trying to group by two variables. If one variable is NA but the other variable matches, I'd still like those rows to be grouped together, with the NA taking on the non-NA value. So if I have a data frame like this:
variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)
If I group by variable_A and variable_B, rows 1 and 4 normally wouldn't be grouped together, but I'd like them to be, with the NA overridden to "a". How can I achieve this? The code below doesn't do the job.
library(dplyr)
df2 <- df %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C))
You can group by B first, and then fill in the missing A values. Then proceed with what you wanted to do:
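df_filled <- df %>%
  group_by(variable_B) %>%
  # Take the first non-missing variable_A within each variable_B group.
  # coalesce() replaces only the NA entries, so existing values such as "f" are kept.
  mutate(variable_A = coalesce(variable_A, variable_A[!is.na(variable_A)][1]))
df_filled %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C))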
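A possible alternative, sketched here under the assumption that tidyr (1.0.0 or later) and dplyr (1.0.0 or later, for the `.groups` argument) are installed: `fill()` applied to a grouped data frame only touches NA entries, so non-missing values such as "f" are left alone. Note that the value used for filling depends on row order within each variable_B group; `.direction = "downup"` looks at the previous row first and falls back to the next one.

```r
library(dplyr)
library(tidyr)

# Same data as in the question; stringsAsFactors = FALSE keeps the columns
# as character vectors on R versions before 4.0.
df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f"),
  variable_B = c("c", "d", "e", "c", "c"),
  variable_C = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(variable_B) %>%
  # Fill NA in variable_A from neighbouring rows of the same variable_B group
  fill(variable_A, .direction = "downup") %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C), .groups = "drop")

print(result)
```

With the example data, row 4 picks up "a" from row 1, so the ("a", "c") group sums to 50 while ("f", "c") stays a separate group.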
You could do the missing value imputation using base R as follows:
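ii <- which(is.na(df$variable_A))  # rows where variable_A is missing
df_filled <- df
for (i in ii) {
  # non-missing variable_A values from rows that share this row's variable_B
  jj <- which(df$variable_B == df$variable_B[i] & !is.na(df$variable_A))
  if (length(jj) > 0) df_filled$variable_A[i] <- df$variable_A[jj[1]]
}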
Then group and summarize as planned with dplyr:
df_filled %>%
  group_by(variable_A, variable_B) %>%
  dplyr::summarise(total = sum(variable_C))
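The base R imputation can also be written without explicit row indices. The following is only a sketch: `fill_first_non_na` is a hypothetical helper name, and it assumes variable_A is a character vector. `ave()` applies the helper within each variable_B group, and `aggregate()` then does the grouping in base R as well.

```r
# Same data as in the question, kept as character columns.
df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f"),
  variable_B = c("c", "d", "e", "c", "c"),
  variable_C = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA entries with the first non-missing value.
# If every value is NA, the vector is returned unchanged.
fill_first_non_na <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
}

df_filled <- df
# ave() runs the helper separately for each level of variable_B
df_filled$variable_A <- ave(df$variable_A, df$variable_B, FUN = fill_first_non_na)

# Base R equivalent of group_by() + summarise(); the sum is stored in variable_C
totals <- aggregate(variable_C ~ variable_A + variable_B, data = df_filled, FUN = sum)
print(totals)
```

Only the NA in row 4 changes (to "a"); the "f" in row 5 is kept, so ("a", "c") and ("f", "c") remain separate groups.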