dplyr: how to ignore NA in grouping variable

Question

Using dplyr, I'm trying to group by two variables. If there is an NA in one variable but the other variable matches, I'd still like those rows to be grouped together, with the NA taking on the non-NA value. Suppose I have a data frame like this:

variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)

If I group by variable_A and variable_B, rows 1 and 4 normally wouldn't end up in the same group, but I'd like them to, with the NA overridden to "a". How can I achieve this? The code below doesn't do the job.

library(dplyr)

df2 <- df %>%
    group_by(variable_A, variable_B) %>%
    summarise(total = sum(variable_C))
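
To see why the attempt above falls short, here is a small sketch on the same data. dplyr treats NA as a grouping value of its own, so row 4 lands in a separate (NA, "c") group instead of joining row 1. The name `naive` and the `.groups = "drop"` argument are additions for illustration; `.groups` assumes dplyr 1.0 or later.

```r
library(dplyr)

df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f"),
  variable_B = c("c", "d", "e", "c", "c"),
  variable_C = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Group on both columns without any NA handling.
naive <- df %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C), .groups = "drop")

# Row 4 sits alone in a group whose variable_A is NA.
naive[is.na(naive$variable_A), ]
```

The summary therefore has five groups rather than the four the question is after.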

Answer 1

You can group by B first, and then fill in the missing A values. Then proceed with what you wanted to do:

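df_filled <- df %>%
    group_by(variable_B) %>%
    # coalesce() fills only the NAs, so other values in the group (e.g. "f") are kept
    mutate(variable_A = coalesce(variable_A, first(variable_A[!is.na(variable_A)])))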

df_filled %>%
    group_by(variable_A, variable_B) %>%
    summarise(total=sum(variable_C))
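
A minimal sketch of how a fill like this behaves at the edges, assuming the goal is to touch only the missing entries and to borrow the first non-missing variable_A within each variable_B group. The sixth row (variable_B = "z") is a made-up case added to show a group with nothing to borrow from, and the names `filled` and `totals` are hypothetical.

```r
library(dplyr)

df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f", NA),
  variable_B = c("c", "d", "e", "c", "c", "z"),  # the "z" row is an invented extra case
  variable_C = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

filled <- df %>%
  group_by(variable_B) %>%
  # coalesce() keeps existing values and replaces only the NAs
  mutate(variable_A = coalesce(variable_A, first(variable_A[!is.na(variable_A)]))) %>%
  ungroup()

totals <- filled %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C), .groups = "drop")

totals
```

In this sketch row 4 joins row 1, the ("f", "c") row keeps its own group, and the "z" row stays NA because its group offers no non-missing value.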

Answer 2

You could do the missing value imputation using base R as follows:

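 df_filled <- df
 # For each row with a missing variable_A, borrow the first non-missing
 # variable_A among the rows that share its variable_B.
 for (i in which(is.na(df$variable_A))) {
   donors <- df$variable_A[which(df$variable_B == df$variable_B[i] & !is.na(df$variable_A))]
   if (length(donors) > 0) df_filled$variable_A[i] <- donors[1]
 }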

Then group and summarise as planned with dplyr:

 df_filled %>%
   group_by(variable_A, variable_B) %>%
   dplyr::summarise(total = sum(variable_C))
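
For comparison, the whole task can also be sketched in base R alone. This is one possible variant, not the answer's own method: it swaps in `ave()` for the fill and `aggregate()` for the summary, and the helper name `fill_first` is hypothetical.

```r
df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f"),
  variable_B = c("c", "d", "e", "c", "c"),
  variable_C = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Replace missing entries with the first non-missing one;
# non-missing entries are left untouched.
fill_first <- function(a) {
  a[is.na(a)] <- a[!is.na(a)][1]
  a
}

# Apply the fill separately within each variable_B group.
df_filled <- df
df_filled$variable_A <- ave(df$variable_A, df$variable_B, FUN = fill_first)

# Sum variable_C per (variable_A, variable_B) pair.
# Note: the formula interface drops rows that still contain NA.
totals <- aggregate(variable_C ~ variable_A + variable_B,
                    data = df_filled, FUN = sum)
totals
```

Because the fill is applied per group, this version handles any number of missing rows without an explicit loop.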

2 answers

solution1
3 2018-06-29 01:27:17

solution2
0 2018-06-29 02:34:42
