How to summarize across multiple columns with condition on another (grouped) column with dplyr?

Question

I need to summarize a data.frame across multiple columns in a generic way:

the first summarize operation is straightforward, e.g. a simple median;
the second summarize operation includes a condition on another column, e.g. taking the value at the row where another column is at its minimum (by group):

library(dplyr)

set.seed(4)

myDF = data.frame(i = rep(1:3, each=3),
                  j = rnorm(9),
                  a = sample.int(9),
                  b = sample.int(9),
                  c = sample.int(9),
                  d = 'foo')
#   i          j a b c   d
# 1 1  0.2167549 4 5 5 foo
# 2 1 -0.5424926 7 7 4 foo
# 3 1  0.8911446 3 9 1 foo
# 4 2  0.5959806 8 6 8 foo
# 5 2  1.6356180 6 8 3 foo
# 6 2  0.6892754 1 4 6 foo
# 7 3 -1.2812466 9 1 7 foo
# 8 3 -0.2131445 5 2 2 foo
# 9 3  1.8965399 2 3 9 foo

myDF %>% group_by(i) %>% summarize(across(where(is.numeric), median, .names="med_{col}"),
                                   best_a = a[[which.min(j)]],
                                   best_b = b[[which.min(j)]],
                                   best_c = c[[which.min(j)]])
# # A tibble: 3 x 8
#      i   med_j med_a med_b med_c best_a best_b best_c
# * <int>   <dbl> <int> <int> <int>  <int>  <int>  <int>
# 1     1  0.217     4     7     4      7      7      4
# 2     2  0.689     6     6     6      8      6      8
# 3     3 -0.213     5     2     7      9      1      7

How can I define this second summarize operation in a generic way (i.e., without writing out each column manually as done above)?

Hence I would need something like this (which obviously does not work as j is not recognized):

myfns = list(med = ~median(.),
             best = ~.[[which.min(j)]])
myDF %>% group_by(i) %>% summarize(across(where(is.numeric), myfns, .names="{fn}_{col}"))
# Error: Problem with `summarise()` input `..1`.
# x object 'j' not found
# ℹ Input `..1` is `across(where(is.numeric), myfns, .names = "{fn}_{col}")`.
# ℹ The error occurred in group 1: i = 1.

Answer 1

Use a second across to get, for columns a:c, the value in the row where j is at its minimum within each group.

library(dplyr)

myDF %>% 
  group_by(i) %>% 
  summarize(across(where(is.numeric), median, .names = "med_{col}"),
            across(a:c, ~ .[which.min(j)], .names = "best_{col}"))

#      i  med_j med_a med_b med_c best_a best_b best_c
#* <int>  <dbl> <int> <int> <int>  <int>  <int>  <int>
#1     1  0.217     4     7     4      7      7      4
#2     2  0.689     6     6     6      8      6      8
#3     3 -0.213     5     2     7      9      1      7
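
To see what the lambda ~.[which.min(j)] computes for a single group, here is a minimal base R sketch for group i = 1. The values are copied by hand from the table printed in the question; the names idx and best_a are only illustrative.

```r
# Group i = 1, values copied from the table printed in the question
j <- c(0.2167549, -0.5424926, 0.8911446)
a <- c(4L, 7L, 3L)

idx <- which.min(j)   # position of the smallest j within the group
best_a <- a[idx]      # value of a at that same position

print(idx)     # 2
print(best_a)  # 7
```

This matches best_a = 7 in the first row of the output above. For an atomic vector and a single index, [ (used here) and [[ (used in the question) return the same value.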

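To do it in a single across call (note that where(is.numeric) also matches j, so this additionally returns best_j, the group minimum of j):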

myDF %>% 
  group_by(i) %>% 
  summarize(across(where(is.numeric),
                   list(med = median, best = ~ .[which.min(j)]),
                   .names = "{fn}_{col}"))
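
A likely reason the inline lambda works while the myfns list from the question fails is scoping: an R formula remembers the environment in which it was created. The sketch below illustrates that mechanism in base R only. It is an assumption that this is exactly how dplyr resolves j, and make_lambda and j_group are hypothetical names used for illustration.

```r
# A formula created inside a function keeps that function's environment,
# so symbols defined there can be resolved later.
make_lambda <- function() {
  j_group <- c(3, 1, 2)      # hypothetical stand-in for one group's j values
  ~ which.min(j_group)
}
inside <- make_lambda()
found <- eval(inside[[2]], environment(inside))
print(found)  # 2

# A formula created at top level keeps the top-level environment,
# where j_group does not exist, so evaluating it fails.
outside <- ~ which.min(j_group)
msg <- tryCatch(eval(outside[[2]], environment(outside)),
                error = function(e) conditionMessage(e))
print(msg)
```

By analogy, a formula written directly inside summarize() is presumably created where the data columns are visible, whereas a list built beforehand at top level is not.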

solution1
1 ACCEPTED 2021-02-09 09:43:11

How to summarize across multiple columns with condition on another (grouped) column with dplyr?

Question

1 answers

solution1 1 ACCPTED 2021-02-09 09:43:11

solution1
1 ACCPTED 2021-02-09 09:43:11