Summarise using multiple functions with dplyr across()

Question

I have data where an id variable should identify a unique observation. However, some ids are repeated. I want to get an idea of which measurements are driving this repetition by grouping by id and then calculating the proportion of inconsistent responses for each variable.

Below is an example of what I mean:

library(tidyverse)

df <- tibble(id = c(1,1,2,3,4,4,4),
             col1 = c('a','a','b','b','c','c','c'), # perfectly consistent
             col2 = c('a','b','b','b','c','c','c'), # id 1 is inconsistent - proportion inconsistent = 0.25
             col3 = c('a','a','b','b','a','b','c'), # id 4 is inconsistent - proportion inconsistent = 0.25
             col4 = c('a','b','b','b','b','b','c') # id 1 and 4 are inconsistent - proportion inconsistent = 0.5
             )

I can test for inconsistent responses within ids by using group_by(), across(), and n_distinct(), as shown below:

# count the number of distinct responses for each id in each column
# if the value is equal to 1, it means that all responses were consistent
df <- df %>% 
  group_by(id) %>% 
  mutate(across(.cols = c(col1:col4), ~n_distinct(.), .names = '{.col}_distinct')) %>% 
  ungroup()

For simplicity I can now take one row for each id:

# take one row for each id (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))

Now I would like to calculate the proportion of ids that contained an inconsistent response for each variable. I would like to do something like the following:

consistency <- df %>% 
  summarise(across(contains('distinct'), ~sum(.>1) / n(.)))

But this gives the following error, which I am having trouble interpreting:

Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.

I can get the answer I want by doing the following:

# calculate consistency for each column: count the ids with more than one
# distinct value, then divide by the total number of rows (one row per id)
# first count the ids with more than one distinct value
n_inconsistent <- df %>% 
  summarise(across(.cols = contains('distinct'), ~sum(.>1)))

# next get the number of rows
n_total <- nrow(df)

# calculate the proportion of ids that have more than one value for each column
consistency <- n_inconsistent %>% 
  mutate(across(contains('distinct'), ~./n_total))

But this involves intermediate variables and feels inelegant.

Answer 1

You can do it as follows:

library(dplyr)

df %>%
  group_by(id) %>%
  summarise(across(starts_with('col'), n_distinct)) %>%
  summarise(across(starts_with('col'), ~mean(. > 1), .names = '{.col}_distinct'))

#  col1_distinct col2_distinct col3_distinct col4_distinct
#          <dbl>         <dbl>         <dbl>         <dbl>
#1             0          0.25          0.25           0.5

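First we count the number of unique values in each column per id. Then, for each column, we calculate the proportion of ids whose count is above 1. Taking the mean of the logical vector (. > 1) gives that proportion directly, so no separate row count is needed.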
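As for the error in the question: n() takes no arguments, so calling n(.) produces "unused argument (.)". A minimal sketch of the original sum / n idea with that call corrected is shown below. It assumes the same example data as the question and a dplyr version that provides across() (1.0.0 or later); the name consistency is taken from the question.

```r
library(dplyr)

# Same example data as in the question
df <- tibble(id   = c(1, 1, 2, 3, 4, 4, 4),
             col1 = c('a', 'a', 'b', 'b', 'c', 'c', 'c'),
             col2 = c('a', 'b', 'b', 'b', 'c', 'c', 'c'),
             col3 = c('a', 'a', 'b', 'b', 'a', 'b', 'c'),
             col4 = c('a', 'b', 'b', 'b', 'b', 'b', 'c'))

consistency <- df %>%
  group_by(id) %>%
  # one row per id: number of distinct values in each column
  summarise(across(col1:col4, n_distinct)) %>%
  # n() is called without arguments; it returns the number of rows
  # in the current group (here, the whole one-row-per-id table)
  summarise(across(col1:col4, ~ sum(. > 1) / n(), .names = '{.col}_distinct'))

consistency
```

This should give the same proportions as the mean(. > 1) version, since the mean of a logical vector is the count of TRUE values divided by its length.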

solution1
2 ACCEPTED 2020-11-06 03:51:00
