
Summarise using multiple functions with dplyr across()

I have data where an id variable should identify a unique observation. However, some ids are repeated. I want to get an idea of which measurements are driving this repetition by grouping by id and then calculating the proportion of inconsistent responses for each variable.

Below is an example of what I mean:

library(tidyverse)

df <- tibble(id = c(1,1,2,3,4,4,4),
             col1 = c('a','a','b','b','c','c','c'), # perfectly consistent
             col2 = c('a','b','b','b','c','c','c'), # id 1 is inconsistent - proportion inconsistent = 0.25
             col3 = c('a','a','b','b','a','b','c'), # id 4 is inconsistent - proportion inconsistent = 0.25
             col4 = c('a','b','b','b','b','b','c') # id 1 and 4 are inconsistent - proportion inconsistent = 0.5
             )

I can test for inconsistent responses within ids by using group_by(), across(), and n_distinct() as per the below:

# count the number of distinct responses for each id in each column
# if the value is equal to 1, it means that all responses were consistent
df <- df %>% 
  group_by(id) %>% 
  mutate(across(.cols = c(col1:col4), ~n_distinct(.), .names = '{.col}_distinct')) %>% 
  ungroup()

For simplicity I can now take one row for each id:

# take one row for each id (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))

Now I would like to calculate the proportion of ids that contained an inconsistent response for each variable. I would like to do something like the following:

consistency <- df %>% 
  summarise(across(contains('distinct'), ~sum(.>1) / n(.)))

But this gives the following error, which I am having trouble interpreting:

Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.

I can get the answer I want by doing the following:

# calculate the proportion of inconsistent ids for each column: count the ids with
# more than one distinct value and divide by the total number of rows (one row per id)
# first count the ids with more than one distinct value
n_inconsistent <- df %>% 
  summarise(across(.cols = contains('distinct'), ~sum(.>1)))

# next get the number of rows
n_total <- nrow(df)

# calculate the proportion of ids that have more than one value for each column
consistency <- n_inconsistent %>% 
  mutate(across(contains('distinct'), ~./n_total))

But this involves intermediate variables and feels inelegant.

You can do it in the following way, starting from the original df (before the *_distinct columns are added):

library(dplyr)

df %>%
  group_by(id) %>%
  summarise(across(starts_with('col'), n_distinct)) %>%
  summarise(across(starts_with('col'), ~mean(. > 1), .names = '{.col}_distinct'))

#  col1_distinct col2_distinct col3_distinct col4_distinct
#          <dbl>         <dbl>         <dbl>         <dbl>
#1             0          0.25          0.25           0.5

First we count the number of unique values in each column per id, which gives one row per id. Then, for each column, we calculate the proportion of ids whose count is above 1.
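As for the error in the question: n() takes no arguments, so n(.) fails with "unused argument". The sketch below is a minimal illustration on the question's example data. It shows that calling n() with no arguments gives the same result as mean(. > 1), because the mean of a logical vector is the proportion of TRUE values. The names per_id, via_n and via_mean are only illustrative.

```r
library(dplyr)

# Same example data as in the question (before any *_distinct columns are added)
df <- tibble(id = c(1,1,2,3,4,4,4),
             col1 = c('a','a','b','b','c','c','c'),
             col2 = c('a','b','b','b','c','c','c'),
             col3 = c('a','a','b','b','a','b','c'),
             col4 = c('a','b','b','b','b','b','c'))

# One row per id, holding the number of distinct values in each column
per_id <- df %>%
  group_by(id) %>%
  summarise(across(col1:col4, n_distinct))

# n() is called with no arguments; here it returns the number of rows (ids)
via_n <- per_id %>%
  summarise(across(col1:col4, ~ sum(.x > 1) / n()))

# mean() of a logical vector is the proportion of TRUE values
via_mean <- per_id %>%
  summarise(across(col1:col4, ~ mean(.x > 1)))
```

Both via_n and via_mean should hold the same single row of proportions, so the mean() form is simply the shorter way to write the same calculation.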
