简体   繁体   中英

R dplyr to Conditionally Aggregate Based on Group and Column

I have a data set with subject ID's where I'm taking hourly measurements based on different criteria ("metrics"). Each of these metrics either does or does not meet compliance (1 for yes, 0 for no). I have two goals:

Tasks

(1) Determine "full compliance" ie a subject has 1's for every metric for every hour. If there is a zero, the subject fails to meet "full compliance".

(2) Determine the percentage of compliance for each subject. Ie if a subject has 3/4 hours of all 1's then they are 75% compliant.

I would like to solve this using dplyr and combinations of filter() or some aggregate function I may not be familiar with.

Example

Here is an example dataset:

df <- data.frame(
  "Subject ID" = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
  "Metric 1" = rep("value", 12),
  "Compliance 1" = rep(1, 12),
  "Metric 2" = rep("value", 12),
  "Compliance 2" = c(1,1,1,1,1,0,1,1,1,1,1,1),
  "Metric 3" = rep("value", 12),
  "Compliance 3" = c(1,1,1,1,0,1,1,1,1,1,1,1)
)

And here is what I would like to use dplyr to achieve for point (1):

df2 <- data.frame(
  "Subject ID" = c("A", "A", "A", "A", "C", "C", "C", "C"),
  "Metric 1" = rep("value", 8),
  "Compliance 1" = rep(1, 8),
  "Metric 2" = rep("value", 8),
  "Compliance 2" = rep(1, 8),
  "Metric 3" = rep("value", 8),
  "Compliance 3" = rep(1, 8)
)

For (2) I would like to aggregate the data in a way to show that Subject (A) had 100% compliance but (B) had 50% compliance.

Edited Answer

Thanks to tmfmnk I was able to get to the condensed percentages using:

df %>%
  mutate(cond = rowMeans(select(., starts_with("Compliance"))) == 1) %>%
  group_by(Subject.ID) %>% 
  tally(cond) %>% 
  mutate(perc = n/4)

  Subject.ID     n  perc
  <fct>      <int> <dbl>
1 A              4   1  
2 B              2   0.5
3 C              4   1  

One possibility could be:

df %>%
 mutate(cond = rowMeans(select(., starts_with("Compliance"))) == 1) %>%
 group_by(Subject.ID) %>%
 filter(all(cond)) %>%
 select(-cond)

      Subject.ID Metric.1 Compliance.1 Metric.2 Compliance.2 Metric.3 Compliance.3
    1          A    value            1    value            1    value            1
    2          A    value            1    value            1    value            1
    3          A    value            1    value            1    value            1
    4          A    value            1    value            1    value            1
    5          C    value            1    value            1    value            1
    6          C    value            1    value            1    value            1
    7          C    value            1    value            1    value            1
    8          C    value            1    value            1    value            1

If you want the percentage, then you can do:

df %>%
 mutate(cond = rowMeans(select(., starts_with("Compliance")))) 

   Subject.ID Metric.1 Compliance.1 Metric.2 Compliance.2 Metric.3 Compliance.3      cond
1           A    value            1    value            1    value            1 1.0000000
2           A    value            1    value            1    value            1 1.0000000
3           A    value            1    value            1    value            1 1.0000000
4           A    value            1    value            1    value            1 1.0000000
5           B    value            1    value            1    value            0 0.6666667
6           B    value            1    value            0    value            1 0.6666667
7           B    value            1    value            1    value            1 1.0000000
8           B    value            1    value            1    value            1 1.0000000
9           C    value            1    value            1    value            1 1.0000000
10          C    value            1    value            1    value            1 1.0000000
11          C    value            1    value            1    value            1 1.0000000
12          C    value            1    value            1    value            1 1.0000000

To get a percentage of cases fulfilling the condition per group (as proposed by @rsh52):

df %>%
 mutate(cond = rowMeans(select(., starts_with("Compliance"))) == 1) %>%
 group_by(Subject.ID) %>% 
 tally(cond) %>% 
 mutate(perc = n/4)

  Subject.ID     n  perc
  <fct>      <int> <dbl>
1 A              4   1  
2 B              2   0.5
3 C              4   1  

This answer is a bit lengthy, but the logic should be easy enough to follow -- sum the 1's per compliance column, calculate the compliance percentage, remove subjects with <100% compliance on the 3 compliance columns. Then filter out those subjects from the original dataframe.

compliant <- df %>% select(Subject.ID,matches('Compliance')) %>% 
  gather(k,v,-Subject.ID) %>% 
  group_by(Subject.ID,k) %>% 
  summarise(sum=sum(v),n=n()) %>% 
  group_by(Subject.ID,k) %>%
  mutate(perc=sum/sum(n)) %>% 
  group_by(Subject.ID) %>% summarise(c=sum(perc)) %>% 
  filter(c==3)

# A tibble: 2 x 2
  Subject.ID     c
  <fct>      <dbl>
1 A              3
2 C              3

df %>% filter(Subject.ID %in% compliant$Subject.ID)

  Subject.ID Metric.1 Compliance.1 Metric.2 Compliance.2 Metric.3 Compliance.3
1          A    value            1    value            1    value            1
2          A    value            1    value            1    value            1
3          A    value            1    value            1    value            1
4          A    value            1    value            1    value            1
5          C    value            1    value            1    value            1
6          C    value            1    value            1    value            1
7          C    value            1    value            1    value            1
8          C    value            1    value            1    value            1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM