Use Group by with mutate, case_when, any() and all() function in R

Question

I have a status_df with id and status at each stage:

id	stage	status
15	1	Pending
15	2	Not Sent
16	1	Approved
16	2	Rejected
16	3	Not Sent
16	4	Not Sent
20	1	Approved
20	2	Approved
20	3	Approved

I am trying to do a group_by ID and apply the following logic:

if any stage for an ID has 'Pending' status, final_status column is 'Pending'
if any stage for an ID has 'Rejected' status, final_status column is 'Rejected'
if all stages for an ID are approved, final_status column is 'Approved'

I am trying this (not working):

final_status_df = status_df %>% select(id, status) %>% group_by(id) %>%
mutate(final_status = case_when(any(status)=="Pending" ~ "Pending",
any(status)=="Rejected" ~ "Rejected", 
all(status)=="Approved" ~ "Approved"))

Expected output (final_status_df):

id	final_status
15	Pending
16	Rejected
20	Approved

Answer 1

You were in the right direction with your attempt however, you closed any / all brackets early before comparison ( == ). Also since you only want 1 row for every id you can use summarise instead of mutate which will also avoid the use of select .

library(dplyr)

status_df %>% 
  group_by(id) %>%
  summarise(final_status = case_when(any(status == "Pending") ~ "Pending",
                                     any(status == "Rejected") ~ "Rejected", 
                                     all(status == "Approved") ~ "Approved"))

#    id final_status
#* <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved

Answer 2

We can use summarise instead of mutate (as mutate returns the output column with the same length as the input column and it is used to create/modify a column instead of summarising).

Also, an easier option is to convert to factor with levels specified in the custom order, drop the unused levels ( droplevels ) and select the first levels after grouping by 'id'

library(dplyr)
status_df %>%
    group_by(id) %>%
    summarise(final_status = first(levels(droplevels(factor(status, 
          levels = c("Pending", "Rejected", "Approved"))))), .groups = 'drop')

-output

# A tibble: 3 x 2
#     id final_status
#  <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved

In the OP's code, any(status) returns NA , instead it should be wrapped on a logical vector ie any(status == "Pending") . Also, as mentioned above, it would be summarise instead of mutate

data

status_df <- structure(list(id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 
20L), stage = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), status = c("Pending", 
"Not Sent", "Approved", "Rejected", "Not Sent", "Not Sent", "Approved", 
"Approved", "Approved")), class = "data.frame", row.names = c(NA, 
-9L))

Use Group by with mutate, case_when, any() and all() function in R

Question

2 answers

solution1
2 ACCPTED 2021-02-04 03:46:35

solution2
1 2021-02-04 00:02:26

data

Use Group by with mutate, case_when, any() and all() function in R

Question

2 answers

solution1 2 ACCPTED 2021-02-04 03:46:35

solution2 1 2021-02-04 00:02:26

data

solution1
2 ACCPTED 2021-02-04 03:46:35

solution2
1 2021-02-04 00:02:26