
Use group_by with mutate, case_when, any() and all() functions in R

I have a status_df with id and status at each stage:

id stage status
15 1 Pending
15 2 Not Sent
16 1 Approved
16 2 Rejected
16 3 Not Sent
16 4 Not Sent
20 1 Approved
20 2 Approved
20 3 Approved

I am trying to group by id and apply the following logic:

  • if any stage for an ID has 'Pending' status, final_status column is 'Pending'
  • if any stage for an ID has 'Rejected' status, final_status column is 'Rejected'
  • if all stages for an ID are approved, final_status column is 'Approved'

I am trying this (not working):

final_status_df = status_df %>% select(id, status) %>% group_by(id) %>%
mutate(final_status = case_when(any(status)=="Pending" ~ "Pending",
any(status)=="Rejected" ~ "Rejected", 
all(status)=="Approved" ~ "Approved"))

Expected output (final_status_df):

id final_status
15 Pending
16 Rejected
20 Approved

You were on the right track with your attempt; however, you closed the any / all parentheses before the comparison ( == ), so the comparison was applied to the result of any() / all() rather than to status. Also, since you only want one row for every id, you can use summarise instead of mutate, which also makes the select unnecessary.

library(dplyr)

status_df %>% 
  group_by(id) %>%
  summarise(final_status = case_when(any(status == "Pending") ~ "Pending",
                                     any(status == "Rejected") ~ "Rejected", 
                                     all(status == "Approved") ~ "Approved"))

#    id final_status
#* <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    
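One point worth spelling out: case_when checks its conditions in the order they are written and uses the first one that is TRUE, so the order of the three rules sets their priority. A minimal sketch of this, using a made-up toy_df that is not part of the question (id 1 has both a Rejected and a Pending stage, id 2 matches none of the rules):

```r
library(dplyr)

# Hypothetical data, not from the question:
# id 1 has both Rejected and Pending stages; id 2 matches none of the rules.
toy_df <- data.frame(
  id     = c(1L, 1L, 2L, 2L),
  status = c("Rejected", "Pending", "Approved", "Not Sent")
)

toy_result <- toy_df %>%
  group_by(id) %>%
  summarise(final_status = case_when(
    any(status == "Pending")  ~ "Pending",   # checked first, so it wins for id 1
    any(status == "Rejected") ~ "Rejected",  # only reached when no stage is Pending
    all(status == "Approved") ~ "Approved"   # if nothing matches, the result is NA
  ))

toy_result
```

Here id 1 comes out as "Pending" and id 2 as NA, because none of the three conditions is TRUE for it. If such ids can occur in the real data, a final catch-all condition would be needed; what label it should produce is not stated in the question.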

We can use summarise instead of mutate. mutate creates or modifies a column and returns one value per input row, so every id would be repeated once per stage, whereas summarise returns one row per group.
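To illustrate the difference, here is a small sketch that keeps mutate and then collapses the repeated rows with distinct. The names with_mutate and one_per_id are only illustrative, and the data is the status_df from the question rebuilt with data.frame:

```r
library(dplyr)

# Same data as in the question, rebuilt here so the example runs on its own.
status_df <- data.frame(
  id     = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 20L),
  stage  = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L),
  status = c("Pending", "Not Sent", "Approved", "Rejected", "Not Sent",
             "Not Sent", "Approved", "Approved", "Approved")
)

# mutate keeps every input row; the per-group value is recycled within each id.
with_mutate <- status_df %>%
  group_by(id) %>%
  mutate(final_status = case_when(
    any(status == "Pending")  ~ "Pending",
    any(status == "Rejected") ~ "Rejected",
    all(status == "Approved") ~ "Approved"
  )) %>%
  ungroup()

nrow(with_mutate)  # still 9 rows, one per stage

# An extra step is needed to get down to one row per id.
one_per_id <- with_mutate %>% distinct(id, final_status)
one_per_id
```

The mutate form can be useful when the per-stage rows should be kept alongside the final status; for the output requested in the question, summarise gets there in one step.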

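Also, an easier option is to convert status to a factor with the levels specified in the custom priority order, drop the unused levels ( droplevels ) and take the first remaining level after grouping by 'id'. Statuses that are not among the levels, such as "Not Sent", become NA and are ignored.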

library(dplyr)
status_df %>%
    group_by(id) %>%
    summarise(final_status = first(levels(droplevels(factor(status, 
          levels = c("Pending", "Rejected", "Approved"))))), .groups = 'drop')

Output:

# A tibble: 3 x 2
#     id final_status
#  <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    
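To see why this works, the chain can be unpacked for a single id in base R. This is only a step-by-step sketch for the statuses of id 16; the variable names status_16, f, present and final_16 are illustrative:

```r
# Statuses of id 16, taken from the question's data.
status_16 <- c("Approved", "Rejected", "Not Sent", "Not Sent")

# Step 1: factor with levels in priority order.
# "Not Sent" is not among the levels, so those entries become NA.
f <- factor(status_16, levels = c("Pending", "Rejected", "Approved"))

# Step 2: drop levels that do not occur ("Pending" here);
# the remaining levels keep their priority order.
present <- levels(droplevels(f))

# Step 3: the first remaining level is the highest-priority status present.
final_16 <- present[1]
final_16
```

For id 16 the remaining levels are "Rejected" and "Approved", so the first one, "Rejected", is the final status.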

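In the OP's code, any(status) is applied to a character vector, which is coerced to logical (with a warning) and returns NA, so NA == "Pending" is also NA and no condition ever matches. Instead, any() should be wrapped around a logical vector, i.e. any(status == "Pending") . Also, as mentioned above, it should be summarise instead of mutate.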
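A quick base R check of this, using the two statuses of id 15 (the names bad and good are just for illustration, and suppressWarnings is only there to keep the coercion warning quiet):

```r
# Statuses of id 15, taken from the question's data.
status <- c("Pending", "Not Sent")

# OP's form: any() coerces the strings to logical, which gives NA.
bad <- suppressWarnings(any(status))
bad               # NA
bad == "Pending"  # NA, so case_when never treats it as a match

# Corrected form: compare first, then reduce the logical vector.
good <- any(status == "Pending")
good              # TRUE
```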

data

status_df <- structure(list(id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 
20L), stage = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), status = c("Pending", 
"Not Sent", "Approved", "Rejected", "Not Sent", "Not Sent", "Approved", 
"Approved", "Approved")), class = "data.frame", row.names = c(NA, 
-9L))

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please indicate the site URL or the original address.