Use group_by with mutate, case_when, any() and all() functions in R

I have a data frame, status_df, with an id and a status at each stage:

id  stage  status
15  1      Pending
15  2      Not Sent
16  1      Approved
16  2      Rejected
16  3      Not Sent
16  4      Not Sent
20  1      Approved
20  2      Approved
20  3      Approved

I am trying to group by id and apply the following logic:

  • if any stage for an ID has 'Pending' status, the final_status column is 'Pending'
  • if any stage for an ID has 'Rejected' status, the final_status column is 'Rejected'
  • if all stages for an ID are 'Approved', the final_status column is 'Approved'

I am trying this (not working):

final_status_df = status_df %>% select(id, status) %>% group_by(id) %>%
mutate(final_status = case_when(any(status)=="Pending" ~ "Pending",
any(status)=="Rejected" ~ "Rejected", 
all(status)=="Approved" ~ "Approved"))

Expected output (final_status_df):

id  final_status
15  Pending
16  Rejected
20  Approved

Your attempt was headed in the right direction, but you closed the any() / all() parentheses before the comparison (==), so the comparison was applied to the result of any() rather than to status. Also, since you only want one row per id, you can use summarise instead of mutate, which also makes the select step unnecessary.

library(dplyr)

status_df %>% 
  group_by(id) %>%
  summarise(final_status = case_when(any(status == "Pending") ~ "Pending",
                                     any(status == "Rejected") ~ "Rejected", 
                                     all(status == "Approved") ~ "Approved"))

#    id final_status
#* <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    
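
One thing the three conditions above do not cover is a group that is neither pending, rejected, nor fully approved (for example, 'Approved' followed by 'Not Sent'). In that case case_when has no matching branch and returns NA. A minimal sketch of adding a fallback branch is below; the data frame status_df2, the id 30 group, and the "Other" label are made up for illustration and are not part of the original question.

```r
library(dplyr)

# Hypothetical data: id 30 matches none of the three original conditions
status_df2 <- data.frame(
  id     = c(15L, 15L, 30L, 30L),
  stage  = c(1L, 2L, 1L, 2L),
  status = c("Pending", "Not Sent", "Approved", "Not Sent")
)

result <- status_df2 %>%
  group_by(id) %>%
  summarise(
    final_status = case_when(
      any(status == "Pending")  ~ "Pending",
      any(status == "Rejected") ~ "Rejected",
      all(status == "Approved") ~ "Approved",
      TRUE                      ~ "Other"  # illustrative fallback label (an assumption)
    ),
    .groups = "drop"
  )

print(result)
```

Without the final TRUE branch, id 30 would get NA as its final_status.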

We can use summarise instead of mutate, because mutate returns an output column with the same length as the input column and is meant for creating or modifying a column rather than summarising.

Also, an easier option is to convert status to a factor with levels specified in the custom priority order, drop the unused levels (droplevels), and select the first of the remaining levels after grouping by 'id'.

library(dplyr)
status_df %>%
    group_by(id) %>%
    summarise(final_status = first(levels(droplevels(factor(status, 
          levels = c("Pending", "Rejected", "Approved"))))), .groups = 'drop')

-output

# A tibble: 3 x 2
#     id final_status
#  <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    
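
To see why the factor approach picks the right value, here is a rough base R walk-through of what happens inside one group (id 16). The variable names status_16, priority, f, used and final_status are only for illustration.

```r
# Statuses for id 16, taken from the sample data
status_16 <- c("Approved", "Rejected", "Not Sent", "Not Sent")

# Priority order: earlier levels win
priority <- c("Pending", "Rejected", "Approved")

# Values not listed in the levels ("Not Sent") become NA
f <- factor(status_16, levels = priority)

# droplevels() removes levels that never occur ("Pending" here),
# keeping the remaining levels in priority order
used <- levels(droplevels(f))

# The first remaining level is the highest-priority status present
final_status <- used[1]

print(used)
print(final_status)
```

Note that this is not strictly equivalent to the case_when version: a group containing only 'Approved' and 'Not Sent' would come out as 'Approved' here, whereas all(status == "Approved") would be FALSE for it. No such group exists in the sample data, so both approaches give the same output there.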

In the OP's code, any(status) returns NA because status is a character vector, so comparing that result with "Pending" never yields TRUE. any() should instead be applied to a logical vector, i.e. any(status == "Pending"). Also, as mentioned above, it should be summarise instead of mutate.
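
As a small base R illustration of the difference, assuming a single group's statuses stored in a vector called status (the variable names here are only for the example):

```r
status <- c("Approved", "Rejected", "Not Sent")

# Compare first: this gives one TRUE/FALSE per element
is_rejected <- status == "Rejected"

# Then reduce the logical vector to a single value
any_rejected <- any(is_rejected)
all_approved <- all(status == "Approved")

# These strings have no logical interpretation, so coercing them gives NA;
# this is why any() on the raw character column is not useful
coerced <- as.logical(status)

print(is_rejected)
print(any_rejected)
print(all_approved)
print(coerced)
```

The comparison has to sit inside any() / all() so that they receive a logical vector to reduce.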

data

status_df <- structure(list(id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 
20L), stage = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), status = c("Pending", 
"Not Sent", "Approved", "Rejected", "Not Sent", "Not Sent", "Approved", 
"Approved", "Approved")), class = "data.frame", row.names = c(NA, 
-9L))
