Using group_by with mutate, case_when, any() and all() in R

Question

I have a status_df with an id and a status at each stage:

id	stage	status
15	1	Pending
15	2	Not Sent
16	1	Approved
16	2	Rejected
16	3	Not Sent
16	4	Not Sent
20	1	Approved
20	2	Approved
20	3	Approved

I am trying to group by id and apply the following logic:

if any stage for an id has 'Pending' status, the final_status column is 'Pending'
if any stage for an id has 'Rejected' status, the final_status column is 'Rejected'
if all stages for an id are 'Approved', the final_status column is 'Approved'

I am trying this (not working):

final_status_df = status_df %>% select(id, status) %>% group_by(id) %>%
mutate(final_status = case_when(any(status)=="Pending" ~ "Pending",
any(status)=="Rejected" ~ "Rejected", 
all(status)=="Approved" ~ "Approved"))

Expected output (final_status_df):

id	final_status
15	Pending
16	Rejected
20	Approved

Answer 1

You were headed in the right direction, but you closed the any()/all() parentheses before the comparison (==), so the comparison is applied to the result of any()/all() rather than to status. Also, since you want only one row per id, you can use summarise instead of mutate, which also makes the select step unnecessary.

library(dplyr)

status_df %>% 
  group_by(id) %>%
  summarise(final_status = case_when(any(status == "Pending") ~ "Pending",
                                     any(status == "Rejected") ~ "Rejected", 
                                     all(status == "Approved") ~ "Approved"))

#    id final_status
#* <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved
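
If you would rather keep mutate as in the original attempt, one possible variant is to compute the per-group value with mutate and then collapse to one row per id with distinct(). This is only a sketch: the data frame is rebuilt inline from the question's table so the block runs on its own, and the distinct() step is an addition that is not part of the answer above.

```r
library(dplyr)

# Sample data rebuilt from the question (stage column omitted, it is not needed here)
status_df <- data.frame(
  id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 20L),
  status = c("Pending", "Not Sent", "Approved", "Rejected", "Not Sent",
             "Not Sent", "Approved", "Approved", "Approved")
)

final_status_df <- status_df %>%
  group_by(id) %>%
  # each condition is a single TRUE/FALSE per group; the result is recycled to every row
  mutate(final_status = case_when(any(status == "Pending") ~ "Pending",
                                  any(status == "Rejected") ~ "Rejected",
                                  all(status == "Approved") ~ "Approved")) %>%
  ungroup() %>%
  # keep one row per id
  distinct(id, final_status)
```

summarise remains the more direct choice, since it returns one row per group without the extra step.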

Answer 2

We can use summarise instead of mutate: mutate returns a column of the same length as its input and is meant for creating or modifying columns, not for summarising.

An easier option is to convert status to a factor with levels specified in the custom priority order, drop the unused levels (droplevels), and take the first of the remaining levels after grouping by 'id'.

library(dplyr)
status_df %>%
    group_by(id) %>%
    summarise(final_status = first(levels(droplevels(factor(status, 
          levels = c("Pending", "Rejected", "Approved"))))), .groups = 'drop')

Output:

# A tibble: 3 x 2
#     id final_status
#  <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved
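
To see why the factor approach works, here is a rough base R walk-through of the same steps for a single group (id 16). The variable names are made up for illustration, and levels(...)[1] stands in for dplyr's first().

```r
# Statuses for id 16, taken from the question's data
status_16 <- c("Approved", "Rejected", "Not Sent", "Not Sent")

# Priority order: an earlier level wins over a later one
priority <- c("Pending", "Rejected", "Approved")

# Values that are not listed in `levels` ("Not Sent") become NA
f <- factor(status_16, levels = priority)

# droplevels() removes levels that never occur in this group ("Pending"),
# while keeping the remaining levels in priority order
f_used <- droplevels(f)
remaining <- levels(f_used)

# The first remaining level is the highest-priority status present
final_status_16 <- remaining[1]
```

Note that the two answers are not equivalent for every input: a group containing only "Approved" and "Not Sent" would get "Approved" here, whereas the case_when version would return NA because all(status == "Approved") is FALSE.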

In the OP's code, any(status) returns NA; any() should instead be applied to a logical vector, i.e. any(status == "Pending"). Also, as mentioned above, use summarise instead of mutate.

Data
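
The difference between the two forms can be checked on a single group in base R. This is a small sketch using the statuses for id 15; suppressWarnings() is used only because any() warns when it coerces a character vector to logical.

```r
# Statuses for id 15, taken from the question's data
status <- c("Pending", "Not Sent")

# Comparison first: a logical vector (TRUE, FALSE) reduced to a single TRUE
inside <- any(status == "Pending")

# any() first: the strings cannot be read as TRUE/FALSE, so the result is NA
coerced <- suppressWarnings(any(status))

# Comparing NA with a string is still NA, so no case_when branch can match
outside <- coerced == "Pending"
```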

status_df <- structure(list(id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 
20L), stage = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), status = c("Pending", 
"Not Sent", "Approved", "Rejected", "Not Sent", "Not Sent", "Approved", 
"Approved", "Approved")), class = "data.frame", row.names = c(NA, 
-9L))
