Use group_by with mutate, case_when, any() and all() functions in R
I have a status_df with the id and status at each stage:
| id | stage | status |
|---|---|---|
| 15 | 1 | Pending |
| 15 | 2 | Not Sent |
| 16 | 1 | Approved |
| 16 | 2 | Rejected |
| 16 | 3 | Not Sent |
| 16 | 4 | Not Sent |
| 20 | 1 | Approved |
| 20 | 2 | Approved |
| 20 | 3 | Approved |
I am trying to group by id and apply the following logic:

- If any stage for an ID has 'Pending' status, the final_status column is 'Pending'.
- If any stage for an ID has 'Rejected' status, the final_status column is 'Rejected'.
- If all stages for an ID are approved, the final_status column is 'Approved'.

I am trying this (not working):
final_status_df = status_df %>% select(id, status) %>% group_by(id) %>%
mutate(final_status = case_when(any(status)=="Pending" ~ "Pending",
any(status)=="Rejected" ~ "Rejected",
all(status)=="Approved" ~ "Approved"))
Expected output (final_status_df):
| id | final_status |
|---|---|
| 15 | Pending |
| 16 | Rejected |
| 20 | Approved |
You were headed in the right direction with your attempt; however, you closed the any / all brackets too early, before the comparison ( == ). Also, since you only want one row for every id, you can use summarise instead of mutate, which also removes the need for select.
library(dplyr)
status_df %>%
group_by(id) %>%
summarise(final_status = case_when(any(status == "Pending") ~ "Pending",
any(status == "Rejected") ~ "Rejected",
all(status == "Approved") ~ "Approved"))
# id final_status
#* <int> <chr>
#1 15 Pending
#2 16 Rejected
#3 20 Approved
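One detail the rules above leave open: case_when returns NA when no condition matches, which would happen for an id whose stages are only 'Approved' and 'Not Sent'. The following is a minimal sketch of adding an explicit fallback. The extra id 21 and the label "Other" are assumptions made up for illustration; neither appears in the question.

```r
library(dplyr)

# Hypothetical data: id 21 is NOT in the original question. It is added
# only to show what happens when none of the three rules applies.
demo_df <- data.frame(
  id     = c(20L, 20L, 21L, 21L),
  status = c("Approved", "Approved", "Approved", "Not Sent"),
  stringsAsFactors = FALSE
)

demo_result <- demo_df %>%
  group_by(id) %>%
  summarise(
    final_status = case_when(
      any(status == "Pending")  ~ "Pending",
      any(status == "Rejected") ~ "Rejected",
      all(status == "Approved") ~ "Approved",
      TRUE                      ~ "Other"   # assumed fallback label
    ),
    .groups = "drop"
  )

print(demo_result)
```

Without the final TRUE branch, id 21 would get NA as its final_status.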
We can use summarise instead of mutate (mutate returns an output column with the same length as the input column and is used to create or modify a column rather than to summarise).
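To make the difference in output length concrete, here is a small sketch that runs the mutate version on the question's data and then collapses it with distinct. The names with_mutate and one_per_id are introduced here for illustration only.

```r
library(dplyr)

# Same id/status values as the question's status_df (stage column omitted)
status_df <- data.frame(
  id     = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 20L),
  status = c("Pending", "Not Sent", "Approved", "Rejected", "Not Sent",
             "Not Sent", "Approved", "Approved", "Approved"),
  stringsAsFactors = FALSE
)

# mutate keeps every input row and repeats the group-level value on each one
with_mutate <- status_df %>%
  group_by(id) %>%
  mutate(final_status = case_when(
    any(status == "Pending")  ~ "Pending",
    any(status == "Rejected") ~ "Rejected",
    all(status == "Approved") ~ "Approved"
  )) %>%
  ungroup()

nrow(with_mutate)   # same number of rows as status_df

# An extra step is needed to get one row per id
one_per_id <- with_mutate %>% distinct(id, final_status)
```

With summarise the extra distinct step is unnecessary, because the result already has one row per group.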
Also, an easier option is to convert status to a factor with levels specified in the custom order, drop the unused levels (droplevels), and select the first of the remaining levels after grouping by 'id'.
library(dplyr)
status_df %>%
group_by(id) %>%
summarise(final_status = first(levels(droplevels(factor(status,
levels = c("Pending", "Rejected", "Approved"))))), .groups = 'drop')
Output:
# A tibble: 3 x 2
# id final_status
# <int> <chr>
#1 15 Pending
#2 16 Rejected
#3 20 Approved
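The factor approach can be easier to follow when unpacked step by step for a single group. The sketch below does this in base R for id 16, using the statuses from the question; the variable names priority, f and present are illustrative only.

```r
# Statuses for id 16, copied from the question's data
status <- c("Approved", "Rejected", "Not Sent", "Not Sent")

# Custom priority order: earlier levels win
priority <- c("Pending", "Rejected", "Approved")

# "Not Sent" is not among the levels, so those entries become NA
f <- factor(status, levels = priority)

# Keep only the levels that actually occur, still in priority order
present <- levels(droplevels(f))

# The highest-priority status present in this group
final_status <- present[1]
print(final_status)
```

Note that this is not strictly equivalent to the case_when version: for a group containing only 'Approved' and 'Not Sent', the factor approach yields 'Approved', whereas all(status == "Approved") would be FALSE. For the data in the question the two agree.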
In the OP's code, any(status) returns NA; any should instead be wrapped around a logical vector, i.e. any(status == "Pending"). Also, as mentioned above, it should be summarise instead of mutate.
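The point about wrapping any / all around a logical vector can be checked in isolation. This is a short base R sketch using the two statuses of id 15; the intermediate variable names are illustrative.

```r
# Statuses for id 15, copied from the question's data
status <- c("Pending", "Not Sent")

# The comparison is done first and gives one logical value per element
is_pending <- status == "Pending"

# any() / all() then reduce that logical vector to a single value
any_pending  <- any(is_pending)
all_approved <- all(status == "Approved")

# Comparing a missing value with a string gives NA, so a condition of the
# form NA == "Pending" can never select a case_when branch
na_compare <- NA == "Pending"
```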
status_df <- structure(list(id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L,
20L), stage = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), status = c("Pending",
"Not Sent", "Approved", "Rejected", "Not Sent", "Not Sent", "Approved",
"Approved", "Approved")), class = "data.frame", row.names = c(NA,
-9L))