dplyr::mutate_at() with external variables and conditional on their values
I have a dataset in long format (i.e., multiple observations per ID). Each ID contains multiple visits at which the individual was diagnosed for disease (the toy example shows 3, but my real data has as many as 30), which are coded in consecutive columns (disease1-disease3). A value of 1 means they were diagnosed with the disease at the time of diagnosis_dt, and 0 means they did not have it. For each ID, I'm interested in summarizing whether or not they had any disease across all visits where diagnosis_dt falls between start_dt and end_dt. Some IDs don't have diagnosis information and are consequently coded as NAs in the respective columns. I'd still like to keep this information.
A toy example of my dataset is below:
library(dplyr)
library(data.table)
ex_dat <- data.frame(ID = c(rep("a",3),
                            rep("b",4),
                            rep("c",5)),
                     start_dt = as.Date(c(rep("2009-01-01",3),
                                          rep("2009-04-01",4),
                                          rep("2009-02-01",5))),
                     end_dt = as.Date(c(rep("2010-12-31",3),
                                        rep("2011-03-31",4),
                                        rep("2011-01-31",5))),
                     diagnosis_dt = c(as.Date(c("2011-01-03","2010-11-01","2009-12-01")),
                                      as.Date(c("2011-04-03","2010-11-01","2009-12-01","2011-12-01")),
                                      rep(NA,5)),
                     disease1 = c(c(1,0,0),
                                  c(1,1,0,1),
                                  rep(NA,5)),
                     disease2 = c(c(1,1,0),
                                  c(0,0,0,1),
                                  rep(NA,5)),
                     disease3 = c(c(0,0,0),
                                  c(0,0,1,0),
                                  rep(NA,5))
)
The desired output is:
ID disease1 disease2 disease3
1 a 0 1 0
2 b 1 0 1
3 c NA NA NA
I've been trying this for hours now and my latest attempt is:
out <- ex_dat %>% group_by(ID) %>%
  mutate_at(vars(disease1:disease3),
            function(x) ifelse(!is.na(.$diagnosis_dt) &
                                 between(.$diagnosis_dt,.$start_dt,.$end_dt) &
                                 sum(x)>0,
                               1,0)) %>%
  slice(1) %>%
  select(ID,disease1:disease3)
Here is a tidyverse solution that uses filter to drop the rows that do not meet the desired condition and then complete to restore the missing groups with NA.
library(tidyverse)
ex_dat %>%
  # complete() fills in missing IDs from factor levels, and data.frame() only
  # creates factors by default before R 4.0, so convert explicitly
  mutate(ID = factor(ID)) %>%
  # Group by ID
  group_by(ID) %>%
  # Keep the rows for which diagnosis_dt is between start_dt and end_dt
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  # Summarize all variables that start with "disease" by taking their max value
  summarize_at(vars(starts_with("disease")), max) %>%
  # Restore the missing IDs: those that only had NA or did not meet the
  # criteria in the filter
  complete(ID)
# A tibble: 3 x 4
# ID disease1 disease2 disease3
# <fct> <dbl> <dbl> <dbl>
# 1 a 0 1 0
# 2 b 1 0 1
# 3 c NA NA NA
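complete(ID) can only restore IDs it still knows about, which in practice means ID has to be a factor whose levels survive the filter. A minimal sketch that avoids that dependence joins the summarized rows back onto the full list of IDs. The reduced data frame toy and the names in_window and res are illustrative assumptions, not part of the original example.

```r
library(dplyr)

# Assumed miniature stand-in for ex_dat: same column names, fewer rows.
toy <- data.frame(
  ID           = c("a", "a", "b", "c"),
  start_dt     = as.Date(c("2009-01-01", "2009-01-01", "2009-04-01", "2009-02-01")),
  end_dt       = as.Date(c("2010-12-31", "2010-12-31", "2011-03-31", "2011-01-31")),
  diagnosis_dt = as.Date(c("2011-01-03", "2010-11-01", "2010-11-01", NA)),
  disease1     = c(1, 0, 1, NA),
  disease2     = c(1, 1, 0, NA)
)

# Summarize only the visits inside the window; rows with an NA date are
# dropped by filter() because the condition evaluates to NA.
in_window <- toy %>%
  filter(diagnosis_dt >= start_dt, diagnosis_dt <= end_dt) %>%
  group_by(ID) %>%
  summarise(across(starts_with("disease"), max), .groups = "drop")

# Join back onto every ID so that IDs with no qualifying visit get NA.
res <- toy %>%
  distinct(ID) %>%
  left_join(in_window, by = "ID")
```

Because the list of IDs comes from distinct(ID) rather than from factor levels, this works whether ID is a character vector or a factor.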
Here's an approach with dplyr's across functionality (version >= 1.0.0):
library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt","end_dt","diagnosis_dt")),
                   ~ if_else(any(diagnosis_dt > start_dt & diagnosis_dt < end_dt & .),
                             1, 0)))
## A tibble: 3 x 4
# ID disease1 disease2 disease3
# <fct> <dbl> <dbl> <dbl>
#1 a 0 1 0
#2 b 1 0 1
#3 c NA NA NA
Note that applying the & operator to the numeric column . converts it to logical. I'm using the -one_of tidyselect helper because then we don't even need to know how many diseases there are. The columns that are actively being grouped by are automatically excluded.
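The coercion, and the NA result for an ID without any diagnosis dates, both follow from how R's logical operators treat numbers and missing values. A small sketch of the same expression outside of summarize, where the vectors in_window, disease, and no_dates are made-up stand-ins for one group's columns:

```r
library(dplyr)

# One group with dated visits: the first visit falls outside the window.
in_window <- c(FALSE, TRUE, TRUE)
disease   <- c(1, 1, 0)  # numeric 0/1 flags; `&` treats nonzero as TRUE
flag_a <- if_else(any(in_window & disease), 1, 0)

# One group without dates: comparing NA dates yields NA, and NA & NA is NA,
# so any() returns NA and if_else() passes the NA through.
no_dates <- as.Date(c(NA, NA)) > as.Date("2009-02-01")
flag_c <- if_else(any(no_dates & c(NA_real_, NA_real_)), 1, 0)
```

Here flag_a is 1 because the second visit is both in the window and flagged, while flag_c is NA, which is how ID c keeps its NA values in the output.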
Your version isn't working for two reasons. 1) You need to summarize, not mutate. 2) Inside the function, . does not give you the current group's rows: in a ~ lambda it is the column being worked on, and inside function(x) it is the whole data frame coming from the pipe, so .$diagnosis_dt ignores the grouping. Instead, refer to those columns by bare name, without $, as the across() call above does.
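The pipe behavior can be checked on a tiny example. In this sketch, toy and the column names rows_via_dot and rows_in_group are illustrative assumptions; the point is only that . in a magrittr pipeline is the entire piped data frame, even after group_by().

```r
library(dplyr)

# Assumed toy data: two rows for ID "a", one row for ID "b".
toy <- data.frame(ID = c("a", "a", "b"), flag = c(1, 0, 1))

res <- toy %>%
  group_by(ID) %>%
  summarise(
    rows_via_dot  = nrow(.),  # `.` is the full piped data frame: 3 for every group
    rows_in_group = n()       # n() respects the grouping: 2 for "a", 1 for "b"
  )
```

The same mismatch is what breaks .$diagnosis_dt in the original attempt: it returns the full-length column rather than the rows of the current ID.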