I have a dataset in long format (i.e., multiple observations per ID). Each ID contains multiple visits at which the individual was diagnosed for disease (in the toy example I show 3, but in my real data I have as many as 30), which are coded in consecutive columns (disease1-disease3). A value of 1 means they were diagnosed with the disease on diagnosis_dt, and 0 means they did not have it. For each ID, I'm interested in summarizing whether or not they had any disease across all visits where diagnosis_dt falls between start_dt and end_dt. Some IDs don't have diagnosis information and are consequently coded as NA in the respective columns. I'd still like to keep this information.
A toy example of my dataset is below:
library(dplyr)
library(data.table)
ex_dat <- data.frame(ID = c(rep("a",3),
                            rep("b",4),
                            rep("c",5)),
                     start_dt = as.Date(c(rep("2009-01-01",3),
                                          rep("2009-04-01",4),
                                          rep("2009-02-01",5))),
                     end_dt = as.Date(c(rep("2010-12-31",3),
                                        rep("2011-03-31",4),
                                        rep("2011-01-31",5))),
                     diagnosis_dt = c(as.Date(c("2011-01-03","2010-11-01","2009-12-01")),
                                      as.Date(c("2011-04-03","2010-11-01","2009-12-01","2011-12-01")),
                                      rep(NA,5)),
                     disease1 = c(c(1,0,0),
                                  c(1,1,0,1),
                                  rep(NA,5)),
                     disease2 = c(c(1,1,0),
                                  c(0,0,0,1),
                                  rep(NA,5)),
                     disease3 = c(c(0,0,0),
                                  c(0,0,1,0),
                                  rep(NA,5))
)
The desired output is:
  ID disease1 disease2 disease3
1  a        0        1        0
2  b        1        0        1
3  c       NA       NA       NA
I've been trying this for hours now and my latest attempt is:
out <- ex_dat %>% group_by(ID) %>%
  mutate_at(vars(disease1:disease3),
            function(x) ifelse(!is.na(.$diagnosis_dt) &
                                 between(.$diagnosis_dt,.$start_dt,.$end_dt) &
                                 sum(x)>0,
                               1,0)) %>%
  slice(1) %>%
  select(ID,disease1:disease3)
Here is a tidyverse solution that uses filter to eliminate the rows that do not meet the desired condition and then complete to fill in the missing groups with NA.
library(tidyverse)
ex_dat %>%
  # Group by ID
  group_by(ID) %>%
  # Keep the rows for which diagnosis_dt is between start_dt and end_dt
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  # Summarize all variables that start with "disease" by taking their max value
  summarize_at(vars(starts_with("disease")), max) %>%
  # Complete the missing IDs: those that only had NA or did not meet the
  # criteria in the filter
  complete(ID)
# A tibble: 3 x 4
#   ID    disease1 disease2 disease3
#   <fct>    <dbl>    <dbl>    <dbl>
# 1 a            0        1        0
# 2 b            1        0        1
# 3 c           NA       NA       NA
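One caveat with the filter-then-complete approach: the output above shows ID as a factor (<fct>), and complete(ID) can restore "c" only because the unused factor level survives the filter. Since R 4.0.0, data.frame() keeps strings as character by default, in which case complete(ID) has no way of knowing about the dropped ID. A minimal sketch of one way around this, assuming ID is a character column and dplyr >= 1.0.0, is to pass the full set of IDs to complete() explicitly. The data frame is a compact rebuild of the toy data from the question.

```r
library(dplyr)
library(tidyr)

# Compact rebuild of the toy data from the question, with ID kept as character.
n_rows <- c(3, 4, 5)
ex_dat <- data.frame(
  ID = rep(c("a", "b", "c"), times = n_rows),
  start_dt = as.Date(rep(c("2009-01-01", "2009-04-01", "2009-02-01"), times = n_rows)),
  end_dt = as.Date(rep(c("2010-12-31", "2011-03-31", "2011-01-31"), times = n_rows)),
  diagnosis_dt = as.Date(c("2011-01-03", "2010-11-01", "2009-12-01",
                           "2011-04-03", "2010-11-01", "2009-12-01", "2011-12-01",
                           rep(NA, 5))),
  disease1 = c(1, 0, 0, 1, 1, 0, 1, rep(NA, 5)),
  disease2 = c(1, 1, 0, 0, 0, 0, 1, rep(NA, 5)),
  disease3 = c(0, 0, 0, 0, 0, 1, 0, rep(NA, 5)),
  stringsAsFactors = FALSE
)

out <- ex_dat %>%
  # Rows with a missing diagnosis_dt evaluate to NA here and are dropped
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  group_by(ID) %>%
  summarise(across(starts_with("disease"), max)) %>%
  # Supply every ID from the original data so dropped IDs return as NA rows
  complete(ID = unique(ex_dat$ID))

out
```

An alternative with the same effect would be to convert ID to a factor before filtering and keep complete(ID) as written.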
Here's an approach with the dplyr across functionality (version >= 1.0.0):
library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt","end_dt","diagnosis_dt")),
                   ~ if_else(any(diagnosis_dt > start_dt & diagnosis_dt < end_dt & .),
                             1, 0)))
## A tibble: 3 x 4
#  ID    disease1 disease2 disease3
#  <fct>    <dbl>    <dbl>    <dbl>
#1 a            0        1        0
#2 b            1        0        1
#3 c           NA       NA       NA
Note that using the & operator on the numeric column . coerces it to logical. I'm using the -one_of tidyselect helper because then we don't even need to know how many diseases there are. The columns that are actively being group_by-ed are automatically excluded.
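The NA row for ID c falls out of how &, any() and if_else() treat missing values. The following is a small sketch of that behaviour on made-up vectors (the names flags, coerced and so on are only for illustration):

```r
library(dplyr)

flags <- c(1, 0, NA)

# & coerces numbers to logical: non-zero is TRUE, zero is FALSE, NA stays NA
coerced <- flags & TRUE

# any() returns NA when nothing is TRUE but at least one value is missing
all_missing <- any(c(NA, NA))
false_and_missing <- any(c(FALSE, NA))
true_and_missing <- any(c(TRUE, NA))

# if_else() passes a missing condition through as NA
from_na <- if_else(NA, 1, 0)

coerced
all_missing
false_and_missing
true_and_missing
from_na
```

One consequence worth keeping in mind: an ID whose in-window visits are all 0 but which also has a visit with a missing date would come out as NA rather than 0, because any(c(FALSE, NA)) is NA.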
Your version isn't working because 1) you need to summarize, not mutate, and 2) inside the function, . does not give you the current group's rows: in a ~ lambda it is the column being worked on, and in your function(x) it is the whole piped data frame. Instead, refer to those columns by bare name, without $, inside summarize(across(...)).
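Applying those two points to the original attempt might look like the sketch below. It assumes dplyr >= 1.0.0, keeps the disease1:disease3 column range from the question, and writes the window test with >= and <= so that visits falling exactly on start_dt or end_dt are counted. The data frame is a compact rebuild of the toy data.

```r
library(dplyr)

# Compact rebuild of the toy data from the question.
n_rows <- c(3, 4, 5)
ex_dat <- data.frame(
  ID = rep(c("a", "b", "c"), times = n_rows),
  start_dt = as.Date(rep(c("2009-01-01", "2009-04-01", "2009-02-01"), times = n_rows)),
  end_dt = as.Date(rep(c("2010-12-31", "2011-03-31", "2011-01-31"), times = n_rows)),
  diagnosis_dt = as.Date(c("2011-01-03", "2010-11-01", "2009-12-01",
                           "2011-04-03", "2010-11-01", "2009-12-01", "2011-12-01",
                           rep(NA, 5))),
  disease1 = c(1, 0, 0, 1, 1, 0, 1, rep(NA, 5)),
  disease2 = c(1, 1, 0, 0, 0, 0, 1, rep(NA, 5)),
  disease3 = c(0, 0, 0, 0, 0, 1, 0, rep(NA, 5)),
  stringsAsFactors = FALSE
)

out <- ex_dat %>%
  group_by(ID) %>%
  # summarise() collapses each ID to one row, so slice(1) is no longer needed
  summarise(across(
    disease1:disease3,
    # .x is the current disease column within the group; the date columns
    # are referenced by bare name and are also restricted to the group
    ~ as.numeric(any(diagnosis_dt >= start_dt &
                       diagnosis_dt <= end_dt &
                       .x == 1))
  ))

out
```

The date columns are dropped automatically because summarise() keeps only the grouping column and the summarised columns, so the trailing select() from the attempt is not needed either.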