
dplyr::mutate_at() with external variables and conditional on their values

I have a dataset in long format (i.e. multiple observations per ID). Each ID has multiple visits at which the individual was diagnosed for disease (the toy example shows 3, but my real data has as many as 30), which are coded in consecutive columns (disease1-disease3). A value of 1 means they were diagnosed with the disease at diagnosis_dt, and 0 means they did not have it. For each ID, I'm interested in summarizing whether or not they had each disease at any visit where diagnosis_dt falls between start_dt and end_dt. Some IDs don't have diagnosis information and are consequently coded as NA in the respective columns. I'd still like to keep these IDs in the output.

A toy example of my dataset is below:

library(dplyr)
library(data.table)

ex_dat <- data.frame(ID = c(rep("a",3),
                  rep("b",4),
                  rep("c",5)),
                  start_dt = as.Date(c(rep("2009-01-01",3),
                                       rep("2009-04-01",4),
                                       rep("2009-02-01",5))),
                  end_dt = as.Date(c(rep("2010-12-31",3),
                                rep("2011-03-31",4),
                                rep("2011-01-31",5))),
           diagnosis_dt = c(as.Date(c("2011-01-03","2010-11-01","2009-12-01")),
                            as.Date(c("2011-04-03","2010-11-01","2009-12-01","2011-12-01")),
                            rep(NA,5)),
           disease1 = c(c(1,0,0),
                        c(1,1,0,1),
                        rep(NA,5)),
           disease2 = c(c(1,1,0),
                        c(0,0,0,1),
                        rep(NA,5)),
           disease3 = c(c(0,0,0),
                        c(0,0,1,0),
                        rep(NA,5))
           )

The desired output is:

  ID disease1 disease2 disease3
1  a        0        1        0
2  b        1        0        1
3  c       NA       NA       NA

I've been trying this for hours now and my latest attempt is:

out <- ex_dat %>% group_by(ID) %>%
           mutate_at(vars(disease1:disease3),
                     function(x) ifelse(!is.na(.$diagnosis_dt) & 
                                          between(.$diagnosis_dt,.$start_dt,.$end_dt) & 
                                          sum(x)>0,
                                        1,0)) %>%
           slice(1) %>%
           select(ID,disease1:disease3)

Here is a tidyverse solution that uses filter to drop the rows that do not meet the desired condition, and then complete to add back the missing IDs with NA.

library(tidyverse)

ex_dat %>%
  # Group by ID
  group_by(ID) %>%
  # Keep the rows for which diagnosis_dt is between start_dt and end_dt
  # (rows with an NA diagnosis_dt are dropped here as well)
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  # Summarize every column whose name starts with "disease" by taking its max
  summarize_at(vars(starts_with("disease")), max) %>%
  # Add back the IDs that only had NA or did not meet the filter criteria.
  # This relies on ID being a factor (see <fct> in the output below); with a
  # character ID, complete(ID) only sees the IDs that are still present.
  complete(ID)

# A tibble: 3 x 4
#  ID    disease1 disease2 disease3
# <fct>    <dbl>    <dbl>    <dbl>
# 1 a            0        1        0
# 2 b            1        0        1
# 3 c           NA       NA       NA
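
The complete(ID) step above depends on how ID is stored, so here is a minimal sketch of that behaviour. It assumes tidyr is installed; summ and all_ids are hypothetical names standing in for the summarised table and the full set of IDs, not objects from the answer.

```r
library(tidyr)

# Hypothetical summary after the filter step: ID "c" has dropped out.
summ <- data.frame(ID = c("a", "b"), disease1 = c(0, 1),
                   stringsAsFactors = FALSE)
all_ids <- c("a", "b", "c")  # assumed: every ID in the original data

# Character ID: complete() only knows the values still present,
# so no row is added for "c".
no_fill <- complete(summ, ID)

# Option 1: pass the full set of IDs explicitly.
explicit <- complete(summ, ID = all_ids)

# Option 2: store ID as a factor whose levels include every ID;
# complete() then expands to all levels.
summ_fct <- summ
summ_fct$ID <- factor(summ_fct$ID, levels = all_ids)
via_factor <- complete(summ_fct, ID)
```

Since R 4.0.0, data.frame() keeps strings as character by default, so with the toy data built as in the question one of the two options is probably needed to get the row for "c" back.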

Here's an approach with the dplyr across functionality (version >= 1.0.0):

library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt","end_dt","diagnosis_dt")),
                   # >= and <= keep the window inclusive, like between() in the question
                   ~ if_else(any(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt & .),
                             1, 0)))
## A tibble: 3 x 4
#  ID    disease1 disease2 disease3
#  <fct>    <dbl>    <dbl>    <dbl>
#1 a            0        1        0
#2 b            1        0        1
#3 c           NA       NA       NA

Note that applying the & operator to the numeric column . coerces it to logical (0 becomes FALSE, any other number becomes TRUE, and NA stays NA). I'm using the -one_of tidyselect helper because then we don't even need to know how many diseases there are. The columns used in group_by are excluded automatically.
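
To make the coercion and the NA handling concrete, here is a small base R sketch. The vectors are made up for illustration, and it uses base ifelse rather than dplyr's if_else so that it runs without any packages.

```r
disease   <- c(1, 0, NA)        # numeric 0/1 flags, like the disease columns
in_window <- c(TRUE, TRUE, NA)  # NA where diagnosis_dt is missing

# & coerces the numeric flags to logical: 1 -> TRUE, 0 -> FALSE, NA stays NA
hits <- in_window & disease

any_hit  <- any(c(FALSE, TRUE, FALSE))  # at least one hit
no_hit   <- any(c(FALSE, FALSE))        # no hit, nothing missing
all_na   <- any(c(NA, NA))              # nothing known (like ID "c")
mixed_na <- any(c(FALSE, NA))           # no hit, but something missing

# An NA condition propagates to an NA result
flags <- ifelse(c(any_hit, no_hit, all_na, mixed_na), 1, 0)
```

This is why ID c comes out as NA rather than 0. It also means an ID with no hits but some missing dates would be reported as NA, which may or may not be what you want.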

Your version isn't working because 1) you need to summarize, not mutate, since you want one row per ID, and 2) inside your function, x is the column being worked on, and .$diagnosis_dt does not give you the current group's dates. Instead, refer to those columns by their bare names, without $, as in the lambda above, so that they are looked up within each group.
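
To spell out the per-group logic without any dplyr machinery, here is a rough base R sketch. This is a swapped-in technique (split plus lapply), not the across approach itself, and summarise_one_id is a hypothetical helper name. The toy data is rebuilt in compact form so the block runs on its own.

```r
# Toy data from the question, rebuilt so the sketch is self-contained.
n <- c(3, 4, 5)
ex_dat <- data.frame(
  ID = rep(c("a", "b", "c"), times = n),
  start_dt = as.Date(rep(c("2009-01-01", "2009-04-01", "2009-02-01"), times = n)),
  end_dt = as.Date(rep(c("2010-12-31", "2011-03-31", "2011-01-31"), times = n)),
  diagnosis_dt = as.Date(c("2011-01-03", "2010-11-01", "2009-12-01",
                           "2011-04-03", "2010-11-01", "2009-12-01", "2011-12-01",
                           rep(NA, 5))),
  disease1 = c(1, 0, 0, 1, 1, 0, 1, rep(NA, 5)),
  disease2 = c(1, 1, 0, 0, 0, 0, 1, rep(NA, 5)),
  disease3 = c(0, 0, 0, 0, 0, 1, 0, rep(NA, 5))
)

disease_cols <- grep("^disease", names(ex_dat), value = TRUE)

# Hypothetical helper: returns one value per disease column for one ID's rows.
summarise_one_id <- function(d) {
  # d holds only this ID's rows, so the dates line up with the disease flags
  in_window <- d$diagnosis_dt >= d$start_dt & d$diagnosis_dt <= d$end_dt
  vapply(d[disease_cols],
         function(x) as.numeric(any(in_window & x == 1)),
         numeric(1))
}

# One row per ID: split the data by ID, summarise each piece, stack the results.
out <- do.call(rbind, lapply(split(ex_dat, ex_dat$ID), summarise_one_id))
out
```

The result is a matrix with one row per ID (a: 0 1 0, b: 1 0 1, c: NA NA NA), which matches the desired output in the question. The point is that each group's dates and flags are evaluated together and collapsed to a single value, which is what summarize does and mutate does not.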


 