dplyr::mutate_at() with external variables and conditional on their values

I have a dataset in long format (i.e. multiple observations per ID). Each ID has multiple visits at which the individual was assessed for several diseases, which are coded in consecutive columns (disease1-disease3; the toy example shows 3, but my real data has as many as 30). A value of 1 means they were diagnosed with the disease on diagnosis_dt, and 0 means they did not have it. For each ID, I'm interested in summarizing whether or not they had each disease at any visit where diagnosis_dt falls between start_dt and end_dt. Some IDs don't have diagnosis information, and consequently are coded as NA in the respective columns. I'd still like to keep these IDs, with NA, in the result.

A toy example of my dataset is below:

library(dplyr)
library(data.table)

ex_dat <- data.frame(ID = c(rep("a",3),
                  rep("b",4),
                  rep("c",5)),
                  start_dt = as.Date(c(rep("2009-01-01",3),
                                       rep("2009-04-01",4),
                                       rep("2009-02-01",5))),
                  end_dt = as.Date(c(rep("2010-12-31",3),
                                rep("2011-03-31",4),
                                rep("2011-01-31",5))),
           diagnosis_dt = c(as.Date(c("2011-01-03","2010-11-01","2009-12-01")),
                            as.Date(c("2011-04-03","2010-11-01","2009-12-01","2011-12-01")),
                            rep(NA,5)),
           disease1 = c(c(1,0,0),
                        c(1,1,0,1),
                        rep(NA,5)),
           disease2 = c(c(1,1,0),
                        c(0,0,0,1),
                        rep(NA,5)),
           disease3 = c(c(0,0,0),
                        c(0,0,1,0),
                        rep(NA,5))
           )

The desired output is:

  ID disease1 disease2 disease3
1  a        0        1        0
2  b        1        0        1
3  c       NA       NA       NA

I've been trying this for hours now and my latest attempt is:

out <- ex_dat %>% group_by(ID) %>%
           mutate_at(vars(disease1:disease3),
                     function(x) ifelse(!is.na(.$diagnosis_dt) & 
                                          between(.$diagnosis_dt,.$start_dt,.$end_dt) & 
                                          sum(x)>0,
                                        1,0)) %>%
           slice(1) %>%
           select(ID,disease1:disease3)

Here is a tidyverse solution that uses filter to drop the rows that do not meet the desired condition, and then uses complete to restore the missing groups with NA.

library(tidyverse)

ex_dat %>%
  #Group by ID 
  group_by(ID) %>%
  # Stay with the rows for which diagnosis_dt is between start_dt and end_dt
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt ) %>%
  # summarize all variables that start with disease by taking its max value
  summarize_at(vars(starts_with("disease")), max) %>%
  # Restore the IDs dropped by the filter (only NA dates, or no visit in the window).
  # Listing the IDs explicitly also works when ID is a character column, not a factor.
  complete(ID = unique(ex_dat$ID))

# A tibble: 3 x 4
#  ID    disease1 disease2 disease3
# <fct>    <dbl>    <dbl>    <dbl>
# 1 a            0        1        0
# 2 b            1        0        1
# 3 c           NA       NA       NA

Here's an approach with the dplyr across functionality (version >= 1.0.0):

library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt","end_dt","diagnosis_dt")),
                   ~ if_else(any(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt & .),
                             1, 0)))
## A tibble: 3 x 4
#  ID    disease1 disease2 disease3
#  <fct>    <dbl>    <dbl>    <dbl>
#1 a            0        1        0
#2 b            1        0        1
#3 c           NA       NA       NA

Note that applying the & operator to the numeric disease column (.) coerces it to logical, so 0 becomes FALSE and 1 becomes TRUE. I'm using the -one_of tidyselect helper so that we don't even need to know how many disease columns there are. The columns that are being grouped by are automatically excluded.
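To make the coercion concrete, here is a small base R sketch. The vectors in_window and disease are made-up illustrations, not columns from the question. It also shows why an ID with only NA values ends up as NA rather than 0.

```r
# Hypothetical per-visit values for one ID (illustration only)
in_window <- c(FALSE, TRUE, TRUE)  # is the visit inside the date window?
disease   <- c(1, 1, 0)            # 0/1 diagnosis flag, stored as numeric

# `&` coerces the numeric flag to logical: 0 -> FALSE, non-zero -> TRUE
flag <- in_window & disease

# any() is TRUE as soon as one visit satisfies both conditions
any_hit <- any(flag)

# With nothing but NA, `&` yields NA and any() returns NA as well,
# so the NA carries through to the summary instead of becoming 0
all_na <- any(c(NA, NA) & c(NA_real_, NA_real_))

print(flag)
print(any_hit)
print(all_na)
```

Only the second visit is both in the window and positive, so flag is TRUE only there.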

Your version isn't working because 1) you need to summarize, not mutate, and 2) inside the function call, . does not give you the current group's rows from the pipe (in a ~ lambda it is the column being worked on). Instead, refer to those columns by bare name, without .$, inside a ~ lambda so that they are looked up in the current group.
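As a rough sketch of how the original attempt could be adapted along these lines, the following keeps the _at style but summarizes instead of mutating and refers to the date columns by bare name. The data frame toy is a cut-down, made-up version of ex_dat with two disease columns, and the inclusive date bounds are an assumption about what "between" should mean. summarise_at() is superseded by across() in current dplyr but still available.

```r
library(dplyr)

# Cut-down illustration data (hypothetical, modelled on ex_dat)
toy <- data.frame(
  ID           = c("a", "a", "b"),
  start_dt     = as.Date(c("2009-01-01", "2009-01-01", "2009-04-01")),
  end_dt       = as.Date(c("2010-12-31", "2010-12-31", "2011-03-31")),
  diagnosis_dt = as.Date(c("2011-01-03", "2010-11-01", NA)),
  disease1     = c(1, 0, NA),
  disease2     = c(1, 1, NA)
)

out <- toy %>%
  group_by(ID) %>%
  # One row per ID: 1 if any in-window visit has the disease, NA if all is NA.
  # .x is the current disease column; the date columns are bare names.
  summarise_at(
    vars(disease1:disease2),
    ~ as.numeric(any(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt & .x))
  )

print(out)
```

Here ID a has one visit outside the window and one inside it, and ID b has no diagnosis information at all, so it should stay NA in both columns.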
