dplyr::mutate_at() with external variables, conditioning on their values

Question

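I have a long-format dataset (i.e. multiple observations per ID). Each ID has multiple visits at which the individual was assessed for diseases (the toy example shows 3, but my real data has up to 30), and these are coded in consecutive columns (disease1-disease3). A value of 1 means they were diagnosed with the disease at diagnosis_dt, and 0 means they did not have it. For each ID, I want to summarize whether they had each of the diseases at any visit where diagnosis_dt falls between start_dt and end_dt. Some IDs have no diagnosis information and are therefore coded as NA in the respective columns. I still want to keep these IDs in the output.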

A toy example of my dataset is below:

library(dplyr)
library(data.table)

ex_dat <- data.frame(ID = c(rep("a",3),
                  rep("b",4),
                  rep("c",5)),
                  start_dt = as.Date(c(rep("2009-01-01",3),
                                       rep("2009-04-01",4),
                                       rep("2009-02-01",5))),
                  end_dt = as.Date(c(rep("2010-12-31",3),
                                rep("2011-03-31",4),
                                rep("2011-01-31",5))),
           diagnosis_dt = c(as.Date(c("2011-01-03","2010-11-01","2009-12-01")),
                            as.Date(c("2011-04-03","2010-11-01","2009-12-01","2011-12-01")),
                            rep(NA,5)),
           disease1 = c(c(1,0,0),
                        c(1,1,0,1),
                        rep(NA,5)),
           disease2 = c(c(1,1,0),
                        c(0,0,0,1),
                        rep(NA,5)),
           disease3 = c(c(0,0,0),
                        c(0,0,1,0),
                        rep(NA,5))
           )

The desired output is:

  ID disease1 disease2 disease3
1  a        0        1        0
2  b        1        0        1
3  c       NA       NA       NA

I have been trying for hours; my most recent attempt is:

out <- ex_dat %>% group_by(ID) %>%
           mutate_at(vars(disease1:disease3),
                     function(x) ifelse(!is.na(.$diagnosis_dt) & 
                                          between(.$diagnosis_dt,.$start_dt,.$end_dt) & 
                                          sum(x)>0,
                                        1,0)) %>%
           slice(1) %>%
           select(ID,disease1:disease3)

Answer 1

Here is a tidyverse solution that uses filter to drop the rows that do not meet the required condition, then uses complete to fill in the missing groups with NA.

library(tidyverse)

ex_dat %>%
  # Group by ID
  group_by(ID) %>%
  # Keep only the rows where diagnosis_dt lies between start_dt and end_dt
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  # Summarize every column whose name starts with "disease" by taking its max
  summarize_at(vars(starts_with("disease")), max) %>%
  # Restore the IDs dropped by the filter (all NA or no row met the criteria).
  # Note: this relies on ID being a factor, the data.frame() default before R 4.0.0
  complete(ID)

# A tibble: 3 x 4
#  ID    disease1 disease2 disease3
# <fct>    <dbl>    <dbl>    <dbl>
# 1 a            0        1        0
# 2 b            1        0        1
# 3 c           NA       NA       NA
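
One caveat when running the code above on R 4.0.0 or later: data.frame() no longer converts strings to factors by default, so ID is a character column and complete(ID) has no unused factor level from which to restore ID c. The sketch below is one possible adaptation, not the answerer's code. It assumes dplyr >= 1.0.0 and tidyr, rebuilds the toy data with ID as character, and passes the full set of IDs to complete() explicitly. The name res is illustrative.

```r
library(dplyr)
library(tidyr)

# Toy data from the question, rebuilt with ID as a character column
ex_dat <- data.frame(
  ID = c(rep("a", 3), rep("b", 4), rep("c", 5)),
  start_dt = as.Date(c(rep("2009-01-01", 3), rep("2009-04-01", 4),
                       rep("2009-02-01", 5))),
  end_dt = as.Date(c(rep("2010-12-31", 3), rep("2011-03-31", 4),
                     rep("2011-01-31", 5))),
  diagnosis_dt = as.Date(c("2011-01-03", "2010-11-01", "2009-12-01",
                           "2011-04-03", "2010-11-01", "2009-12-01",
                           "2011-12-01", rep(NA, 5))),
  disease1 = c(1, 0, 0, 1, 1, 0, 1, rep(NA, 5)),
  disease2 = c(1, 1, 0, 0, 0, 0, 1, rep(NA, 5)),
  disease3 = c(0, 0, 0, 0, 0, 1, 0, rep(NA, 5)),
  stringsAsFactors = FALSE
)

res <- ex_dat %>%
  # Rows with an NA diagnosis_dt evaluate to NA here and are dropped by filter()
  filter(diagnosis_dt >= start_dt, diagnosis_dt <= end_dt) %>%
  group_by(ID) %>%
  summarise(across(starts_with("disease"), max)) %>%
  # A character column carries no unused levels, so list every ID explicitly
  complete(ID = unique(ex_dat$ID))

res
```

With the toy data, res should again contain one row per ID, with NA in every disease column for ID c.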

Answer 2

Here is an approach using dplyr's across() function (version >= 1.0.0):

library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt","end_dt","diagnosis_dt")),
                   ~ if_else(any(diagnosis_dt > start_dt & diagnosis_dt < end_dt & .),
                             1, 0)))
## A tibble: 3 x 4
#  ID    disease1 disease2 disease3
#  <fct>    <dbl>    <dbl>    <dbl>
#1 a            0        1        0
#2 b            1        0        1
#3 c           NA       NA       NA

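Note that applying the & operator to the numeric (0/1) column, referred to as . in the lambda, coerces it to logical. I use the -one_of tidyselect helper because that way we do not even need to know how many diseases there are. The column currently being grouped by is excluded automatically.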
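
The coercion and the NA result for ID c can be checked in isolation. The sketch below is only an illustration with hand-built vectors: in_window, disease2, na_window, na_disease, flag_a and flag_c are made-up names, and the values mirror what the strict comparisons in the code above give for IDs a and c in the toy data.

```r
library(dplyr)

# ID "a": is diagnosis_dt strictly inside (start_dt, end_dt) for each row?
in_window <- c(FALSE, TRUE, TRUE)
# ID "a": disease2 flags from the toy data
disease2 <- c(1, 1, 0)

# & coerces the numeric flags to logical (0 -> FALSE, non-zero -> TRUE)
in_window & disease2
# FALSE TRUE FALSE

# One TRUE is enough for any(), so the summary flag is 1
flag_a <- if_else(any(in_window & disease2), 1, 0)

# ID "c": every date comparison and every disease flag is NA,
# so any() returns NA and if_else() passes that NA through
na_window <- rep(NA, 5)
na_disease <- rep(NA_real_, 5)
flag_c <- if_else(any(na_window & na_disease), 1, 0)
```

This is why no separate step is needed to keep the all-NA IDs: the missing value propagates through any() and if_else() on its own.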

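Your version does not work because 1) you need to summarize, not mutate, and 2) inside the function call, . refers to the column being processed, not to the data coming from the pipe. Instead, you need to access those columns directly, without $, from the calling environment.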
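
Applied to the attempt in the question, those two changes might look like the sketch below. It is an assumption-laden rewrite, not code from the thread: it keeps the question's disease1:disease3 column range and the inclusive bounds that between() implies, swaps mutate_at() and slice(1) for summarise() with across(), and refers to diagnosis_dt, start_dt and end_dt by bare name. It assumes dplyr >= 1.0.0.

```r
library(dplyr)

# Toy data from the question
ex_dat <- data.frame(
  ID = c(rep("a", 3), rep("b", 4), rep("c", 5)),
  start_dt = as.Date(c(rep("2009-01-01", 3), rep("2009-04-01", 4),
                       rep("2009-02-01", 5))),
  end_dt = as.Date(c(rep("2010-12-31", 3), rep("2011-03-31", 4),
                     rep("2011-01-31", 5))),
  diagnosis_dt = as.Date(c("2011-01-03", "2010-11-01", "2009-12-01",
                           "2011-04-03", "2010-11-01", "2009-12-01",
                           "2011-12-01", rep(NA, 5))),
  disease1 = c(1, 0, 0, 1, 1, 0, 1, rep(NA, 5)),
  disease2 = c(1, 1, 0, 0, 0, 0, 1, rep(NA, 5)),
  disease3 = c(0, 0, 0, 0, 0, 1, 0, rep(NA, 5))
)

out <- ex_dat %>%
  group_by(ID) %>%
  # summarise() returns one row per ID, so slice(1) and select() are not needed
  summarise(across(disease1:disease3,
                   # Bare column names resolve within the current group
                   ~ if_else(any(diagnosis_dt >= start_dt &
                                   diagnosis_dt <= end_dt &
                                   .x == 1),
                             1, 0)))

out
```

For the toy data this should reproduce the desired output from the question, including the NA row for ID c.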

Solution 1
2 2020-06-16 20:53:20

Solution 2
1 Accepted 2020-06-16 20:44:58
