
dplyr - arrange, group, compute difference in dates

I have a large dataset that follows children from a "healthy" event through subsequent "sick" events.

I am trying to use dplyr to compute the time between the "healthy" event and the first "sick" event.

Simulated dataset:

library(dplyr)

id <- c(1, 1, 1, 1, 1, 1)
event <- c("healthy", "", "", "sick", "sick", "")
date_follow_up <- c("4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/6/15")

# tibble() replaces the deprecated data_frame()
df1 <- tibble(id, event, date_follow_up)

Simulated output dataset:

id <- c(1, 1, 1, 1, 1, 1)
event <- c("healthy", "", "", "sick", "sick", "")
date_follow_up <- c("4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/6/15")
# NA (not "") keeps diff_time numeric
diff_time <- c(3, NA, NA, NA, NA, NA)

df1 <- tibble(id, event, date_follow_up, diff_time)

I have only managed to use dplyr to sort the data by "id" and "date_follow_up" and then group by "id":

df2 <- df1 %>% arrange(id, date_follow_up) %>% group_by(id)
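
One caveat about the sorting step, offered as a side note rather than as part of the original attempt: `date_follow_up` is still a character vector at this point, so `arrange()` orders it as text, not chronologically. The minimal sketch below uses base R only, and the names `raw`, `as_text`, and `as_date` are illustrative. It suggests converting with `as.Date()` before sorting.

```r
# Illustrative values only: two follow-up dates stored as m/d/yy strings
raw <- c("4/2/15", "4/10/15")

# Sorting as text compares character by character, so "4/10/15" comes first
as_text <- sort(raw)

# Converting to Date first gives chronological order
as_date <- sort(as.Date(raw, format = "%m/%d/%y"))

print(as_text)
print(as_date)
```

The sample data happens to sort correctly as text because every day is a single digit, so the problem would only appear with longer follow-up.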

I would appreciate help computing the date difference and adding it to the row with the "healthy" event for each individual :)

Using @akrun's example data, here's one way using rolling joins from data.table:

require(data.table)
dt = as.data.table(mydf)[, date_follow_up := as.Date(date_follow_up, format="%m/%d/%y")][]
dt1 = dt[event == "healthy"]
dt2 = dt[event == "sick"]

idx = dt2[dt1, roll = -Inf, which = TRUE, on = c("id", "date_follow_up")]

The idea is: for every healthy date (in dt1), get the index of the first sick date (in dt2) that is >= the healthy date.

Then it's straightforward to subtract the two dates to get the final result.

dt[event == "healthy", 
     diff := as.integer(dt2$date_follow_up[idx] - dt1$date_follow_up)]
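
To make the rolling-join step easier to follow, here is a minimal sketch on toy data. The tables `healthy` and `sick` and their values are made up for illustration and are not the data used above. With `roll = -Inf`, each healthy row is matched to the next sick date within the same id, and an id with no sick date gets `NA`.

```r
library(data.table)

# Toy data (illustrative): one "healthy" date per id, "sick" dates for id 1 only
healthy <- data.table(id = c(1L, 2L),
                      date_follow_up = as.Date(c("2015-04-01", "2015-04-02")))
sick <- data.table(id = c(1L, 1L),
                   date_follow_up = as.Date(c("2015-04-04", "2015-04-05")))

# roll = -Inf: for each healthy row, roll the next sick date (same id) backward;
# which = TRUE returns the matching row number in `sick`, or NA if there is none
idx <- sick[healthy, roll = -Inf, which = TRUE, on = c("id", "date_follow_up")]

# Days from the healthy date to the first sick date on or after it
healthy[, diff := as.integer(sick$date_follow_up[idx] - date_follow_up)]
print(healthy)
```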

I modified your data a bit more to examine this case thoroughly. My suggestion is similar to what alistaire suggested, but it produces NA for id 2 in mydf (which has no "sick" event), whereas alistaire's suggestion produces Inf. First, I converted your dates (stored as character) to Date objects. Then, I grouped the data by id and calculated the time difference by subtracting the first healthy day (i.e., date_follow_up[event == "healthy"][1]) from the first sick day (i.e., date_follow_up[event == "sick"][1]). Finally, I replaced the time difference with NA for irrelevant rows.

   id   event date_follow_up
1   1 healthy         4/1/15
2   1                 4/2/15
3   1                 4/3/15
4   1    sick         4/4/15
5   1    sick         4/5/15
6   2                 4/1/15
7   2 healthy         4/2/15
8   2                 4/3/15
9   2                 4/4/15
10  2                 4/5/15
11  3                 4/1/15
12  3 healthy         4/2/15
13  3    sick         4/3/15
14  3                 4/4/15
15  3                 4/5/15

library(dplyr)
mutate(mydf, date_follow_up = as.Date(date_follow_up, format = "%m/%d/%y")) %>%
group_by(id) %>%
mutate(foo = date_follow_up[event == "sick"][1] - date_follow_up[event == "healthy"][1],        
       foo = replace(foo, which(event != "healthy"), NA))


Source: local data frame [15 x 4]
Groups: id [3]

      id   event date_follow_up            foo
   <int>   <chr>         <date> <S3: difftime>
1      1 healthy     2015-04-01         3 days
2      1             2015-04-02        NA days
3      1             2015-04-03        NA days
4      1    sick     2015-04-04        NA days
5      1    sick     2015-04-05        NA days
6      2             2015-04-01        NA days
7      2 healthy     2015-04-02        NA days
8      2             2015-04-03        NA days
9      2             2015-04-04        NA days
10     2             2015-04-05        NA days
11     3             2015-04-01        NA days
12     3 healthy     2015-04-02         1 days
13     3    sick     2015-04-03        NA days
14     3             2015-04-04        NA days
15     3             2015-04-05        NA days
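
The NA-versus-Inf difference mentioned above comes from how base R treats an empty selection. The sketch below is illustrative only (the vectors `dates` and `event` are made up to mimic an id with no "sick" row): indexing an empty vector with `[1]` gives `NA`, while `min()` of an empty vector warns and returns `Inf`.

```r
# An id with a "healthy" row but no "sick" row (illustrative values)
dates <- as.Date(c("2015-04-02", "2015-04-03"))
event <- c("healthy", "")

# Taking the first element of an empty selection yields NA
first_sick <- dates[event == "sick"][1]

# min() of an empty selection warns and yields Inf (warning suppressed here)
min_sick <- suppressWarnings(min(dates[event == "sick"]))

print(is.na(first_sick))
print(is.infinite(unclass(min_sick)))
```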

DATA

mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L), event = c("healthy", "", "", "sick", "sick", 
"", "healthy", "", "", "", "", "healthy", "sick", "", ""), date_follow_up = c("4/1/15", 
"4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/1/15", "4/2/15", "4/3/15", 
"4/4/15", "4/5/15", "4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15"
)), .Names = c("id", "event", "date_follow_up"), row.names = c(NA, 
-15L), class = "data.frame")

We can also use data.table. Convert the 'data.frame' to a 'data.table' (setDT(mydf)) and change the class of 'date_follow_up' to Date using as.Date. Then, grouped by 'id' and by a grouping variable created from the cumulative sum of the logical vector (event == "healthy"), take the difference between the 'date_follow_up' of the first "sick" 'event' and the first 'date_follow_up' in the group (which would be the "healthy" one) if the group contains any "sick" 'event'; otherwise return NA.

library(data.table)
setDT(mydf)[, date_follow_up := as.Date(date_follow_up, "%m/%d/%y")
    ][, foo := if(any(event == "sick"))  
                  as.integer(date_follow_up[which(event=="sick")[1]] - 
                         date_follow_up[1] )
                else NA_integer_ , 
     by = .(grp= cumsum(event == "healthy"), id)]

Then, we can set "foo" to NA for every "event" that is not "healthy".

mydf[event!= "healthy", foo := NA_integer_]
mydf
#    id   event date_follow_up foo
# 1:  1 healthy     2015-04-01   3
# 2:  1             2015-04-02  NA
# 3:  1             2015-04-03  NA
# 4:  1    sick     2015-04-04  NA
# 5:  1    sick     2015-04-05  NA
# 6:  2             2015-04-01  NA
# 7:  2 healthy     2015-04-02  NA
# 8:  2             2015-04-03  NA
# 9:  2             2015-04-04  NA
#10:  2             2015-04-05  NA
#11:  3             2015-04-01  NA
#12:  3 healthy     2015-04-02   1
#13:  3    sick     2015-04-03  NA
#14:  3             2015-04-04  NA
#15:  3             2015-04-05  NA
#16:  4             2015-04-01  NA
#17:  4 healthy     2015-04-02   3
#18:  4             2015-04-03  NA
#19:  4             2015-04-04  NA
#20:  4    sick     2015-04-05  NA
#21:  4    sick     2015-04-06  NA
#22:  4             2015-04-07  NA
#23:  4 healthy     2015-04-08   2
#24:  4             2015-04-09  NA
#25:  4    sick     2015-04-10  NA

NOTE: Here, I prepared data in which a particular "id" can have multiple "healthy"/"sick" events.
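
As a small illustration of the grouping variable used above, the sketch below shows what `cumsum(event == "healthy")` produces for a single id. The `event` vector is made up for this example. Each "healthy" row starts a new group number, and rows before the first "healthy" fall into group 0.

```r
# Illustrative event sequence for one id with two healthy/sick episodes
event <- c("", "healthy", "", "sick", "", "healthy", "sick")

# TRUE counts as 1, so the running total increases at every "healthy" row
grp <- cumsum(event == "healthy")

print(data.frame(event, grp))
```

Grouping by this counter together with `id` is what lets each "healthy" event be paired with the first "sick" event that follows it.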

data

mydf <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4), event = c("healthy", "", 
"", "sick", "sick", "", "healthy", "", "", "", "", "healthy", 
"sick", "", "", "", "healthy", "", "", "sick", "sick", "", "healthy", 
"", "sick"), date_follow_up = c("4/1/15", "4/2/15", "4/3/15", 
"4/4/15", "4/5/15", "4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15", 
"4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/1/15", "4/2/15", 
"4/3/15", "4/4/15", "4/5/15", "4/6/15", "4/7/15", "4/8/15", "4/9/15", 
"4/10/15")), .Names = c("id", "event", "date_follow_up"), row.names = c(NA, 
25L), class = "data.frame")

Here's an approach, though you may need to adapt it to make it more robust if you have multiple "healthy" events per ID:

# turn dates into the subtractable Date class
df1 %>% mutate(date_follow_up = as.Date(date_follow_up, '%m/%d/%y')) %>% 
    group_by(id) %>%
    # Add a new column. If the row is a "healthy" event,
    mutate(diff_time = ifelse(event == 'healthy', 
                              # subtract its date from the minimum "sick" date
                              min(date_follow_up[event == 'sick']) - date_follow_up, 
                              # else (not a "healthy" event) return NA.
                              NA))

## Source: local data frame [6 x 4]
## 
##      id   event date_follow_up diff_time
##   <dbl>   <chr>         <date>     <dbl>
## 1     1 healthy     2015-04-01         3
## 2     1             2015-04-02        NA
## 3     1             2015-04-03        NA
## 4     1    sick     2015-04-04        NA
## 5     1    sick     2015-04-05        NA
## 6     1             2015-04-06        NA
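
If an ID can have several "healthy" events, one possible adaptation is to borrow the cumulative-sum grouping from the data.table answer. This is a sketch under assumptions, not a tested drop-in: the tibble `df`, the helper columns `episode` and `first_sick`, and the sample values are all hypothetical.

```r
library(dplyr)

# Hypothetical data: one id with two healthy/sick episodes
df <- tibble(
  id = c(1, 1, 1, 1, 1, 1),
  event = c("healthy", "sick", "", "healthy", "", "sick"),
  date_follow_up = as.Date(c("2015-04-01", "2015-04-02", "2015-04-03",
                             "2015-04-04", "2015-04-05", "2015-04-06"))
)

res <- df %>%
  group_by(id) %>%
  # each "healthy" row starts a new episode within its id
  mutate(episode = cumsum(event == "healthy")) %>%
  group_by(id, episode) %>%
  mutate(
    # first "sick" date in the episode, or NA if there is none
    first_sick = date_follow_up[event == "sick"][1],
    # days to that date, kept only on the "healthy" row
    diff_time = if_else(event == "healthy",
                        as.numeric(first_sick - date_follow_up, units = "days"),
                        NA_real_)
  ) %>%
  ungroup()

print(res)
```

Using `[1]` instead of `min()` means an episode without a "sick" row gives `NA` and no warning.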

Here's another approach using dplyr (although it's a bit longer than the earlier solution):

library(dplyr)
df1$date_follow_up <- as.Date(df1$date_follow_up, "%m/%d/%y")

df1 %>% group_by(id, event) %>%
        filter(event %in% c("healthy", "sick")) %>%
        slice(which.min(date_follow_up)) %>% group_by(id) %>%
        mutate(diff_time = lead(date_follow_up) - date_follow_up) %>% 
        right_join(df1, by = c("id", "event" , "date_follow_up"))

# Output 

Source: local data frame [6 x 4]
Groups: id [?]

      id   event   date_follow_up       diff_time
     <dbl>   <chr>         <date>  <S3: difftime>
1     1   healthy     2015-04-01         3 days
2     1               2015-04-02        NA days
3     1               2015-04-03        NA days
4     1      sick     2015-04-04        NA days    
5     1      sick     2015-04-05        NA days
6     1               2015-04-06        NA days
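
The step that does the subtraction in this approach is `lead()`, which shifts a vector by one position so that each row sees the next row's value. A minimal sketch with made-up dates (the names `dates` and `gap` are illustrative):

```r
library(dplyr)

# Illustrative: earliest "healthy" date followed by earliest "sick" date
dates <- as.Date(c("2015-04-01", "2015-04-04"))

# lead() pairs each date with the following one; the last row has no successor
gap <- lead(dates) - dates

print(gap)
```

The first element is the healthy-to-sick gap in days, and the last element is `NA` because nothing follows it.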
