dplyr - arrange, group, compute difference in dates
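I have a large dataset that tracks the follow-up of children from a "healthy" event to subsequent "sick" events.
I am trying to use dplyr to calculate the time between the "healthy" event and the first "sick" event.
Mock dataset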
id <- c(1,1,1,1,1,1)
event <- c("healthy","","","sick","sick","")
date_follow_up <- c("4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/6/15")
df1 <- data_frame(id, event, date_follow_up)
Mock output dataset
id <- c(1,1,1,1,1,1)
event <- c("healthy","","","sick","sick","")
date_follow_up <- c("4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/6/15")
diff_time <- c(3,"","","","","")
df1 <- data_frame(id, event, date_follow_up, diff_time)
So far I have only managed to use dplyr to sort the data by "id" and "date_follow_up", then group it by "id":
df2 <- df1 %>% arrange(id, date_follow_up) %>% group_by(id)
I need help calculating the date difference and adding it next to the row of each individual's "healthy" event :)
Using @akrun's data, here is one way using a rolling join from data.table:
require(data.table)
dt = as.data.table(mydf)[, date_follow_up := as.Date(date_follow_up, format="%m/%d/%y")][]
dt1 = dt[event == "healthy"]
dt2 = dt[event == "sick"]
idx = dt2[dt1, roll = -Inf, which = TRUE, on = c("id", "date_follow_up")]
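The idea is: for each healthy date (in dt1), get the index of the first sick date (in dt2) that is >= the healthy date.
Then subtract the two dates directly to get the final result.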
dt[event == "healthy",
diff := as.integer(dt2$date_follow_up[idx] - dt1$date_follow_up)]
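To make the rolling join easier to follow, here is a toy sketch of what the index vector contains. The tables `healthy` and `sick` and their `date` column are hypothetical names made up for this illustration; they are not part of the answer's data. With `roll = -Inf`, a healthy date that has no exact match in `sick` is matched to the next later sick row for the same id, and an id with no sick rows should come back as NA.

```r
library(data.table)

# Hypothetical toy tables: id 1 has two sick dates, id 2 has none
healthy <- data.table(id = c(1L, 2L),
                      date = as.Date(c("2015-04-01", "2015-04-02")))
sick    <- data.table(id = c(1L, 1L),
                      date = as.Date(c("2015-04-04", "2015-04-05")))

# For each row of `healthy`, find the row number in `sick` holding the
# first sick date on or after the healthy date (NA when there is none)
idx <- sick[healthy, roll = -Inf, which = TRUE, on = c("id", "date")]

# Days from the healthy date to that first sick date
days <- as.integer(sick$date[idx] - healthy$date)
print(idx)
print(days)
```

Because `idx` is aligned with the rows of `healthy`, the subtraction lines up row by row without any further grouping.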
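I also modified your data a bit to check this case thoroughly. My suggestion is similar to alistaire's. Mine produces NA for id 2 in mydf, whereas alistaire's can produce Inf. First, I converted your dates (stored as character) to Date objects. Then I grouped the data by id and computed the time difference by subtracting the first healthy date (i.e. date_follow_up[event == "healthy"][1]) from the first sick date (i.e. date_follow_up[event == "sick"][1]). Finally, I replaced the time difference with NA for the irrelevant rows.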
   id   event date_follow_up
1   1 healthy         4/1/15
2   1                 4/2/15
3   1                 4/3/15
4   1    sick         4/4/15
5   1    sick         4/5/15
6   2                 4/1/15
7   2 healthy         4/2/15
8   2                 4/3/15
9   2                 4/4/15
10  2                 4/5/15
11  3                 4/1/15
12  3 healthy         4/2/15
13  3    sick         4/3/15
14  3                 4/4/15
15  3                 4/5/15
library(dplyr)
mutate(mydf, date_follow_up = as.Date(date_follow_up, format = "%m/%d/%y")) %>%
    group_by(id) %>%
    mutate(foo = date_follow_up[event == "sick"][1] - date_follow_up[event == "healthy"][1],
           foo = replace(foo, which(event != "healthy"), NA))
Source: local data frame [15 x 4]
Groups: id [3]

      id   event date_follow_up            foo
   <int>   <chr>         <date> <S3: difftime>
1      1 healthy     2015-04-01         3 days
2      1             2015-04-02        NA days
3      1             2015-04-03        NA days
4      1    sick     2015-04-04        NA days
5      1    sick     2015-04-05        NA days
6      2             2015-04-01        NA days
7      2 healthy     2015-04-02        NA days
8      2             2015-04-03        NA days
9      2             2015-04-04        NA days
10     2             2015-04-05        NA days
11     3             2015-04-01        NA days
12     3 healthy     2015-04-02         1 days
13     3    sick     2015-04-03        NA days
14     3             2015-04-04        NA days
15     3             2015-04-05        NA days
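The NA for id 2 above comes from how R treats an empty selection. A small base R sketch, using made-up vectors `d` and `event` for a group with no "sick" row, shows the difference between indexing with `[1]` and taking `min()`; the variable names are illustrative only.

```r
# Hypothetical group with a "healthy" event but no "sick" event
d     <- as.Date(c("2015-04-02", "2015-04-03"))
event <- c("healthy", "")

# Indexing an empty Date vector with [1] yields a missing Date
first_sick <- d[event == "sick"][1]

# min() over an empty vector warns and falls back to Inf instead
min_sick <- suppressWarnings(min(d[event == "sick"]))

print(is.na(first_sick))
print(is.infinite(unclass(min_sick)))
```

This is why the `[1]` form gives an NA difference for a child who never gets sick, while a `min()`-based version can give Inf.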
Data
mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L), event = c("healthy", "", "", "sick", "sick",
"", "healthy", "", "", "", "", "healthy", "sick", "", ""), date_follow_up = c("4/1/15",
"4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/1/15", "4/2/15", "4/3/15",
"4/4/15", "4/5/15", "4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15"
)), .Names = c("id", "event", "date_follow_up"), row.names = c(NA,
-15L), class = "data.frame")
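We can also use data.table. Convert the 'data.frame' to a 'data.table' (setDT(mydf)), change the class of 'date_follow_up' to Date with as.Date, then group by 'id' and by a grouping variable created from the cumulative sum of the logical vector (event == "healthy"). Within each group, if there are any "sick" events, we take the difference between the 'date_follow_up' of the first "sick" event and the first 'date_follow_up' in that group (i.e. the "healthy" one); else we return NA.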
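To see what that grouping variable looks like, here is a small base R sketch on a made-up `event` vector for a single id (the vector is illustrative, not taken from the data above). Each "healthy" row starts a new group, and any rows before the first "healthy" event fall into group 0.

```r
# Hypothetical event sequence for one id with two healthy/sick episodes
event <- c("", "healthy", "", "sick", "sick", "", "healthy", "", "sick")

# TRUE counts as 1, so the running total increases at every "healthy" row
grp <- cumsum(event == "healthy")
print(grp)
```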
library(data.table)
setDT(mydf)[, date_follow_up := as.Date(date_follow_up, "%m/%d/%y")
    ][, foo := if(any(event == "sick"))
                   as.integer(date_follow_up[which(event=="sick")[1]] -
                              date_follow_up[1] )
               else NA_integer_ ,
      by = .(grp= cumsum(event == "healthy"), id)]
Then, for all "event" values that are not "healthy", we can change "foo" to NA.
mydf[event!= "healthy", foo := NA_integer_]
mydf
#    id   event date_follow_up foo
# 1:  1 healthy     2015-04-01   3
# 2:  1             2015-04-02  NA
# 3:  1             2015-04-03  NA
# 4:  1    sick     2015-04-04  NA
# 5:  1    sick     2015-04-05  NA
# 6:  2             2015-04-01  NA
# 7:  2 healthy     2015-04-02  NA
# 8:  2             2015-04-03  NA
# 9:  2             2015-04-04  NA
#10:  2             2015-04-05  NA
#11:  3             2015-04-01  NA
#12:  3 healthy     2015-04-02   1
#13:  3    sick     2015-04-03  NA
#14:  3             2015-04-04  NA
#15:  3             2015-04-05  NA
#16:  4             2015-04-01  NA
#17:  4 healthy     2015-04-02   3
#18:  4             2015-04-03  NA
#19:  4             2015-04-04  NA
#20:  4    sick     2015-04-05  NA
#21:  4    sick     2015-04-06  NA
#22:  4             2015-04-07  NA
#23:  4 healthy     2015-04-08   2
#24:  4             2015-04-09  NA
#25:  4    sick     2015-04-10  NA
Note: the data I prepared here can have multiple "healthy"/"sick" events for a particular "id".
mydf <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4), event = c("healthy", "",
"", "sick", "sick", "", "healthy", "", "", "", "", "healthy",
"sick", "", "", "", "healthy", "", "", "sick", "sick", "", "healthy",
"", "sick"), date_follow_up = c("4/1/15", "4/2/15", "4/3/15",
"4/4/15", "4/5/15", "4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15",
"4/1/15", "4/2/15", "4/3/15", "4/4/15", "4/5/15", "4/1/15", "4/2/15",
"4/3/15", "4/4/15", "4/5/15", "4/6/15", "4/7/15", "4/8/15", "4/9/15",
"4/10/15")), .Names = c("id", "event", "date_follow_up"), row.names = c(NA,
25L), class = "data.frame")
Here is one way, though you may need to adapt it to be more robust if there are multiple "healthy" events per ID:
# turn dates into subtractable Date class
df1 %>% mutate(date_follow_up = as.Date(date_follow_up, '%m/%d/%y')) %>%
    group_by(id) %>%
    # Add new column. If there is a "healthy" event,
    mutate(diff_time = ifelse(event == 'healthy',
                              # subtract the date from the minimum "sick" date
                              min(date_follow_up[event == 'sick']) - date_follow_up,
                              # else if it isn't a "healthy" event, return NA.
                              NA))
## Source: local data frame [6 x 4]
##
##      id   event date_follow_up diff_time
##   <dbl>   <chr>         <date>     <dbl>
## 1     1 healthy     2015-04-01         3
## 2     1             2015-04-02        NA
## 3     1             2015-04-03        NA
## 4     1    sick     2015-04-04        NA
## 5     1    sick     2015-04-05        NA
## 6     1             2015-04-06        NA
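For the case of several "healthy" events per ID, one possible adjustment is sketched below. It is not part of the answer above: it borrows the cumulative-sum idea to split each id into episodes, and the helper columns `episode` and `first_sick` as well as the toy data are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical data: one id with two healthy -> sick episodes
df <- tibble(
  id = c(1, 1, 1, 1, 1, 1, 1),
  event = c("healthy", "", "sick", "", "healthy", "", "sick"),
  date_follow_up = as.Date("2015-04-01") + 0:6
)

res <- df %>%
  arrange(id, date_follow_up) %>%
  # every "healthy" row opens a new episode
  group_by(id, episode = cumsum(event == "healthy")) %>%
  mutate(
    # first sick date in the episode; NA when the episode has none
    first_sick = date_follow_up[event == "sick"][1],
    diff_time = if_else(event == "healthy",
                        as.integer(first_sick - date_follow_up),
                        NA_integer_)
  ) %>%
  ungroup() %>%
  select(-episode, -first_sick)

print(res)
```

Using `[1]` rather than `min()` keeps the result NA, not Inf, for an episode that never reaches a "sick" event.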
Here is another way using dplyr (although a bit longer than the previous solutions):
library(dplyr)
df1$date_follow_up <- as.Date(df1$date_follow_up, "%m/%d/%y")
df1 %>% group_by(id, event) %>%
    filter(event %in% c("healthy", "sick")) %>%
    slice(which.min(date_follow_up)) %>% group_by(id) %>%
    mutate(diff_time = lead(date_follow_up) - date_follow_up) %>%
    right_join(df1, by = c("id", "event" , "date_follow_up"))
# Output
Source: local data frame [6 x 4]
Groups: id [?]

     id   event date_follow_up      diff_time
  <dbl>   <chr>         <date> <S3: difftime>
1     1 healthy     2015-04-01         3 days
2     1             2015-04-02        NA days
3     1             2015-04-03        NA days
4     1    sick     2015-04-04        NA days
5     1    sick     2015-04-05        NA days
6     1             2015-04-06        NA days