[英]R lagging through dates with conditions
我有一個數據集,該數據集包含與個人可能重疊的時間片段,也就是說,片段可以比之前的片段開始得晚,但比之前的片段要早。 由於存在重疊問題,一旦按start_date排序,我就很難獲得序列中的最新end_date。
我一直在使用的代碼可以工作到一定程度,但是我必須重復下面的代碼所示。 出於這個原因,我想我需要一些循環函數來處理一個流程,直到滿足條件為止(即end_date晚於上一行的end_date,或者id表示一個新的個體)。
library(dplyr)
## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
"A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
"2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
"2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
"2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
"2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
"2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
"2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
"2012-09-28","2013-10-11"))
df <- data.frame(id, start_date, end_date, target_date)
使用該方法展平重疊的時間段會使我接近,但我認為添加某個地方來復制target_date需要一定的滯后...
df <- df %>%
arrange(id, start_date) %>%
group_by(id) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
cummax(as.numeric(end_date)))[-n()])) %>%
group_by(id, indx) %>%
mutate(latest_date = max(end_date)) %>%
ungroup()
我會給這個問題一種不同於使用lag
。 問題是您的數據中存在一個可以具有多個級別的層次結構。
在下面的代碼中,我嘗試查找當前行所屬的其他情節(即完全位於另一情節內)。 然后,我以min(start_date)
和max(end_date)
定義最外層情節。
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(id, start_date, end_date, target_date) %>%
mutate(episode = row_number())
df %>%
select(id, episode,start_date, end_date) %>%
inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
group_by(id,episode,start_date, end_date) %>%
nest() %>%
mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
end_date_outer >= ..3))) %>%
mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))
這導致
# A tibble: 19 x 8
id episode start_date end_date data match start_date_parent end_date_parent
<fct> <int> <date> <date> <list> <list> <date> <date>
1 A 1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23 2009-06-30
2 A 2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
3 A 3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
4 A 4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
5 A 5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
6 A 6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
7 A 7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
8 A 8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12 2010-11-05
9 A 9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08 2011-11-18
10 A 10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
11 A 11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
12 A 12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11 2012-09-11
13 A 13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03 2014-04-01
14 B 14 2003-12-01 2012-08-29 <tibble [6 x 3]> <tibble [1 x 3]> 2003-12-01 2012-08-29
15 B 15 2006-02-28 2006-02-28 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
16 B 16 2012-04-16 2012-04-16 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
17 B 17 2012-08-30 2012-09-28 <tibble [6 x 3]> <tibble [1 x 3]> 2012-08-30 2012-09-28
18 B 18 2012-09-19 2013-10-11 <tibble [6 x 3]> <tibble [1 x 3]> 2012-09-19 2013-10-11
19 B 19 2012-09-28 2013-07-19 <tibble [6 x 3]> <tibble [2 x 3]> 2012-09-19 2013-10-11
我們可以在這里看到id A的前7集是第1集的一部分,其余的則獨立存在。
另一個選擇是例如在數據集變大時使用sqldf
。
require(sqldf)
result <- sqldf("select
df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
from df AS df1
inner join df AS df2
on df1.id = df2.id
and df1.start_date >= df2.start_date
and df1.end_date <= df2.end_date
group by df1.id, df1.episode
")
result %>%
select(id, start_date, end_date) %>%
distinct()
結果是:
id start_date end_date
1 A 2004-01-23 2009-06-30
2 A 2010-10-12 2010-11-05
3 A 2011-08-08 2011-11-18
4 A 2012-06-26 2012-06-26
5 A 2012-09-11 2012-09-11
6 A 2012-10-03 2014-04-01
7 B 2003-12-01 2012-08-29
8 B 2012-08-30 2012-09-28
9 B 2012-09-19 2013-10-11
如果我理解正確,則OP希望確定較長情節完全包含的重疊情節。 此外,擁抱期的結束日期應顯示在下一行( id
)
這可以通過改變大衛·阿倫伯格的方法來實現:
df %>%
arrange(id, start_date) %>% # df must be ordered appropriately
group_by(id) %>% # create new grouping variable
mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>%
group_by(id, grp) %>%
mutate(target_date_new = max(end_date)) %>%
group_by(id) %>% # re-group ...
mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6 # Groups: id [2] id start_date end_date target_date grp target_date_new <fct> <date> <date> <date> <int> <date> 1 A 2004-01-23 2009-06-30 NA 1 NA 2 A 2005-03-31 2005-09-17 2009-06-30 1 2009-06-30 3 A 2005-03-31 2005-09-19 2009-06-30 1 2009-06-30 4 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 5 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 6 A 2006-04-03 2006-06-19 2009-06-30 1 2009-06-30 7 A 2007-11-26 2009-06-30 2009-06-30 1 2009-06-30 8 A 2010-10-12 2010-11-05 2009-06-30 2 2009-06-30 9 A 2011-08-08 2011-11-18 2010-11-05 3 2010-11-05 10 A 2012-06-26 2012-06-26 2011-11-18 4 2011-11-18 11 A 2012-06-26 2012-06-26 2012-06-26 4 2012-06-26 12 A 2012-09-11 2012-09-11 2012-06-26 5 2012-06-26 13 A 2012-10-03 2014-04-01 2012-09-11 6 2012-09-11 14 B 2003-12-01 2012-08-29 NA 1 NA 15 B 2006-02-28 2006-02-28 2012-08-29 1 2012-08-29 16 B 2012-04-16 2012-04-16 2012-08-29 1 2012-08-29 17 B 2012-08-30 2012-09-28 2012-08-29 2 2012-08-29 18 B 2012-09-19 2013-10-11 2012-09-28 3 2012-09-28 19 B 2012-09-28 2013-07-19 2013-10-11 3 2013-10-11
在這里,比較end_date
s是因為OP想要檢測完全包含的時間段。 因此,每當出現一個end_date
大於先前end_date
的任何一個時,情節計數器grp
都會前進,因為當前情節未完全包含在先前期間中。
由於cummax()
對於Date
類型的對象沒有方法,因此日期被強制為整數值。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.