簡體   English   中英

帶條件的日期

[英]R lagging through dates with conditions

我有一個數據集,該數據集包含與個人可能重疊的時間片段,也就是說,片段可以比之前的片段開始得晚,但比之前的片段要早。 由於存在重疊問題,一旦按start_date排序,我就很難獲得序列中的最新end_date。

我一直在使用的代碼可以工作到一定程度,但是我必須重復下面的代碼所示。 出於這個原因,我想我需要一些循環函數來處理一個流程,直到滿足條件為止(即end_date晚於上一行的end_date,或者id表示一個新的個體)。

library(dplyr)

## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
        "A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
                        "2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
                        "2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
                        "2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
                      "2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                      "2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
                      "2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
                         "2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                         "2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
                         "2012-09-28","2013-10-11"))

df <- data.frame(id, start_date, end_date, target_date)

使用該方法展平重疊的時間段會使我接近,但我認為添加某個地方來復制target_date需要一定的滯后...

df <- df %>%
    arrange(id, start_date) %>%
    group_by(id) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
                                    cummax(as.numeric(end_date)))[-n()])) %>%
    group_by(id, indx) %>%       
    mutate(latest_date = max(end_date)) %>%
    ungroup()

我會給這個問題一種不同於使用lag 問題是您的數據中存在一個可以具有多個級別的層次結構。

在下面的代碼中,我嘗試查找當前行所屬的其他情節(即完全位於另一情節內)。 然后,我以min(start_date)max(end_date)定義最外層情節。


library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(id, start_date, end_date, target_date) %>%
  mutate(episode = row_number())

df %>%
  select(id, episode,start_date, end_date) %>%
  inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
  group_by(id,episode,start_date, end_date) %>%
  nest() %>%
  mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
                                                                end_date_outer >= ..3))) %>%
  mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
         end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))


這導致


# A tibble: 19 x 8
   id    episode start_date end_date   data              match            start_date_parent end_date_parent
   <fct>   <int> <date>     <date>     <list>            <list>           <date>            <date>         
 1 A           1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23        2009-06-30     
 2 A           2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 3 A           3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 4 A           4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 5 A           5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 6 A           6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 7 A           7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 8 A           8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12        2010-11-05     
 9 A           9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08        2011-11-18     
10 A          10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
11 A          11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
12 A          12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11        2012-09-11     
13 A          13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03        2014-04-01     
14 B          14 2003-12-01 2012-08-29 <tibble [6 x 3]>  <tibble [1 x 3]> 2003-12-01        2012-08-29     
15 B          15 2006-02-28 2006-02-28 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
16 B          16 2012-04-16 2012-04-16 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
17 B          17 2012-08-30 2012-09-28 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-08-30        2012-09-28     
18 B          18 2012-09-19 2013-10-11 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-09-19        2013-10-11     
19 B          19 2012-09-28 2013-07-19 <tibble [6 x 3]>  <tibble [2 x 3]> 2012-09-19        2013-10-11  

我們可以在這里看到id A的前7集是第1集的一部分,其余的則獨立存在。


另一個選擇是例如在數據集變大時使用sqldf


require(sqldf)

result <- sqldf("select
      df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
      from df AS df1

      inner join df AS df2 
      on df1.id = df2.id
      and df1.start_date >= df2.start_date
      and df1.end_date <= df2.end_date

      group by df1.id, df1.episode
      ")

result %>%
  select(id, start_date, end_date) %>%
  distinct()

結果是:


  id start_date   end_date
1  A 2004-01-23 2009-06-30
2  A 2010-10-12 2010-11-05
3  A 2011-08-08 2011-11-18
4  A 2012-06-26 2012-06-26
5  A 2012-09-11 2012-09-11
6  A 2012-10-03 2014-04-01
7  B 2003-12-01 2012-08-29
8  B 2012-08-30 2012-09-28
9  B 2012-09-19 2013-10-11

如果我理解正確,則OP希望確定較長情節完全包含的重疊情節。 此外,擁抱期的結束日期應顯示在下一行( id

這可以通過改變大衛·阿倫伯格的方法來實現:

df %>% 
  arrange(id, start_date) %>% # df must be ordered appropriately
  group_by(id) %>% # create new grouping variable
  mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>% 
  group_by(id, grp) %>% 
  mutate(target_date_new = max(end_date)) %>% 
  group_by(id) %>% # re-group ...
  mutate(target_date_new = lag(target_date_new)) # ... for lagging
 # A tibble: 19 x 6 # Groups: id [2] id start_date end_date target_date grp target_date_new <fct> <date> <date> <date> <int> <date> 1 A 2004-01-23 2009-06-30 NA 1 NA 2 A 2005-03-31 2005-09-17 2009-06-30 1 2009-06-30 3 A 2005-03-31 2005-09-19 2009-06-30 1 2009-06-30 4 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 5 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 6 A 2006-04-03 2006-06-19 2009-06-30 1 2009-06-30 7 A 2007-11-26 2009-06-30 2009-06-30 1 2009-06-30 8 A 2010-10-12 2010-11-05 2009-06-30 2 2009-06-30 9 A 2011-08-08 2011-11-18 2010-11-05 3 2010-11-05 10 A 2012-06-26 2012-06-26 2011-11-18 4 2011-11-18 11 A 2012-06-26 2012-06-26 2012-06-26 4 2012-06-26 12 A 2012-09-11 2012-09-11 2012-06-26 5 2012-06-26 13 A 2012-10-03 2014-04-01 2012-09-11 6 2012-09-11 14 B 2003-12-01 2012-08-29 NA 1 NA 15 B 2006-02-28 2006-02-28 2012-08-29 1 2012-08-29 16 B 2012-04-16 2012-04-16 2012-08-29 1 2012-08-29 17 B 2012-08-30 2012-09-28 2012-08-29 2 2012-08-29 18 B 2012-09-19 2013-10-11 2012-09-28 3 2012-09-28 19 B 2012-09-28 2013-07-19 2013-10-11 3 2013-10-11 

在這里,比較end_date s是因為OP想要檢測完全包含的時間段。 因此,每當出現一個end_date大於先前end_date的任何一個時,情節計數器grp都會前進,因為當前情節未完全包含在先前期間中。

由於cummax()對於Date類型的對象沒有方法,因此日期被強制為整數值。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM