簡體   English   中英

匹配 data.table r 中開始和結束日期相同或接近的行

[英]Match rows with the same or close start and end date in data.table r

以下data.table

df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
                 start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
                 end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
                 variable1=c("a","c","c","d","a",NA,"a","a","b"))
df                 
id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-08-01 2019-12-01         c
3:  2 2019-07-12 2019-07-30         c
4:  2 2017-05-24 2017-11-24         d
5:  3 2016-05-08 2017-07-25         a
6:  3 2017-08-01 2018-08-01      <NA>
7:  4 2019-06-12 2019-12-12         a
8:  4 2017-02-24 2017-08-24         a
9:  4 2017-08-24 2018-08-24         b

在同一個 ID 中,我想比較start_dateend_date 如果一行的end_date在另一行的start_date的 30 天內,我想合並這些行。 所以它看起來像這樣:

id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-07-12 2019-12-01         c
3:  2 2017-05-24 2017-11-24         d
4:  3 2016-05-08 2018-08-01         a
5:  4 2019-06-12 2019-12-12         a
6:  4 2017-02-24 2017-08-24         a
7:  4 2017-08-24 2018-08-24         b

如果行的其他變量相同,則行應與最早的start_date和最新的end_date組合為id號 2。如果variable1NA ,則應將其替換為匹配行中的值作為id號 3。如果variable1具有不同的值,行應保持獨立為iddata.table包含的變量和對象比此處顯示的要多。 最好在 data.table 中使用data.table

不清楚如果 id 有 3 個重疊行且variable1 = c('a', NA, 'b')會發生什么,對於這種情況,對於 NA, variable1應該是什么? a還是b

如果我們在有多個匹配項時只選擇第一個variable1 1,這里有一個選項,先填充 NA,然后在此處借用 David Aurenburg 的解決方案的想法

setorder(df, id, start_date, end_date)
df[, end_d := end_date + 30L]

df[is.na(variable1), variable1 :=
    df[!is.na(variable1)][.SD, on=.(id, start_date<=start_date, end_d>=start_date), mult="first", x.variable1]]

df[, g:= c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
    .(start_date=min(start_date), end_date=max(end_date)), .(id, variable1, g)]

output:

   id variable1 g start_date   end_date
1:  1         a 0 2019-05-08 2019-09-08
2:  2         d 0 2017-05-24 2017-11-24
3:  2         c 1 2019-07-12 2019-12-01
4:  3         a 0 2016-05-08 2018-08-01
5:  4         a 0 2017-02-24 2017-08-24
6:  4         b 0 2017-08-24 2018-08-24
7:  4         a 1 2019-06-12 2019-12-12

數據:

library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
    start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
    end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
    variable1=c("a","c","c","d","a",NA,"a","a","b"))

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM