繁体   English   中英

匹配 data.table r 中开始和结束日期相同或接近的行

[英]Match rows with the same or close start and end date in data.table r

以下data.table

df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
                 start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
                 end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
                 variable1=c("a","c","c","d","a",NA,"a","a","b"))
df                 
id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-08-01 2019-12-01         c
3:  2 2019-07-12 2019-07-30         c
4:  2 2017-05-24 2017-11-24         d
5:  3 2016-05-08 2017-07-25         a
6:  3 2017-08-01 2018-08-01      <NA>
7:  4 2019-06-12 2019-12-12         a
8:  4 2017-02-24 2017-08-24         a
9:  4 2017-08-24 2018-08-24         b

在同一个 ID 中,我想比较start_dateend_date 如果一行的end_date在另一行的start_date的 30 天内,我想合并这些行。 所以它看起来像这样:

id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-07-12 2019-12-01         c
3:  2 2017-05-24 2017-11-24         d
4:  3 2016-05-08 2018-08-01         a
5:  4 2019-06-12 2019-12-12         a
6:  4 2017-02-24 2017-08-24         a
7:  4 2017-08-24 2018-08-24         b

如果行的其他变量相同,则行应与最早的start_date和最新的end_date组合为id号 2。如果variable1NA ,则应将其替换为匹配行中的值作为id号 3。如果variable1具有不同的值,行应保持独立为iddata.table包含的变量和对象比此处显示的要多。 最好在 data.table 中使用data.table

不清楚如果 id 有 3 个重叠行且variable1 = c('a', NA, 'b')会发生什么,对于这种情况,对于 NA, variable1应该是什么? a还是b

如果我们在有多个匹配项时只选择第一个variable1 1,这里有一个选项,先填充 NA,然后在此处借用 David Aurenburg 的解决方案的想法

setorder(df, id, start_date, end_date)
df[, end_d := end_date + 30L]

df[is.na(variable1), variable1 :=
    df[!is.na(variable1)][.SD, on=.(id, start_date<=start_date, end_d>=start_date), mult="first", x.variable1]]

df[, g:= c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
    .(start_date=min(start_date), end_date=max(end_date)), .(id, variable1, g)]

output:

   id variable1 g start_date   end_date
1:  1         a 0 2019-05-08 2019-09-08
2:  2         d 0 2017-05-24 2017-11-24
3:  2         c 1 2019-07-12 2019-12-01
4:  3         a 0 2016-05-08 2018-08-01
5:  4         a 0 2017-02-24 2017-08-24
6:  4         b 0 2017-08-24 2018-08-24
7:  4         a 1 2019-06-12 2019-12-12

数据:

library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
    start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
    end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
    variable1=c("a","c","c","d","a",NA,"a","a","b"))

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM