Match rows with the same or close start and end date in data.table (R)
Given the following data.table:
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
variable1=c("a","c","c","d","a",NA,"a","a","b"))
df
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-08-01 2019-12-01 c
3: 2 2019-07-12 2019-07-30 c
4: 2 2017-05-24 2017-11-24 d
5: 3 2016-05-08 2017-07-25 a
6: 3 2017-08-01 2018-08-01 <NA>
7: 4 2019-06-12 2019-12-12 a
8: 4 2017-02-24 2017-08-24 a
9: 4 2017-08-24 2018-08-24 b
Within the same id, I want to compare start_date and end_date. If the end_date of one row is within 30 days of the start_date of another row, I want to combine the rows, so that the result looks like this:
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-07-12 2019-12-01 c
3: 2 2017-05-24 2017-11-24 d
4: 3 2016-05-08 2018-08-01 a
5: 4 2019-06-12 2019-12-12 a
6: 4 2017-02-24 2017-08-24 a
7: 4 2017-08-24 2018-08-24 b
If the other variables of the rows are the same, the rows should be combined using the earliest start_date and the latest end_date, as for id 2. If variable1 is NA, it should be replaced with the value from the matching row, as for id 3. If variable1 has different values, the rows should remain separate, as for id 4. The data.table contains more variables and objects than displayed here. Preferably a solution in data.table.
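To make the 30-day rule concrete, the gaps in the sample data can be checked with plain date arithmetic. This is only a small sketch using base R Date objects; the variable names (gap_id2, gap_id3, gap_far) are illustrative and not part of the data.

```r
# Gap in days between one row's end_date and a later row's start_date.
# Subtracting two Date objects yields a difftime in days.
gap_id2 <- as.integer(as.Date("2019-08-01") - as.Date("2019-07-30"))  # id 2, the two "c" rows
gap_id3 <- as.integer(as.Date("2017-08-01") - as.Date("2017-07-25"))  # id 3, "a" row and NA row
gap_far <- as.integer(as.Date("2019-07-12") - as.Date("2017-11-24"))  # id 2, "d" row vs first "c" row

# Rows qualify for combining only when the gap is at most 30 days
c(id2 = gap_id2 <= 30, id3 = gap_id3 <= 30, far = gap_far <= 30)
```

The first two gaps are 2 and 7 days, so those pairs are candidates for combining, while the "d" row of id 2 is much further away and stays on its own.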
It is not clear what should happen if an id has 3 overlapping rows with variable1 = c('a', NA, 'b'). What should variable1 be for the NA row in this case, a or b?
If we just choose the first variable1 when there are multiple matches, here is an option that first fills the NA values and then borrows the idea from David Aurenburg's solution:
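# Assumes df is built as in the data section, with IDate date columns
setorder(df, id, start_date, end_date)
# Pad each end_date by the 30-day tolerance
df[, end_d := end_date + 30L]
# Fill NA variable1 from the first non-NA row of the same id whose padded interval covers this row's start_date
df[is.na(variable1), variable1 :=
    df[!is.na(variable1)][.SD, on=.(id, start_date<=start_date, end_d>=start_date), mult="first", x.variable1]]
# Start a new group g whenever the next start_date exceeds every padded end seen so far, then collapse each group
df[, g := c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
    .(start_date=min(start_date), end_date=max(end_date)), .(id, variable1, g)]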
output:
id variable1 g start_date end_date
1: 1 a 0 2019-05-08 2019-09-08
2: 2 d 0 2017-05-24 2017-11-24
3: 2 c 1 2019-07-12 2019-12-01
4: 3 a 0 2016-05-08 2018-08-01
5: 4 a 0 2017-02-24 2017-08-24
6: 4 b 0 2017-08-24 2018-08-24
7: 4 a 1 2019-06-12 2019-12-12
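The grouping step in the code above can be easier to follow on a smaller input. The sketch below traces it for a single hypothetical id; the table toy, its dates, and the helper column gap_after are made up for illustration, and the comparison is done on explicit integers, which should behave the same as the IDate comparison in the answer.

```r
library(data.table)

# Hypothetical intervals for one id, already sorted by start_date
toy <- data.table(
  start_date = as.IDate(c("2020-01-01", "2020-02-15", "2020-06-01")),
  end_date   = as.IDate(c("2020-01-31", "2020-03-15", "2020-06-30"))
)

# Pad each end by the 30-day tolerance
toy[, end_d := end_date + 30L]

# TRUE where the NEXT row starts after every padded end seen so far;
# cummax keeps a long earlier interval from being forgotten
toy[, gap_after := as.integer(shift(start_date, -1L)) > cummax(as.integer(end_d))]

# Each TRUE opens a new group for the following row; the first row is group 0
toy[, g := c(0L, cumsum(gap_after)[-.N])]

# Collapse each group to its earliest start and latest end
merged <- toy[, .(start_date = min(start_date), end_date = max(end_date)), by = g]
merged
```

Here the second row starts within 30 days of the first row's end, so both land in group 0, while the third row starts long after and opens group 1. In the full answer the same computation runs per id, and variable1 is added to the grouping keys so that rows with different values are not merged.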
data:
library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
variable1=c("a","c","c","d","a",NA,"a","a","b"))