Match rows with the same or close start and end dates in data.table (R)

Question

Given the following data.table:

df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
                 start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
                 end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
                 variable1=c("a","c","c","d","a",NA,"a","a","b"))
df
   id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-08-01 2019-12-01         c
3:  2 2019-07-12 2019-07-30         c
4:  2 2017-05-24 2017-11-24         d
5:  3 2016-05-08 2017-07-25         a
6:  3 2017-08-01 2018-08-01      <NA>
7:  4 2019-06-12 2019-12-12         a
8:  4 2017-02-24 2017-08-24         a
9:  4 2017-08-24 2018-08-24         b

Within the same id, I want to compare start_date and end_date. If the end_date of one row is within 30 days of the start_date of another row, I want to combine the rows, so that the result looks like this:

   id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-07-12 2019-12-01         c
3:  2 2017-05-24 2017-11-24         d
4:  3 2016-05-08 2018-08-01         a
5:  4 2019-06-12 2019-12-12         a
6:  4 2017-02-24 2017-08-24         a
7:  4 2017-08-24 2018-08-24         b

If the other variables of the rows are the same, the rows should be combined, keeping the earliest start_date and the latest end_date, as for id 2. If variable1 is NA, it should be replaced with the value from the matching row, as for id 3. If variable1 has different values, the rows should remain separate, as for id 4. The data.table contains more variables and objects than displayed here. Preferably a data.table solution.
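The 30-day condition can be checked by hand for the two 2019 rows of id 2. The snippet below is only a sketch of that check, not part of any proposed solution; the names x, gap_to_next and within_30 are made up for illustration, and the dates are assumed to be parsed as IDate.

```r
library(data.table)

# The two id == 2 rows from 2019, sorted by start_date
x <- data.table(
  start_date = as.IDate(c("2019-07-12", "2019-08-01")),
  end_date   = as.IDate(c("2019-07-30", "2019-12-01"))
)

# Days between each row's end_date and the next row's start_date
# (NA for the last row, which has no successor)
x[, gap_to_next := as.integer(shift(start_date, -1L)) - as.integer(end_date)]

# TRUE where the next row starts no more than 30 days after this row ends
x[, within_30 := gap_to_next <= 30L]
x
```

The first row ends on 2019-07-30 and the second starts on 2019-08-01, a gap of 2 days, so these two rows qualify for merging.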

Answer 1

It is not clear what should happen if an id has 3 overlapping rows with variable1 = c('a', NA, 'b'). Which value should replace the NA in that case: a or b?
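To make the ambiguous case concrete, here is a small sketch with hypothetical data (the table toy and its dates are invented for illustration). It lists every non-NA row whose date window, extended by 30 days, covers the start of the NA row; two candidates come back, so some tie-breaking rule is needed.

```r
library(data.table)

# Hypothetical data: one id, three overlapping rows, variable1 missing in the last
toy <- data.table(
  id         = 1,
  start_date = as.IDate(c("2019-01-01", "2019-01-15", "2019-02-01")),
  end_date   = as.IDate(c("2019-03-01", "2019-03-10", "2019-04-01")),
  variable1  = c("a", "b", NA)
)
toy[, end_d := end_date + 30L]  # end_date plus the 30-day tolerance

# Non-equi join: all rows with a known variable1 whose [start_date, end_d]
# window contains the start_date of the NA row
known <- toy[!is.na(variable1)]
cand  <- known[toy[is.na(variable1)],
               on = .(id, start_date <= start_date, end_d >= start_date),
               x.variable1]
sort(cand)
```

Both "a" and "b" are valid candidates here, which is exactly the situation the question leaves open.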

If we just choose the first variable1 when there are multiple matches, here is an option that first fills the NA values and then borrows the idea from David Aurenburg's solution:

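# assumes df as defined under "data:" below (dates stored as IDate)
# sort within id, then extend each end_date by the 30-day tolerance
setorder(df, id, start_date, end_date)
df[, end_d := end_date + 30L]

# fill NA variable1 from the first non-NA row of the same id whose
# [start_date, end_d] window contains the NA row's start_date
df[is.na(variable1), variable1 :=
    df[!is.na(variable1)][.SD, on=.(id, start_date<=start_date, end_d>=start_date), mult="first", x.variable1]]

# g increments whenever the next start_date exceeds every end_d seen so far in the id;
# then collapse each (id, variable1, g) group to its overall date range
df[, g:= c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
    .(start_date=min(start_date), end_date=max(end_date)), .(id, variable1, g)]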

output:

   id variable1 g start_date   end_date
1:  1         a 0 2019-05-08 2019-09-08
2:  2         d 0 2017-05-24 2017-11-24
3:  2         c 1 2019-07-12 2019-12-01
4:  3         a 0 2016-05-08 2018-08-01
5:  4         a 0 2017-02-24 2017-08-24
6:  4         b 0 2017-08-24 2018-08-24
7:  4         a 1 2019-06-12 2019-12-12

data:

library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
    start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
    end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
    variable1=c("a","c","c","d","a",NA,"a","a","b"))
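The grouping step in the answer (the g column) may be easier to follow on plain integers. The following is a minimal sketch with hypothetical intervals; the table iv and its values are invented, and the end values are assumed to already include any tolerance such as the 30 days.

```r
library(data.table)

# Hypothetical intervals, already sorted by start
iv <- data.table(start = c(1L, 5L, 20L, 22L, 50L),
                 end   = c(10L, 8L, 25L, 30L, 60L))

# shift(start, -1L) is the next row's start; cummax(end) is the furthest end so far.
# A new group begins whenever the next start lies beyond every end seen so far.
# The last comparison is NA (no next row) and is dropped with [-.N].
iv[, g := c(0L, cumsum(shift(start, -1L) > cummax(end))[-.N])]

# Collapse each group to its overall range
merged <- iv[, .(start = min(start), end = max(end)), by = g]
merged
```

Using cummax rather than the previous row's end matters for nested intervals: the second row (5 to 8) ends before the first (1 to 10), and the running maximum keeps the group open until 10.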

Solution 1
1 Accepted 2020-08-24 01:58:55
