I have the following dataset:
id | row_name | start_date | end_date | rows_overlap_period |
---|---|---|---|---|
person_1 | 1 | 2010-04-23 | 2010-06-22 | 2,3,4,5,6 |
person_1 | 2 | 2010-04-25 | 2010-06-24 | 3,4,5,6 |
person_1 | 3 | 2010-04-27 | 2010-06-26 | 4,5,6,7 |
person_1 | 4 | 2010-04-29 | 2010-06-28 | 5,6,7,8 |
person_1 | 5 | 2010-04-30 | 2010-06-29 | 6,7,8 |
person_1 | 6 | 2010-05-08 | 2010-07-07 | 7,8 |
person_1 | 7 | 2010-06-26 | 2010-08-25 | 8 |
person_1 | 8 | 2010-06-28 | 2010-08-27 | |
person_2 | 9 | 2010-07-30 | 2010-09-28 | 10 |
person_2 | 10 | 2010-08-02 | 2010-10-01 | |
The `rows_overlap_period` column lists the other records with the same `id` whose `start_date` falls between that row's `start_date` and `end_date`.
However, I would like to iterate within each group to arrive at the following result:
id | row_name | start_date | end_date | rows_overlap_period |
---|---|---|---|---|
person_1 | 1 | 2010-04-23 | 2010-06-22 | 2,3,4,5,6 |
person_1 | 2 | 2010-04-25 | 2010-06-24 | |
person_1 | 3 | 2010-04-27 | 2010-06-26 | |
person_1 | 4 | 2010-04-29 | 2010-06-28 | |
person_1 | 5 | 2010-04-30 | 2010-06-29 | |
person_1 | 6 | 2010-05-08 | 2010-07-07 | |
person_1 | 7 | 2010-06-26 | 2010-08-25 | 8 |
person_1 | 8 | 2010-06-28 | 2010-08-27 | |
person_2 | 9 | 2010-07-30 | 2010-09-28 | 10 |
person_2 | 10 | 2010-08-02 | 2010-10-01 | |
This output would be the result of the following algorithm:
For each group:
Reproducible example (what I have so far):
# Input data
data <- data.frame(
  id = c("person_1", "person_1", "person_1", "person_1", "person_1",
         "person_1", "person_1", "person_1", "person_2",
         "person_2"),
  row_name = 1:10,
  start_date = as.Date(c("2010-04-23", "2010-04-25", "2010-04-27",
                         "2010-04-29", "2010-04-30", "2010-05-08",
                         "2010-06-26", "2010-06-28", "2010-07-30",
                         "2010-08-02")),
  end_date = as.Date(c("2010-06-22", "2010-06-24", "2010-06-26",
                       "2010-06-28", "2010-06-29", "2010-07-07",
                       "2010-08-25", "2010-08-27", "2010-09-28",
                       "2010-10-01")))
# Find overlaps (column rows_overlap_period): for each row, collect the
# row_name of every other row with the same id whose start_date falls
# between this row's start_date and end_date
data <- sqldf::sqldf("
  select a.*,
         coalesce(group_concat(b.row_name), ' ') as rows_overlap_period
  from data a
  left join data b on
       a.id = b.id and
       not a.row_name = b.row_name and
       (b.start_date between a.start_date and a.end_date)
  group by a.rowid
  order by a.rowid")
I have been trying to find a solution using dplyr, data.table, or sqldf directly, but I can't find a way to avoid nested loops, which would degrade performance considerably.
Does anyone have suggestions on how to achieve this?
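We could create a grouping column in addition to the 'id' column: a new subgroup starts at the row just before each row whose `rows_overlap_period` is blank, and only the first row of each subgroup keeps its value.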
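library(dplyr)

data %>%
  group_by(id) %>%
  # start a new subgroup at the row just before each blank overlap list
  mutate(grp = cumsum(lead(!nzchar(trimws(rows_overlap_period)),
                           default = FALSE))) %>%
  group_by(grp, .add = TRUE) %>%
  # keep the overlap list only on the first row of each (id, grp) subgroup
  mutate(rows_overlap_period = case_when(row_number() == 1 ~ rows_overlap_period,
                                         TRUE ~ "")) %>%
  ungroup() %>%
  select(-grp)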
Output:
# A tibble: 10 × 5
   id       row_name start_date end_date   rows_overlap_period
   <chr>       <int> <date>     <date>     <chr>
 1 person_1        1 2010-04-23 2010-06-22 "2,3,4,5,6"
 2 person_1        2 2010-04-25 2010-06-24 ""
 3 person_1        3 2010-04-27 2010-06-26 ""
 4 person_1        4 2010-04-29 2010-06-28 ""
 5 person_1        5 2010-04-30 2010-06-29 ""
 6 person_1        6 2010-05-08 2010-07-07 ""
 7 person_1        7 2010-06-26 2010-08-25 "8"
 8 person_1        8 2010-06-28 2010-08-27 ""
 9 person_2        9 2010-07-30 2010-09-28 "10"
10 person_2       10 2010-08-02 2010-10-01 ""
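The grouping trick above depends on where the blank rows fall, so on other data it may not give the same result as walking the rows one by one. As a cross-check, here is a minimal base R sketch of one possible reading of the intended rule. This reading is an assumption, since the question does not spell the steps out: within each `id`, go through the rows in order; a row keeps its overlap list unless an earlier kept row already listed it, and every row it lists is then marked as covered. The helper name `keep_uncovered` and the data frame `overlap_data` are hypothetical, and the rows are assumed to be sorted by `start_date` within each `id`.

```r
# Hypothetical helper (assumed rule, not stated in the question):
# a row keeps its overlap list only if no earlier kept row already listed it.
keep_uncovered <- function(row_name, overlaps) {
  covered <- character(0)
  out <- character(length(overlaps))   # every entry starts as ""
  for (i in seq_along(row_name)) {
    # rows already listed by an earlier kept row stay blank
    if (as.character(row_name[i]) %in% covered) next
    ids <- trimws(strsplit(overlaps[i], ",", fixed = TRUE)[[1]])
    ids <- ids[nzchar(ids)]            # drop the ' ' placeholder from coalesce()
    out[i] <- paste(ids, collapse = ",")
    covered <- c(covered, ids)
  }
  out
}

# Same overlap lists as in the question, typed in directly so the
# sketch runs without sqldf.
overlap_data <- data.frame(
  id = c(rep("person_1", 8), rep("person_2", 2)),
  row_name = 1:10,
  rows_overlap_period = c("2,3,4,5,6", "3,4,5,6", "4,5,6,7", "5,6,7,8",
                          "6,7,8", "7,8", "8", " ", "10", " "),
  stringsAsFactors = FALSE
)

# Apply the helper per id and put the pieces back in the original row order.
result <- overlap_data
result$rows_overlap_period <- unsplit(
  lapply(split(overlap_data, overlap_data$id),
         function(g) keep_uncovered(g$row_name, g$rows_overlap_period)),
  overlap_data$id
)
print(result)
```

This is a single pass per group rather than a loop within a loop. With dplyr loaded, the same helper could presumably be called as `group_by(id) %>% mutate(rows_overlap_period = keep_uncovered(row_name, rows_overlap_period))`.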