I want to consolidate the observations into continuous (no gaps in days covered) date ranges. each patid may have multiple ranges in the resulting data frame.I know it can be done with loop.But, is there an effective way to handle this task? Notice that the time interval is not overlapping here and the start_date is increasing.
data is here( i use R:dput, you can copy and assign to your object in R):
structure(list(patid = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L), start_date = structure(c(1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 8L, 9L, 6L, 7L, 10L), .Label = c("1/1/2010", "2/1/2010",
"3/1/2010", "4/1/2010", "5/1/2010", "5/6/2011", "7/1/2012", "8/1/2010",
"9/1/2010", "9/1/2012"), class = "factor"), end_date = structure(c(1L,
3L, 4L, 5L, 6L, 1L, 3L, 4L, 8L, 10L, 7L, 9L, 2L), .Label = c("1/31/2010",
"12/1/2012", "2/28/2010", "3/31/2010", "4/30/2010", "5/31/2010",
"6/15/2011", "8/31/2010", "8/31/2012", "9/30/2010"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))
A data.table
approach (with magrittr
for more readability) (robust version):
library(data.table)
library(magrittr)
calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))
df_merged <- setDT(df) %>%
.[, `:=` (cont_start = as.Date(as.character(start_date), "%m/%d/%Y"),
cont_end = as.Date(as.character(end_date), "%m/%d/%Y"))] %>%
.[order(patid, start_date),] %>%
.[, max_until_now := shift(calc_cummax(cont_end)), by = patid] %>%
.[, lead_max := shift(max_until_now, type = "lead"), by = patid] %>%
.[is.na(max_until_now), max_until_now := lead_max, by = patid] %>%
.[(max_until_now + 1L) >= cont_start, gap_between_contracts := 0, by = patid] %>%
.[(max_until_now + 1L) < cont_start, gap_between_contracts := 1, by = patid] %>%
.[is.na(gap_between_contracts), gap_between_contracts := 0] %>%
.[, ("fakeidx") := cumsum(gap_between_contracts), by = patid] %>%
.[, .(cont_start = min(cont_start), cont_end = max(cont_end)), by = .(patid, fakeidx)] %>%
.[, ("fakeidx") := NULL]
Output in your case:
patid cont_start cont_end
1: 1 2010-01-01 2010-05-31
2: 2 2010-01-01 2010-03-31
3: 2 2010-08-01 2010-09-30
4: 3 2011-05-06 2011-06-15
5: 3 2012-07-01 2012-12-01
A tidyverse
approach (non-robust, simple version):
library(tidyverse)
df %>%
mutate(
cont_start = as.Date(as.character(start_date), "%m/%d/%Y"),
cont_end = as.Date(as.character(end_date), "%m/%d/%Y")
) %>%
arrange(patid, cont_start) %>%
group_by(patid) %>%
mutate(
idx = cumsum(coalesce(as.numeric(cont_start != (lag(cont_end) + 1)), 0))
) %>%
group_by(patid, idx) %>%
summarise(
cont_start = min(cont_start),
cont_end = max(cont_end)
) %>% select(-idx)
Output:
# A tibble: 5 x 3
# Groups: patid [3]
patid cont_start cont_end
<int> <date> <date>
1 1 2010-01-01 2010-05-31
2 2 2010-01-01 2010-03-31
3 2 2010-08-01 2010-09-30
4 3 2011-05-06 2011-06-15
5 3 2012-07-01 2012-12-01
The output in your case is the same, but if it happens at any time that you would have a start date in the sequence that would have a higher end date than a later start date, you'd need to go for the first (robust) approach (of course if you wouldn't consider this to be an error).
In this case, robustness does not have anything to do with either data.table
or tidyverse
(you can use the calc_cummax
function also by rewriting the tidyverse
version, but you'd need to load data.table
).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.