简体   繁体   English

合并 R 中的连续日期范围

[英]Merging contiguous date ranges in R

I want to consolidate the observations into continuous (no gaps in days covered) date ranges.我想将观察结果合并为连续的(涵盖的天数没有间隔)日期范围。 each patid may have multiple ranges in the resulting data frame.I know it can be done with loop.But, is there an effective way to handle this task?每个patid在结果数据框中可能有多个范围。我知道它可以用循环来完成。但是,有没有一种有效的方法来处理这个任务? Notice that the time interval is not overlapping here and the start_date is increasing.请注意,这里的时间间隔没有重叠,并且 start_date 正在增加。

在此处输入图片说明

在此处输入图片说明

data is here( i use R:dput, you can copy and assign to your object in R):数据在这里(我使用 R:dput,您可以在 R 中复制并分配给您的对象):

structure(list(patid = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L), start_date = structure(c(1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 8L, 9L, 6L, 7L, 10L), .Label = c("1/1/2010", "2/1/2010", 
"3/1/2010", "4/1/2010", "5/1/2010", "5/6/2011", "7/1/2012", "8/1/2010", 
"9/1/2010", "9/1/2012"), class = "factor"), end_date = structure(c(1L, 
3L, 4L, 5L, 6L, 1L, 3L, 4L, 8L, 10L, 7L, 9L, 2L), .Label = c("1/31/2010", 
"12/1/2012", "2/28/2010", "3/31/2010", "4/30/2010", "5/31/2010", 
"6/15/2011", "8/31/2010", "8/31/2012", "9/30/2010"), class = "factor")), class = "data.frame", row.names = c(NA, 
-13L))

A data.table approach (with magrittr for more readability) (robust version): data.table方法(使用magrittr以提高可读性)(健壮版):

library(data.table)
library(magrittr)

calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))

df_merged <- setDT(df) %>%
  .[, `:=` (cont_start = as.Date(as.character(start_date), "%m/%d/%Y"),
            cont_end = as.Date(as.character(end_date), "%m/%d/%Y"))] %>%
  .[order(patid, start_date),] %>%
  .[, max_until_now := shift(calc_cummax(cont_end)), by = patid] %>%
  .[, lead_max := shift(max_until_now, type = "lead"), by = patid] %>%
  .[is.na(max_until_now), max_until_now := lead_max, by = patid] %>%
  .[(max_until_now + 1L) >= cont_start, gap_between_contracts := 0, by = patid] %>% 
  .[(max_until_now + 1L) < cont_start, gap_between_contracts := 1, by = patid] %>%
  .[is.na(gap_between_contracts), gap_between_contracts := 0] %>% 
  .[, ("fakeidx") := cumsum(gap_between_contracts), by = patid] %>%
  .[, .(cont_start = min(cont_start), cont_end = max(cont_end)), by = .(patid, fakeidx)] %>% 
  .[, ("fakeidx") := NULL]

Output in your case:您的情况下的输出:

   patid cont_start   cont_end
1:     1 2010-01-01 2010-05-31
2:     2 2010-01-01 2010-03-31
3:     2 2010-08-01 2010-09-30
4:     3 2011-05-06 2011-06-15
5:     3 2012-07-01 2012-12-01

A tidyverse approach (non-robust, simple version):一种tidyverse方法(非健壮的简单版本):

library(tidyverse)

df %>%
  mutate(
    cont_start = as.Date(as.character(start_date), "%m/%d/%Y"),
    cont_end = as.Date(as.character(end_date), "%m/%d/%Y")
  ) %>%
  arrange(patid, cont_start) %>%
  group_by(patid) %>%
  mutate(
    idx = cumsum(coalesce(as.numeric(cont_start != (lag(cont_end) + 1)), 0))
  ) %>%
  group_by(patid, idx) %>%
  summarise(
    cont_start = min(cont_start),
    cont_end = max(cont_end)
  ) %>% select(-idx)

Output:输出:

# A tibble: 5 x 3
# Groups:   patid [3]
  patid cont_start cont_end  
  <int> <date>     <date>    
1     1 2010-01-01 2010-05-31
2     2 2010-01-01 2010-03-31
3     2 2010-08-01 2010-09-30
4     3 2011-05-06 2011-06-15
5     3 2012-07-01 2012-12-01

The output in your case is the same, but if it happens at any time that you would have a start date in the sequence that would have a higher end date than a later start date, you'd need to go for the first (robust) approach (of course if you wouldn't consider this to be an error).您的情况下的输出是相同的,但是如果在任何时候发生,您在序列中的开始日期将比较晚的开始日期更高,那么您需要选择第一个(健壮) 方法(当然,如果您不认为这是错误的话)。

In this case, robustness does not have anything to do with either data.table or tidyverse (you can use the calc_cummax function also by rewriting the tidyverse version, but you'd need to load data.table ).在这种情况下,健壮性与data.tabletidyverse没有任何tidyverse (您也可以通过重写tidyverse版本来使用calc_cummax函数,但您需要加载data.table )。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM