简体   繁体   中英

How to compare two rows of dates in R and transform into one row

I have a data set which contains multiple rows of date information (intervals) for the same names, which should be compared and eventually transformed into one row. I'd like to achieve the following:

  • If the intervals are overlapping, then keep one row with the earliest and the latest date of the four values
  • If the intervals are not overlapping, but the time between intervals is less or equal to 60 days, do the same as above: thus, keep one row with the earliest and latest date of the four values
  • If the intervals are not overlapping, and the time between intervals is more than 60 days, do nothing (keep both rows)

Data:

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)

Desired result:

names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "16-4-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)

Transform the dates:

df1$date1 <- as.Date(df1$date1, "%d-%m-%Y")
df1$date2 <- as.Date(df1$date2, "%d-%m-%Y")

Here's one way (probably not the most concise) using dplyr . First we convert the dates to Date format, then for each name

  1. figure out if the second interval starts more than 60 days after the first. If so, we tag both rows as keep_both . We sorted the dates so we know the second row comes later.
  2. for rows that aren't marked keep_both , get the min and max dates. Note that I'm assuming the intervals are in the right order, ie date2 is later than date1 in each row here.
  3. filter the data to keep just the first row from each name unless we are keeping both.

Output matches your desired output except for a typo on Rick.

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names, date1, date2)

library(tidyverse)

df1 %>%
  mutate(across(c(date1, date2), lubridate::dmy)) %>%
  arrange(names, date1, date2) %>%
  group_by(names) %>%
  mutate(
    keep_both = any((date1 - lag(date2)) > 60, na.rm = TRUE),
    new_date1 = if_else(keep_both, date1, min(date1)),
    new_date2 = if_else(keep_both, date2, max(date2)),
  ) %>%
  filter(keep_both | row_number() == 1) %>%
  select(names, date1 = new_date1, date2 = new_date2)
#> # A tibble: 5 x 3
#> # Groups:   names [4]
#>   names date1      date2     
#>   <chr> <date>     <date>    
#> 1 Harry 2018-08-27 2019-06-27
#> 2 Harry 2020-02-04 2020-04-08
#> 3 John  2016-03-01 2020-04-16
#> 4 Katie 2019-01-05 2020-04-10
#> 5 Rick  2018-01-13 2020-03-02

Created on 2020-07-13 by the reprex package (v0.3.0)

I used slightly altered sample data, to make sure intervals that are <= 60 days apart from each other, get joined as described in the question..

sample data

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "28-4-2020", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "28-5-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)

  names     date1     date2
1  John  1-3-2016 16-4-2020
2  John 28-4-2020 28-5-2020 # !! <-- altered so interval-gap with line 1 <= 60 days
3  Rick 13-1-2018  2-3-2020
4  Rick  4-2-2020 16-2-2020
5 Katie  5-1-2019 25-2-2020
6 Katie 29-1-2020 10-4-2020
7 Harry 27-8-2018 27-6-2019
8 Harry  4-2-2020  8-4-2020

names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("28-5-2020", "2-3-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)

  names     date1     date2
1  John  1-3-2016 28-5-2020  # !! <-- joined, since gap <= 60 days
2  Rick 13-1-2018  2-3-2020  # !! <-- fixed type in your sample data provided
3 Katie  5-1-2019 10-4-2020
4 Harry 27-8-2018 27-6-2019
5 Harry  4-2-2020  8-4-2020

code

The code makes use of the data.table and the intervals -packages. Since intervals only work on numeric (integer or real) intervals, the date-columns get converted to numeric before interval-creation/-extension/-merging, and back to date-format after processing.

What the code below does:

  1. Loop ( lapply() over cuncks of split (using data.table::split() with the by-argument . , by name. keep.by = FALSE is used, since we do not need it, and the by-name is also stored in the names of the created list.
  2. For each cunck (=name), defines intervals based on the two date-columns, and extend these intervals by half the value of the gap -variable (set to 60 in the code below). So each interval gets extended by 30 days before and after.
    It then joins overlapping/touching (extended) intervals, and finally removes the extensions.
  3. Use data.table::rbindlist() , to merge all the results back together.
  4. set colnames, and convert numeric values dates back to data-format

.

library( data.table )
library( intervals )
#set maximum gap between intervals
gap = 60
#set data to data.table format
setDT(df1)
#set dates to numeric (required by the intervals-package)
df1[, c("date1", "date2") := lapply( .SD, as.numeric ), .SDcols = c("date1", "date2") ]
#where the magic happens (see text above for explanation )
ans <- data.table::rbindlist(
  lapply( split( df1 , by = "names", keep.by = FALSE ), function(x) {
    as.data.table(
    intervals::close_intervals( intervals::contract( intervals::reduce( intervals::expand( 
            intervals::Intervals( x, type = "Z" ), 
            gap/2 ) ), gap/2 ) 
      )
    )
  }),
  use.names = TRUE, idcol = "name" )
#use names from df1
setnames( ans, names(ans), names(df1) )
#set numeric back to date
ans[, c("date1", "date2") := lapply( .SD, as.Date, origin = "1970-01-01" ), .SDcols = c("date1", "date2") ]

output

   names      date1      date2
1:  John 2016-03-01 2020-05-28
2:  Rick 2018-01-13 2020-03-02
3: Katie 2019-01-05 2020-04-10
4: Harry 2018-08-27 2019-06-27
5: Harry 2020-02-04 2020-04-08

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM