What is the most efficient way to calculate time between dates in R?

Question

I have a large data set (as a CSV) and want to calculate the time between dates. What is the most efficient way to do this? For example, given this data:

ID    start        end
01    01-04-2017   05-04-2017
01    04-04-2017   06-04-2017
01    11-04-2017   21-04-2017
02    19-05-2017   22-05-2017
02    22-05-2017   24-05-2017
02    02-06-2017   05-06-2017
02    09-06-2017   12-06-2017
...

This is not straightforward because the intervals may overlap, as shown above.

What I'd like as output is:

ID    time
01    15
02    11
...

I've thought about splitting the data into a list by ID (split(data.frame(df$start, df$end), df$ID)), but this is slow for a large data frame. I've also considered looping through the data frame and comparing the differences, but this is also slow. Is there an efficient way to do this in R?

Answer 1

You can use findInterval to find which interval of start dates each end value falls into. When two rows overlap, their end values fall into the same interval, so the interval index can be used for grouping and aggregation to eliminate overlaps:

library(dplyr)

df <- read.table(text = 'ID    start        end
01    01-04-2017   05-04-2017
01    04-04-2017   06-04-2017
01    11-04-2017   21-04-2017
02    19-05-2017   22-05-2017
02    22-05-2017   24-05-2017
02    02-06-2017   05-06-2017
02    09-06-2017   12-06-2017', header = TRUE, colClasses = 'character') %>% 
    mutate_at(-1, as.Date, format = '%d-%m-%Y')    # parse dates

df_aggregated <- df %>% 
    group_by(ID) %>% 
    mutate(overlap = findInterval(end, start)) %>%    # computed within each ID
    group_by(ID, overlap) %>% 
    summarise(start = min(start), end = max(end)) %>% 
    select(-overlap) %>% ungroup()    # clean up

df_aggregated
#> # A tibble: 5 × 3
#>      ID      start        end
#>   <chr>     <date>     <date>
#> 1    01 2017-04-01 2017-04-06
#> 2    01 2017-04-11 2017-04-21
#> 3    02 2017-05-19 2017-05-24
#> 4    02 2017-06-02 2017-06-05
#> 5    02 2017-06-09 2017-06-12
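
To see why the grouping works, it can help to run findInterval on the rows of ID 01 by themselves. This is a minimal base R sketch, with the three date pairs retyped in ISO format for brevity:

```r
# Start and end dates for ID 01, copied from the question's data
start <- as.Date(c("2017-04-01", "2017-04-04", "2017-04-11"))
end   <- as.Date(c("2017-04-05", "2017-04-06", "2017-04-21"))

# For each end date, count how many start dates are on or before it
idx <- findInterval(end, start)
idx
#> [1] 2 2 3
```

The first two rows share index 2 because both end dates fall on or after the second start date and before the third, so they are merged into one block; the third row gets its own index.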

Once the data is tidied, summarizing is easy:

df_aggregated %>% group_by(ID) %>% summarise(span = sum(end - start))
#> # A tibble: 2 × 2
#>      ID    span
#>   <chr>  <time>
#> 1    01 15 days
#> 2    02 11 days
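
The span column is a difftime, while the question asks for plain numbers. One way to convert, sketched in base R on a hypothetical stand-in for the result above (the data frame res and the column name time are assumptions for illustration):

```r
# Hypothetical stand-in for the summarised result, using the spans shown above
res <- data.frame(ID = c("01", "02"))
res$span <- as.difftime(c(15, 11), units = "days")

# Convert the difftime column to a plain number of days
res$time <- as.numeric(res$span, units = "days")
res$time
#> [1] 15 11
```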

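This approach assumes the rows within each ID are ordered by start, since findInterval requires its second argument to be sorted; if they are not, add arrange(start) before the grouping step.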
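
One further caveat, offered as an observation rather than as part of the original answer: findInterval(end, start) compares each row only with the start dates, so a chain of overlaps (row 1 overlaps row 2, row 2 overlaps row 3, but row 1 ends before row 3 starts) can be split into two blocks and the shared days counted twice. A possible alternative, sketched below, tracks the running maximum of end with cummax and opens a new block only when a row starts after every earlier row has ended. The data for ID 03 is made up for illustration, and the helper columns prev_end and block are hypothetical names.

```r
library(dplyr)

# Hypothetical data: rows 1 and 3 are connected only through row 2
chained <- data.frame(
  ID    = "03",
  start = as.Date(c("2017-07-01", "2017-07-04", "2017-07-07")),
  end   = as.Date(c("2017-07-05", "2017-07-08", "2017-07-10"))
)

merged <- chained %>%
  arrange(ID, start) %>%
  group_by(ID) %>%
  mutate(
    # latest end date among earlier rows of this ID (-Inf for the first row)
    prev_end = lag(cummax(as.numeric(end)), default = -Inf),
    # a new block begins when a row starts after all earlier rows have ended
    block = cumsum(as.numeric(start) > prev_end)
  ) %>%
  group_by(ID, block) %>%
  summarise(start = min(start), end = max(end), .groups = "drop")

total <- merged %>%
  group_by(ID) %>%
  summarise(span = sum(end - start))

as.numeric(total$span, units = "days")
#> [1] 9
```

As in the original answer, intervals that merely touch (one row ends on the day the next starts) are treated as a single block here; change > to >= if they should stay separate.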

solution1
0 ACCEPTED 2017-04-23 23:59:36
