I have a large data set (as a CSV) and want to calculate the time between dates for each ID. What is the most efficient way to do this? For example, given this data:
ID start end
01 01-04-2017 05-04-2017
01 04-04-2017 06-04-2017
01 11-04-2017 21-04-2017
02 19-05-2017 22-05-2017
02 22-05-2017 24-05-2017
02 02-06-2017 05-06-2017
02 09-06-2017 12-06-2017
...
It's not so simple because the date ranges may overlap, as shown above, and overlapping days should only be counted once.
What I'd like as output is:
ID time
01 15
02 11
...
I've thought about splitting the data into a list by ID (split(data.frame(df$start, df$end), df$ID)), but this is slow for a large data frame. I've also considered looping through the df and comparing the differences, but this is also slow. Is there an efficient way to do this in R?
You can use findInterval to check which interval of start dates each value of end falls into. If two rows overlap, their end values fall into the same interval, so the interval index can be used for grouping and aggregation to eliminate the overlaps:
library(dplyr)
df <- read.table(text = 'ID start end
01 01-04-2017 05-04-2017
01 04-04-2017 06-04-2017
01 11-04-2017 21-04-2017
02 19-05-2017 22-05-2017
02 22-05-2017 24-05-2017
02 02-06-2017 05-06-2017
02 09-06-2017 12-06-2017', header = TRUE, colClasses = 'character') %>%
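  mutate_at(-1, as.Date, format = '%d-%m-%Y') # parse dates
df_aggregated <- df %>%
  group_by(ID) %>%
  # explicit mutate so findInterval runs within each ID
  # (current dplyr evaluates expressions inside group_by() on ungrouped data)
  mutate(overlap = findInterval(end, start)) %>%
  group_by(ID, overlap) %>% # rows sharing an index overlap
  summarise(start = min(start), end = max(end)) %>% # merge each overlapping run
  select(-overlap) %>% ungroup() # clean up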
df_aggregated
#> # A tibble: 5 × 3
#> ID start end
#> <chr> <date> <date>
#> 1 01 2017-04-01 2017-04-06
#> 2 01 2017-04-11 2017-04-21
#> 3 02 2017-05-19 2017-05-24
#> 4 02 2017-06-02 2017-06-05
#> 5 02 2017-06-09 2017-06-12
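To see why the interval index works as a grouping key, it can help to look at what findInterval returns for a single ID. This is a minimal base R sketch using only the ID 01 rows from the example; the variable names are just for illustration.

```r
# Start and end dates for ID 01, copied from the example data
start <- as.Date(c("01-04-2017", "04-04-2017", "11-04-2017"), format = "%d-%m-%Y")
end   <- as.Date(c("05-04-2017", "06-04-2017", "21-04-2017"), format = "%d-%m-%Y")

# For each end date, the index of the last start date that is <= that end date
overlap <- findInterval(end, start)
overlap
#> [1] 2 2 3
```

Rows 1 and 2 share index 2, so they are merged into one range; row 3 gets its own index and stays separate.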
Once the data is tidied, summarizing is easy:
df_aggregated %>% group_by(ID) %>% summarise(span = sum(end - start))
#> # A tibble: 2 × 2
#> ID span
#> <chr> <time>
#> 1 01 15 days
#> 2 02 11 days
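This approach assumes each group is ordered by start, since findInterval requires its second argument to be sorted in non-decreasing order; if it is not, add arrange(start) before computing overlap.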
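One caveat, offered as a hedged aside: findInterval only compares each end with the start values, so when one long range fully contains several shorter ones (say days 0 to 9, 1 to 2 and 3 to 4), the contained ranges can land in different groups and some days get counted twice. If that can happen in your data, tracking a running maximum of end is one alternative. The sketch below uses base R only; covered_days is a hypothetical helper name, not a function from any package, and it assumes start <= end in every row.

```r
# Hypothetical helper: total days covered by a set of ranges, counting overlaps once
covered_days <- function(start, end) {
  o <- order(start)
  s <- as.numeric(start[o])
  e <- as.numeric(end[o])
  # Latest end date among all earlier ranges (-Inf for the first range)
  prev_max_end <- c(-Inf, cummax(e)[-length(e)])
  # A new block begins when a range starts after every earlier range has ended
  block <- cumsum(s > prev_max_end)
  sum(tapply(e, block, max) - tapply(s, block, min))
}

df <- read.table(text = "ID start end
01 01-04-2017 05-04-2017
01 04-04-2017 06-04-2017
01 11-04-2017 21-04-2017
02 19-05-2017 22-05-2017
02 22-05-2017 24-05-2017
02 02-06-2017 05-06-2017
02 09-06-2017 12-06-2017", header = TRUE, colClasses = "character")
df$start <- as.Date(df$start, format = "%d-%m-%Y")
df$end   <- as.Date(df$end, format = "%d-%m-%Y")

# Split row indices (not the whole data frame) by ID, then apply the helper
time <- vapply(split(seq_len(nrow(df)), df$ID),
               function(i) covered_days(df$start[i], df$end[i]),
               numeric(1))
result <- data.frame(ID = names(time), time = unname(time))
result

# Nested ranges: days 0-9 contain 1-2 and 3-4, so 9 days are covered in total
base_day <- as.Date("2017-01-01")
nested <- covered_days(base_day + c(0, 1, 3), base_day + c(9, 2, 4))
```

The helper returns a plain number of days rather than a difftime, which matches the output format asked for in the question. It should also drop into the dplyr pipeline, for example df %>% group_by(ID) %>% summarise(time = covered_days(start, end)).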