Aggregate dates into date intervals / periods in R
I have the following sample data:
library(tibble)
sample_data <- tibble(
  emp_name = c("john", "john", "john", "john", "john", "john", "john"),
  task = c("carpenter", "carpenter", "carpenter", "painter", "painter", "carpenter", "carpenter"),
  date_stamp = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-07", "2019-01-08", "2019-01-30", "2019-02-02")
)
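For this data, I need to aggregate the dates into intervals.
The rule is: if consecutive date_stamp values listed for the same attributes have no dates missing between them, they should be aggregated into a single interval. Otherwise, date_stamp_from and date_stamp_to should both equal date_stamp.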
desired_result <- tibble(
  emp_name = c("john", "john", "john", "john"),
  task = c("carpenter", "painter", "carpenter", "carpenter"),
  date_stamp_from = c("2019-01-01", "2019-01-07", "2019-01-30", "2019-02-02"),
  date_stamp_to = c("2019-01-03", "2019-01-08", "2019-01-30", "2019-02-02"),
  count_dates = c(3, 2, 1, 1)
)
What is the most efficient way to solve this? The original data set has around 10,000 records.
We can use diff and cumsum to create groups, then compute the first and last date_stamp and the number of rows in each group.
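As a small illustration of how diff and cumsum turn date gaps into group ids, here is a sketch on a bare vector. The vector `dates` and the intermediate names `gaps`, `starts_new_run` and `group_id` are made up for this example and are not part of the question's data.

```r
# Hypothetical toy vector of sorted dates (not the question's data)
dates <- as.Date(c("2019-01-01", "2019-01-02", "2019-01-03",
                   "2019-01-07", "2019-01-08"))

# Gap in days between each date and the one before it
gaps <- as.numeric(diff(dates))

# A new run starts at the first element and wherever the gap exceeds one day
starts_new_run <- c(TRUE, gaps > 1)

# The running count of run starts serves as a group id
group_id <- cumsum(starts_new_run)
group_id
# 1 1 1 2 2
```

Rows that share a group id belong to the same unbroken run of dates, which is what the grouping step in the pipeline relies on.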
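library(dplyr)
sample_data %>%
  # Convert the character column to Date so that diff() works in days
  mutate(date_stamp = as.Date(date_stamp)) %>%
  # Start a new group whenever the gap to the previous date exceeds one day
  group_by(gr = cumsum(c(TRUE, diff(date_stamp) > 1))) %>%
  mutate(date_stamp_from = first(date_stamp),
         date_stamp_to = last(date_stamp),
         count_dates = n()) %>%
  # Keep one row per group, then drop the helper columns
  slice(1L) %>%
  ungroup() %>%
  select(-gr, -date_stamp)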
# A tibble: 4 x 5
# emp_name task date_stamp_from date_stamp_to count_dates
# <chr> <chr> <date> <date> <int>
#1 john carpenter 2019-01-01 2019-01-03 3
#2 john painter 2019-01-07 2019-01-08 2
#3 john carpenter 2019-01-30 2019-01-30 1
#4 john carpenter 2019-02-02 2019-02-02 1
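Note that the pipeline above forms groups from date gaps alone. In the sample data every change of task happens to coincide with a gap, so the result matches. A possible variant, sketched below, also starts a new interval when task changes and handles each emp_name separately. This assumes that reading of the rule ("for the same attributes"), which the question does not spell out. The names `new_run` and `intervals` are introduced here for illustration.

```r
library(dplyr)
library(tibble)

sample_data <- tibble(
  emp_name = c("john", "john", "john", "john", "john", "john", "john"),
  task = c("carpenter", "carpenter", "carpenter", "painter", "painter",
           "carpenter", "carpenter"),
  date_stamp = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-07",
                 "2019-01-08", "2019-01-30", "2019-02-02")
)

intervals <- sample_data %>%
  mutate(date_stamp = as.Date(date_stamp)) %>%
  # Sort so that diff() and lag() compare neighbouring dates per employee
  arrange(emp_name, date_stamp) %>%
  group_by(emp_name) %>%
  mutate(
    # Assumption: a run breaks on a gap of more than one day OR a task change
    new_run = c(TRUE, diff(date_stamp) > 1) |
      task != lag(task, default = first(task)),
    gr = cumsum(new_run)
  ) %>%
  group_by(emp_name, task, gr) %>%
  summarise(
    date_stamp_from = min(date_stamp),
    date_stamp_to = max(date_stamp),
    count_dates = n(),
    .groups = "drop"
  ) %>%
  # summarise() orders by the grouping columns, so restore date order
  arrange(emp_name, date_stamp_from) %>%
  select(-gr)

intervals
```

On the sample data this should give the same four intervals as the desired result. With around 10,000 rows either version should be fast enough, since both rely on vectorised operations.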