How to convert datetime to POSIXct or numeric / aggregate timeseries intervals to hourly intervals?
Aggregate timeseries intervals by hour
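I have a dataset with information about parking tickets: their start and end times and where they were purchased (the group). I need to do a time series analysis to predict when and where tickets will be bought in the future. To do that, I need to convert the data into a time series format whose value is the number of tickets active at a given point in time.
A sample of my data: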
library(lubridate)
timeseries <- data.frame(start = c("2016-12-31 20:42:00",
"2016-12-31 21:41:00",
"2016-12-31 21:15:00",
"2016-12-31 17:19:00",
"2016-12-31 21:47:00",
"2016-12-31 16:58:00"),
end = c("2016-12-31 23:07:00",
"2016-12-31 23:07:00",
"2016-12-31 23:08:00",
"2016-12-31 23:09:00",
"2016-12-31 23:11:00",
"2016-12-31 23:11:00"),
group = c(1,2,1,2,1,2),
stringsAsFactors = FALSE)
timeseries$start <- as.POSIXlt(timeseries$start)
timeseries$end <- as.POSIXlt(timeseries$end)
timeseries$interval <- interval(timeseries$start, timeseries$end, tzone="UTC")
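Since the title asks about converting datetimes to POSIXct or numeric, here is a minimal base R sketch of that conversion. The time zone (UTC) and the variable names `start_chr`, `start_ct`, `start_num` and `back` are assumptions made for illustration; they are not part of the original question.

```r
# Parse character timestamps as POSIXct with an explicit time zone.
# UTC is an assumption; use the zone in which the tickets were recorded.
start_chr <- c("2016-12-31 20:42:00", "2016-12-31 21:41:00")
start_ct  <- as.POSIXct(start_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# POSIXct is stored as the number of seconds since 1970-01-01 00:00:00 UTC,
# so as.numeric() exposes that count directly.
start_num <- as.numeric(start_ct)

# The two timestamps are 59 minutes apart, i.e. 3540 seconds.
diff(start_num)

# Round trip from numeric seconds back to POSIXct.
back <- as.POSIXct(start_num, origin = "1970-01-01", tz = "UTC")
all(back == start_ct)
```

Unlike POSIXlt, a POSIXct value is a plain numeric vector with a class attribute, which is generally the more convenient form for data frame columns.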
An example of the time slots into which I want to summarise the information (by group):
summary_hours <- data.frame(timeStart = c("2016-12-31 16:00",
"2016-12-31 17:00",
"2016-12-31 18:00",
"2016-12-31 19:00",
"2016-12-31 20:00",
"2016-12-31 21:00",
"2016-12-31 22:00",
"2016-12-31 23:00"),
timeEnd = c("2016-12-31 17:00",
"2016-12-31 18:00",
"2016-12-31 19:00",
"2016-12-31 20:00",
"2016-12-31 21:00",
"2016-12-31 22:00",
"2016-12-31 23:00",
"2017-01-01 00:00"))
summary_hours$timeStart <- as.POSIXlt(summary_hours$timeStart)
summary_hours$timeEnd <- as.POSIXlt(summary_hours$timeEnd)
summary_hours$interval <- interval(summary_hours$timeStart, summary_hours$timeEnd, tzone="UTC")
My current approach seems very inefficient when the dataset spans two years.
library("lubridate")
intersect_in_mins <- function(interval) {
return(as.period(intersect(interval, summary_hours$interval), "minutes")@minute)
}
summary_hours$group1 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 1)$interval, intersect_in_mins))), na.rm = TRUE)
summary_hours$group2 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 2)$interval, intersect_in_mins))), na.rm = TRUE)
summary_hours
timeStart timeEnd interval group1 group2
1 2016-12-31 16:00:00 2016-12-31 17:00:00 2016-12-31 16:00:00 UTC--2016-12-31 17:00:00 UTC 0 2
2 2016-12-31 17:00:00 2016-12-31 18:00:00 2016-12-31 17:00:00 UTC--2016-12-31 18:00:00 UTC 0 101
3 2016-12-31 18:00:00 2016-12-31 19:00:00 2016-12-31 18:00:00 UTC--2016-12-31 19:00:00 UTC 0 120
4 2016-12-31 19:00:00 2016-12-31 20:00:00 2016-12-31 19:00:00 UTC--2016-12-31 20:00:00 UTC 0 120
5 2016-12-31 20:00:00 2016-12-31 21:00:00 2016-12-31 20:00:00 UTC--2016-12-31 21:00:00 UTC 18 120
6 2016-12-31 21:00:00 2016-12-31 22:00:00 2016-12-31 21:00:00 UTC--2016-12-31 22:00:00 UTC 118 139
7 2016-12-31 22:00:00 2016-12-31 23:00:00 2016-12-31 22:00:00 UTC--2016-12-31 23:00:00 UTC 180 180
8 2016-12-31 23:00:00 2017-01-01 00:00:00 2016-12-31 23:00:00 UTC--2017-01-01 00:00:00 UTC 26 27
Do you have any suggestions for a nice library that can do this kind of magic automatically?
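In his comments here and here, the OP has changed the objective of the question. The request is now to aggregate the "number of minutes of active tickets" for each one-hour interval.
This requires a completely different approach, which IMHO justifies posting a separate answer.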
To check which tickets are active within each one-hour interval, the foverlaps() function from the data.table package can be used:
library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")
# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]
# create sequence of intervals of one hour covering all given times
hours_seq <- timeseries[, {
tmp <- seq(lubridate::floor_date(min(start, end), "hour"),
lubridate::ceiling_date(max(start, end), "hour"),
by = "1 hour")
.(start = head(tmp, -1L), end = tail(tmp, -1L))
}]
hours_seq
                 start                 end
1: 2016-12-31 16:00:00 2016-12-31 17:00:00
2: 2016-12-31 17:00:00 2016-12-31 18:00:00
3: 2016-12-31 18:00:00 2016-12-31 19:00:00
4: 2016-12-31 19:00:00 2016-12-31 20:00:00
5: 2016-12-31 20:00:00 2016-12-31 21:00:00
6: 2016-12-31 21:00:00 2016-12-31 22:00:00
7: 2016-12-31 22:00:00 2016-12-31 23:00:00
8: 2016-12-31 23:00:00 2017-01-01 00:00:00
# split up given ticket intervals in hour pieces
foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
# compute active minutes and aggregate
, .(cnt_active_tickets = .N,
sum_active_minutes = sum(as.integer(
difftime(pmin(end, i.end), pmax(start, i.start), units = "mins")))),
keyby = .(group, interval_start = i.start, interval_end = i.end)]
    group      interval_start        interval_end cnt_active_tickets sum_active_minutes
 1:     1 2016-12-31 20:00:00 2016-12-31 21:00:00                  1                 18
 2:     1 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                118
 3:     1 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
 4:     1 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 26
 5:     2 2016-12-31 16:00:00 2016-12-31 17:00:00                  1                  2
 6:     2 2016-12-31 17:00:00 2016-12-31 18:00:00                  2                101
 7:     2 2016-12-31 18:00:00 2016-12-31 19:00:00                  2                120
 8:     2 2016-12-31 19:00:00 2016-12-31 20:00:00                  2                120
 9:     2 2016-12-31 20:00:00 2016-12-31 21:00:00                  2                120
10:     2 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                139
11:     2 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
12:     2 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 27
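Note that this approach also accounts for "short-term parkers", i.e., tickets that are active for less than an hour, starting after one full hour and ending before the next.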
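The clipping step, `pmin(end, i.end)` minus `pmax(start, i.start)`, can be illustrated in isolation with base R. This is only a sketch: `overlap_minutes()` and `utc()` are hypothetical helpers introduced here, and the 21:10 to 21:40 "short-term" ticket is made-up illustration data that is not in the OP's sample.

```r
# Hypothetical helper: minutes of overlap between one ticket and a set of bins.
# All arguments are POSIXct; they are compared as numeric seconds.
overlap_minutes <- function(ticket_start, ticket_end, bin_start, bin_end) {
  lo <- pmax(as.numeric(ticket_start), as.numeric(bin_start))  # latest start
  hi <- pmin(as.numeric(ticket_end),   as.numeric(bin_end))    # earliest end
  pmax((hi - lo) / 60, 0)  # non-overlapping bins give negative values: clamp to 0
}

# Hypothetical convenience wrapper; UTC is an assumption.
utc <- function(x) as.POSIXct(x, tz = "UTC")

# Four hourly bins: 20:00, 21:00, 22:00 and 23:00, each one hour long.
bin_start <- seq(utc("2016-12-31 20:00:00"), utc("2016-12-31 23:00:00"), by = "1 hour")
bin_end   <- bin_start + 3600

# First ticket of the sample data (20:42 to 23:07).
long_ticket <- overlap_minutes(utc("2016-12-31 20:42:00"), utc("2016-12-31 23:07:00"),
                               bin_start, bin_end)
long_ticket   # 18 60 60 7

# Made-up short-term ticket lying entirely inside the 21:00 bin.
short_ticket <- overlap_minutes(utc("2016-12-31 21:10:00"), utc("2016-12-31 21:40:00"),
                                bin_start, bin_end)
short_ticket  # 0 30 0 0
```

In the foverlaps() version no clamping is needed, because `nomatch = 0L` already drops the pairs of ticket and bin that do not overlap.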
If the result should show the values for each group side by side, dcast() can be used to reshape the data from long to wide format:
foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
, active_minutes := as.integer(
difftime(pmin(end, i.end), pmax(start, i.start), units = "mins"))][
, dcast(.SD, i.start + i.end ~ paste0("group", group), sum, value.var = "active_minutes")]
               i.start               i.end group1 group2
1: 2016-12-31 16:00:00 2016-12-31 17:00:00      0      2
2: 2016-12-31 17:00:00 2016-12-31 18:00:00      0    101
3: 2016-12-31 18:00:00 2016-12-31 19:00:00      0    120
4: 2016-12-31 19:00:00 2016-12-31 20:00:00      0    120
5: 2016-12-31 20:00:00 2016-12-31 21:00:00     18    120
6: 2016-12-31 21:00:00 2016-12-31 22:00:00    118    139
7: 2016-12-31 22:00:00 2016-12-31 23:00:00    180    180
8: 2016-12-31 23:00:00 2017-01-01 00:00:00     26     27
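The OP has requested to count the number of tickets that are active at a given point in time.
This can be achieved with a non-equi join of the start and end dates against a regular sequence of fixed time points: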
library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")
# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]
# add id to each row (required to count the active tickets later)
timeseries[, rn := .I]
# print data for illustration
timeseries[order(group, start, end)]
                 start                 end group rn
1: 2016-12-31 20:42:00 2016-12-31 23:07:00     1  1
2: 2016-12-31 21:15:00 2016-12-31 23:08:00     1  3
3: 2016-12-31 21:47:00 2016-12-31 23:11:00     1  5
4: 2016-12-31 16:58:00 2016-12-31 23:11:00     2  6
5: 2016-12-31 17:19:00 2016-12-31 23:09:00     2  4
6: 2016-12-31 21:41:00 2016-12-31 23:07:00     2  2
# create sequence of hourly timepoints
hours_seq <- timeseries[, seq(lubridate::floor_date(min(start, end), "hour"),
lubridate::ceiling_date(max(start, end), "hour"),
by = "1 hour")]
hours_seq
[1] "2016-12-31 16:00:00 UTC" "2016-12-31 17:00:00 UTC" "2016-12-31 18:00:00 UTC" "2016-12-31 19:00:00 UTC"
[5] "2016-12-31 20:00:00 UTC" "2016-12-31 21:00:00 UTC" "2016-12-31 22:00:00 UTC" "2016-12-31 23:00:00 UTC"
[9] "2017-01-01 00:00:00 UTC"
# non-equi join
timeseries[.(hr = hours_seq), on = .(start <= hr, end > hr), nomatch = 0L,
allow.cartesian = TRUE][
# count number of active tickets at timepoint and by group
, .(n.active.tickets = uniqueN(rn)), keyby = .(group, timepoint = start)]
    group           timepoint n.active.tickets
 1:     1 2016-12-31 21:00:00                1
 2:     1 2016-12-31 22:00:00                3
 3:     1 2016-12-31 23:00:00                3
 4:     2 2016-12-31 17:00:00                1
 5:     2 2016-12-31 18:00:00                2
 6:     2 2016-12-31 19:00:00                2
 7:     2 2016-12-31 20:00:00                2
 8:     2 2016-12-31 21:00:00                2
 9:     2 2016-12-31 22:00:00                3
10:     2 2016-12-31 23:00:00                3
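For readers who want to check the join condition by hand, the same rule (a ticket is active at time point `hr` when `start <= hr` and `end > hr`) can be sketched in base R for the three group 1 tickets of the sample data. The names `utc()`, `tickets`, `timepoints` and `n_active` are assumptions made for this illustration, and the loop is far slower than the non-equi join on large data.

```r
# Hypothetical convenience wrapper; UTC is an assumption.
utc <- function(x) as.POSIXct(x, tz = "UTC")

# The three group 1 tickets from the sample data.
tickets <- data.frame(
  start = utc(c("2016-12-31 20:42:00", "2016-12-31 21:15:00", "2016-12-31 21:47:00")),
  end   = utc(c("2016-12-31 23:07:00", "2016-12-31 23:08:00", "2016-12-31 23:11:00"))
)

# Hourly time points from 20:00 to midnight.
timepoints <- seq(utc("2016-12-31 20:00:00"), utc("2017-01-01 00:00:00"), by = "1 hour")

# Count the tickets with start <= hr and end > hr at each time point.
n_active <- vapply(seq_along(timepoints), function(i) {
  hr <- timepoints[i]
  sum(tickets$start <= hr & tickets$end > hr)
}, integer(1))

data.frame(timepoint = timepoints, n_active = n_active)
# n_active is 0, 1, 3, 3, 0 for 20:00, 21:00, 22:00, 23:00 and 00:00
```

The counts at 21:00, 22:00 and 23:00 match the group 1 rows of the join result. This sketch also keeps the time points with zero active tickets, which the join drops because of `nomatch = 0L`.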