Aggregating time series intervals by hour

Question

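I have a dataset of parking tickets with start/end times and the location (group) where each ticket was purchased. I need to run a time series analysis to forecast when and where tickets will be bought in the future. To do that, I need to convert the data into time series format, where the value is the number of tickets valid at a given point in time.

My sample data: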

library(lubridate)
timeseries <- data.frame(start = c("2016-12-31 20:42:00",
                                   "2016-12-31 21:41:00",
                                   "2016-12-31 21:15:00",
                                   "2016-12-31 17:19:00",
                                   "2016-12-31 21:47:00",
                                   "2016-12-31 16:58:00"),
                         end = c("2016-12-31 23:07:00",
                                 "2016-12-31 23:07:00",
                                 "2016-12-31 23:08:00",
                                 "2016-12-31 23:09:00",
                                 "2016-12-31 23:11:00",
                                 "2016-12-31 23:11:00"),
                         group = c(1,2,1,2,1,2),
                         stringsAsFactors = FALSE)
timeseries$start <- as.POSIXlt(timeseries$start)
timeseries$end <- as.POSIXlt(timeseries$end)
timeseries$interval <- interval(timeseries$start, timeseries$end, tzone="UTC")

Example of the time slots I want to aggregate into (per group):

summary_hours <- data.frame(timeStart = c("2016-12-31 16:00",
                                          "2016-12-31 17:00",
                                          "2016-12-31 18:00",
                                          "2016-12-31 19:00",
                                          "2016-12-31 20:00",
                                          "2016-12-31 21:00",
                                          "2016-12-31 22:00",
                                          "2016-12-31 23:00"),
                            timeEnd = c("2016-12-31 17:00",
                                        "2016-12-31 18:00",
                                        "2016-12-31 19:00",
                                        "2016-12-31 20:00",
                                        "2016-12-31 21:00",
                                        "2016-12-31 22:00",
                                        "2016-12-31 23:00",
                                        "2017-01-01 00:00"))
summary_hours$timeStart <- as.POSIXlt(summary_hours$timeStart)
summary_hours$timeEnd <- as.POSIXlt(summary_hours$timeEnd)
summary_hours$interval <- interval(summary_hours$timeStart, summary_hours$timeEnd, tzone="UTC")

My current approach seems very inefficient when the dataset spans two years.

library("lubridate")
intersect_in_mins <- function(interval) {
  return(as.period(intersect(interval, summary_hours$interval), "minutes")@minute)
}

summary_hours$group1 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 1)$interval, intersect_in_mins))), na.rm = TRUE)
summary_hours$group2 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 2)$interval, intersect_in_mins))), na.rm = TRUE)

summary_hours
            timeStart             timeEnd                                         interval group1 group2
1 2016-12-31 16:00:00 2016-12-31 17:00:00 2016-12-31 16:00:00 UTC--2016-12-31 17:00:00 UTC      0      2
2 2016-12-31 17:00:00 2016-12-31 18:00:00 2016-12-31 17:00:00 UTC--2016-12-31 18:00:00 UTC      0    101
3 2016-12-31 18:00:00 2016-12-31 19:00:00 2016-12-31 18:00:00 UTC--2016-12-31 19:00:00 UTC      0    120
4 2016-12-31 19:00:00 2016-12-31 20:00:00 2016-12-31 19:00:00 UTC--2016-12-31 20:00:00 UTC      0    120
5 2016-12-31 20:00:00 2016-12-31 21:00:00 2016-12-31 20:00:00 UTC--2016-12-31 21:00:00 UTC     18    120
6 2016-12-31 21:00:00 2016-12-31 22:00:00 2016-12-31 21:00:00 UTC--2016-12-31 22:00:00 UTC    118    139
7 2016-12-31 22:00:00 2016-12-31 23:00:00 2016-12-31 22:00:00 UTC--2016-12-31 23:00:00 UTC    180    180
8 2016-12-31 23:00:00 2017-01-01 00:00:00 2016-12-31 23:00:00 UTC--2017-01-01 00:00:00 UTC     26     27

Do you have any suggestions for a nice library that can do this kind of magic automatically?

Answer 1

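In his comments here and here, the OP has changed the goal of the question. The request is now to aggregate the "minutes of active tickets" for each one-hour interval.

This requires a completely different approach, which justifies posting a separate answer, IMHO.

To check which tickets are active within a one-hour interval, the foverlaps() function from the data.table package can be used: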

library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")
# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]

# create sequence of intervals of one hour covering all given times
hours_seq <- timeseries[, {
  tmp <- seq(lubridate::floor_date(min(start, end), "hour"),
             lubridate::ceiling_date(max(start, end), "hour"), 
             by = "1 hour")
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
  }]
hours_seq

                 start                 end
1: 2016-12-31 16:00:00 2016-12-31 17:00:00
2: 2016-12-31 17:00:00 2016-12-31 18:00:00
3: 2016-12-31 18:00:00 2016-12-31 19:00:00
4: 2016-12-31 19:00:00 2016-12-31 20:00:00
5: 2016-12-31 20:00:00 2016-12-31 21:00:00
6: 2016-12-31 21:00:00 2016-12-31 22:00:00
7: 2016-12-31 22:00:00 2016-12-31 23:00:00
8: 2016-12-31 23:00:00 2017-01-01 00:00:00

# split up given ticket intervals in hour pieces 
foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
  # compute active minutes and aggregate
  , .(cnt_active_tickets = .N, 
      sum_active_minutes = sum(as.integer(
        difftime(pmin(end, i.end), pmax(start, i.start), units = "mins")))), 
    keyby = .(group, interval_start = i.start, interval_end = i.end)]

    group      interval_start        interval_end cnt_active_tickets sum_active_minutes
 1:     1 2016-12-31 20:00:00 2016-12-31 21:00:00                  1                 18
 2:     1 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                118
 3:     1 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
 4:     1 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 26
 5:     2 2016-12-31 16:00:00 2016-12-31 17:00:00                  1                  2
 6:     2 2016-12-31 17:00:00 2016-12-31 18:00:00                  2                101
 7:     2 2016-12-31 18:00:00 2016-12-31 19:00:00                  2                120
 8:     2 2016-12-31 19:00:00 2016-12-31 20:00:00                  2                120
 9:     2 2016-12-31 20:00:00 2016-12-31 21:00:00                  2                120
10:     2 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                139
11:     2 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
12:     2 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 27

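Note that this approach also accounts for "short-term parkers", i.e., tickets that are active for less than an hour, starting after one full hour and ending before the next full hour.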
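The overlap arithmetic used above (the earlier of the two ends minus the later of the two starts) can be checked in isolation. The following is a minimal base R sketch, not part of the answer's data.table code: `overlap_minutes` is a hypothetical helper, and the 21:10 to 21:40 ticket is made-up illustration data. Because the sketch compares a ticket against every bin rather than only the overlapping ones that `foverlaps()` returns, it clamps negative values to zero.

```r
# Hypothetical helper: minutes of overlap between one ticket and each hourly bin.
overlap_minutes <- function(t_start, t_end, bin_start, bin_end) {
  # overlap = earlier end minus later start; negative means "no overlap"
  mins <- as.numeric(difftime(pmin(bin_end, t_end),
                              pmax(bin_start, t_start),
                              units = "mins"))
  pmax(mins, 0)
}

# three hourly bins: 20:00-21:00, 21:00-22:00, 22:00-23:00
bins <- seq(as.POSIXct("2016-12-31 20:00:00", tz = "UTC"),
            as.POSIXct("2016-12-31 23:00:00", tz = "UTC"),
            by = "1 hour")
bin_start <- head(bins, -1L)
bin_end   <- tail(bins, -1L)

# made-up "short-term parker": starts and ends inside the 21:00-22:00 bin
short <- overlap_minutes(as.POSIXct("2016-12-31 21:10:00", tz = "UTC"),
                         as.POSIXct("2016-12-31 21:40:00", tz = "UTC"),
                         bin_start, bin_end)

# first ticket from the question (group 1): 20:42 to 23:07
long <- overlap_minutes(as.POSIXct("2016-12-31 20:42:00", tz = "UTC"),
                        as.POSIXct("2016-12-31 23:07:00", tz = "UTC"),
                        bin_start, bin_end)

short  # 0 30 0
long   # 18 60 60
```

The short ticket contributes its full 30 minutes to the single bin that contains it, and the long ticket's 18 minutes in the 20:00 bin agree with the group 1 row of the result above.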

Wide-format output

If the result should show the values for each group side by side, the data can be reshaped from long to wide format using dcast():

foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
  , active_minutes := as.integer(
    difftime(pmin(end, i.end), pmax(start, i.start), units = "mins"))][
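      # name the aggregation function and value column explicitly instead of relying on guessing
      , dcast(.SD, i.start + i.end ~ paste0("group", group),
              fun.aggregate = sum, value.var = "active_minutes")]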

               i.start               i.end group1 group2
1: 2016-12-31 16:00:00 2016-12-31 17:00:00      0      2
2: 2016-12-31 17:00:00 2016-12-31 18:00:00      0    101
3: 2016-12-31 18:00:00 2016-12-31 19:00:00      0    120
4: 2016-12-31 19:00:00 2016-12-31 20:00:00      0    120
5: 2016-12-31 20:00:00 2016-12-31 21:00:00     18    120
6: 2016-12-31 21:00:00 2016-12-31 22:00:00    118    139
7: 2016-12-31 22:00:00 2016-12-31 23:00:00    180    180
8: 2016-12-31 23:00:00 2017-01-01 00:00:00     26     27
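For readers without data.table at hand, the same long-to-wide step can be sketched in base R with `xtabs()`. This is an illustrative alternative, not the answer's method. The `long_result` data frame re-types the aggregated values from the long-format result, with the interval start shortened to `HH:MM` for brevity, and `group_label` is a helper column introduced here.

```r
# Long-format result re-typed by hand (interval starts shortened to HH:MM).
long_result <- data.frame(
  group = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  interval_start = c("20:00", "21:00", "22:00", "23:00",
                     "16:00", "17:00", "18:00", "19:00",
                     "20:00", "21:00", "22:00", "23:00"),
  sum_active_minutes = c(18, 118, 180, 26, 2, 101, 120, 120, 120, 139, 180, 27)
)

# helper column so the wide columns are named group1, group2
long_result$group_label <- paste0("group", long_result$group)

# xtabs() sums the left-hand side within each interval/group cell;
# combinations that do not occur (group 1 before 20:00) become 0
wide <- xtabs(sum_active_minutes ~ interval_start + group_label,
              data = long_result)
wide
```

As with `dcast()` and `sum`, hours in which a group has no active ticket show up as 0 rather than as missing values.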

Answer 2

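The OP has requested to count the number of tickets valid at a given point in time.

This can be achieved with a non-equi join of the start and end dates against a continuous sequence of fixed time points: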

library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")

# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]
# add id to each row (required to count the active tickets later)
timeseries[, rn := .I]
# print data for illustration
timeseries[order(group, start, end)]

                 start                 end group rn
1: 2016-12-31 20:42:00 2016-12-31 23:07:00     1  1
2: 2016-12-31 21:15:00 2016-12-31 23:08:00     1  3
3: 2016-12-31 21:47:00 2016-12-31 23:11:00     1  5
4: 2016-12-31 16:58:00 2016-12-31 23:11:00     2  6
5: 2016-12-31 17:19:00 2016-12-31 23:09:00     2  4
6: 2016-12-31 21:41:00 2016-12-31 23:07:00     2  2

# create sequence of hourly timepoints
hours_seq <- timeseries[, seq(lubridate::floor_date(min(start, end), "hour"),
                              lubridate::ceiling_date(max(start, end), "hour"), 
                              by = "1 hour")]
hours_seq

[1] "2016-12-31 16:00:00 UTC" "2016-12-31 17:00:00 UTC" "2016-12-31 18:00:00 UTC" "2016-12-31 19:00:00 UTC"
[5] "2016-12-31 20:00:00 UTC" "2016-12-31 21:00:00 UTC" "2016-12-31 22:00:00 UTC" "2016-12-31 23:00:00 UTC"
[9] "2017-01-01 00:00:00 UTC"

# non-equi join
timeseries[.(hr = hours_seq), on = .(start <= hr, end > hr), nomatch = 0L,
           allow.cartesian = TRUE][
             # count number of active tickets at timepoint and by group
             , .(n.active.tickets = uniqueN(rn)), keyby = .(group, timepoint = start)]

    group           timepoint n.active.tickets
 1:     1 2016-12-31 21:00:00                1
 2:     1 2016-12-31 22:00:00                3
 3:     1 2016-12-31 23:00:00                3
 4:     2 2016-12-31 17:00:00                1
 5:     2 2016-12-31 18:00:00                2
 6:     2 2016-12-31 19:00:00                2
 7:     2 2016-12-31 20:00:00                2
 8:     2 2016-12-31 21:00:00                2
 9:     2 2016-12-31 22:00:00                3
10:     2 2016-12-31 23:00:00                3
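The join condition `start <= hr, end > hr` can be cross-checked on the six sample tickets with a brute-force base R sketch. This is only an illustration of the condition: it compares every ticket against every time point, so it is unlikely to scale to two years of data the way the join does. The variable names `active` and `counts` are introduced here.

```r
tz <- "UTC"
# the six sample tickets from the question
start <- as.POSIXct(c("2016-12-31 20:42:00", "2016-12-31 21:41:00",
                      "2016-12-31 21:15:00", "2016-12-31 17:19:00",
                      "2016-12-31 21:47:00", "2016-12-31 16:58:00"), tz = tz)
end   <- as.POSIXct(c("2016-12-31 23:07:00", "2016-12-31 23:07:00",
                      "2016-12-31 23:08:00", "2016-12-31 23:09:00",
                      "2016-12-31 23:11:00", "2016-12-31 23:11:00"), tz = tz)
group <- c(1, 2, 1, 2, 1, 2)

# hourly time points from 16:00 to midnight
hours_seq <- seq(as.POSIXct("2016-12-31 16:00:00", tz = tz),
                 as.POSIXct("2017-01-01 00:00:00", tz = tz),
                 by = "1 hour")

# rows = tickets, columns = time points; TRUE where the ticket is active
active <- outer(seq_along(start), seq_along(hours_seq),
                function(i, j) start[i] <= hours_seq[j] & end[i] > hours_seq[j])

# add up the active tickets per group: one row per group, one column per time point
counts <- rowsum(active * 1L, group)
colnames(counts) <- format(hours_seq, "%H:%M")
counts
```

Unlike the join with `nomatch = 0L`, this matrix also keeps the time points at which no ticket is active, as zeros.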

Answer 1: score 3, accepted, 2017-09-20 17:20:38
Answer 2: score 2, 2017-09-19 14:58:43
