How to make continuous time sequences within groups in data.table?

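I have a data.table containing time series of hourly observations from different locations (sites). There are gaps (missing hours) in each sequence. I would like to fill out the sequence of hours for each site, so that each sequence has a row for every hour (albeit with missing data, NA).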

Example data:

library(data.table)
library(lubridate)

DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = ymd_h(c("2017080101", "2017080103", "2017080105",
                                "2017080103", "2017080105", "2017080107")),
                 x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 
                 key = c("site", "date"))
DT
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 03:00:00 1.2
# 3:    A 2017-08-01 05:00:00 1.3
# 4:    B 2017-08-01 03:00:00 2.1
# 5:    B 2017-08-01 05:00:00 2.2
# 6:    B 2017-08-01 07:00:00 2.3

The desired result DT2 would contain all hours between the first (minimum) and last (maximum) date for each site, with x missing where new rows are inserted:

#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
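
As a quick sanity check on the target shape, the number of rows each site should end up with can be computed from its own first and last timestamp. This is only a sketch: it rebuilds the example data with base R's as.POSIXct() instead of lubridate (an assumption made to keep the snippet dependency-light), and the names `expected` and `n_hours` are illustrative, not from the original post.

```r
library(data.table)

# Example data from the question, rebuilt without lubridate (assumption)
DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = as.POSIXct(c("2017080101", "2017080103", "2017080105",
                                     "2017080103", "2017080105", "2017080107"),
                                   format = "%Y%m%d%H", tz = "UTC"),
                 x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3),
                 key = c("site", "date"))

# Hours spanned per site, inclusive of both endpoints
expected <- DT[, .(n_hours = as.integer(
                     as.numeric(difftime(max(date), min(date), units = "hours"))
                   ) + 1L),
               by = site]
expected
```

For the example data this comes to five rows per site, which matches the ten rows in the desired DT2 above.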

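I tried joining DT with a date sequence constructed from min(date) to max(date). This is heading in the right direction, but the date range spans all sites rather than each site, the filled-in rows are missing the site, and the sort order (key) is wrong: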

DT[.(seq(from = min(date), to = max(date), by = "hour")),
    .SD, on="date"]
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:   NA 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    B 2017-08-01 03:00:00 2.1
# 5:   NA 2017-08-01 04:00:00  NA
# 6:    A 2017-08-01 05:00:00 1.3
# 7:    B 2017-08-01 05:00:00 2.2
# 8:   NA 2017-08-01 06:00:00  NA
# 9:    B 2017-08-01 07:00:00 2.3

So, naturally, I tried adding by = site:

DT[.(seq(from = min(date), to = max(date), by = "hour")),
   .SD, on="date", by=.(site)]
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 03:00:00 1.2
# 3:    A 2017-08-01 05:00:00 1.3
# 4:   NA                <NA>  NA
# 5:    B 2017-08-01 03:00:00 2.1
# 6:    B 2017-08-01 05:00:00 2.2
# 7:    B 2017-08-01 07:00:00 2.3

But this doesn't work either. Can anyone suggest the right data.table formulation to give the desired filled-in DT2 shown above?

library(data.table)
library(lubridate)  
setDT(DT)
# Build the full hourly sequence for each site separately
test <- DT[, .(date = seq(min(date), max(date), by = 'hour')), by = 'site']
DT <- merge(test, DT, by = c('site', 'date'), all.x = TRUE)


DT
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
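
The merge-based approach can also be checked in a self-contained form that does not overwrite DT. The sketch below rebuilds the example data with base R's as.POSIXct() rather than lubridate (an assumption), and `grid` and `filled` are illustrative names. As far as I know, merge() on data.tables with the default sort = TRUE returns a result keyed on the by columns, so the key is inspected at the end rather than assumed.

```r
library(data.table)

# Example data from the question, rebuilt without lubridate (assumption)
DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = as.POSIXct(c("2017080101", "2017080103", "2017080105",
                                     "2017080103", "2017080105", "2017080107"),
                                   format = "%Y%m%d%H", tz = "UTC"),
                 x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3),
                 key = c("site", "date"))

# One row per site and hour, spanning each site's own min..max
grid <- DT[, .(date = seq(min(date), max(date), by = "hour")), by = site]

# Left join: keep every grid row, fill x with NA where no observation exists
filled <- merge(grid, DT, by = c("site", "date"), all.x = TRUE)

key(filled)   # inspect whether the result is keyed
nrow(filled)  # total number of site-hour rows
```

Keeping the result in a separate object leaves the original DT available for comparison.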

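Thanks to Frank and Wen for putting me on the right track. I found a compact data.table solution. Like the input table, the result DT2 is keyed on site and date (which is desirable, although I did not ask for it in the OP). This is a restatement of Wen's solution in data.table syntax, which I assume would be slightly more efficient on large datasets.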

DT2 <- DT[setkey(DT[, .(date = seq(from = min(date), to = max(date), 
                         by = "hour")), by = site], site, date), ]
DT2
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT2)
# [1] "site" "date"

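Edit 1: As Frank mentioned, the on= syntax can also be used. The DT3 formulation below gives the right answer, but the DT3 result is not keyed, whereas the DT2 result is. This means an "extra" setkey() is needed if the result has to be keyed.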

DT3 <- DT[DT[, .(date = seq(from = min(date), to = max(date), 
                  by = "hour")), by = site], on = c("site", "date"), ]
DT3
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT3)
# NULL
all.equal(DT2, DT3)
# [1] "Datasets has different keys. 'target': site, date. 'current' has no key."
all.equal(DT2, DT3, check.attributes = FALSE)
# [1] TRUE

Is there a way to write the DT3 expression so that it gives a keyed result, other than explicitly using setkey()?

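Edit 2: Frank's comment suggests an additional formulation, DT4, using keyby = .EACHI. In this case .SD is inserted as j, which is required when using by or keyby. This gives the right answer, and the result is keyed like the DT2 formulation.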

DT4 <- DT[DT[, .(date = seq(from = min(date), to = max(date), by = "hour")), 
             by = site], on = c("site", "date"), .SD, keyby = .EACHI]
DT4
#    site                date   x
# 1:    A 2017-08-01 01:00:00 1.1
# 2:    A 2017-08-01 02:00:00  NA
# 3:    A 2017-08-01 03:00:00 1.2
# 4:    A 2017-08-01 04:00:00  NA
# 5:    A 2017-08-01 05:00:00 1.3
# 6:    B 2017-08-01 03:00:00 2.1
# 7:    B 2017-08-01 04:00:00  NA
# 8:    B 2017-08-01 05:00:00 2.2
# 9:    B 2017-08-01 06:00:00  NA
#10:    B 2017-08-01 07:00:00 2.3
key(DT4)
# [1] "site" "date"
identical(DT2, DT4)
# [1] TRUE
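
Once the series is padded, the inserted rows are easy to spot because x is NA there. The sketch below counts the missing hours per site. It uses the on= join followed by an explicit setkey(), and rebuilds the example data with base R's as.POSIXct() rather than lubridate (an assumption); `grid`, `padded` and `gaps` are illustrative names, not from the original post.

```r
library(data.table)

# Example data from the question, rebuilt without lubridate (assumption)
DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = as.POSIXct(c("2017080101", "2017080103", "2017080105",
                                     "2017080103", "2017080105", "2017080107"),
                                   format = "%Y%m%d%H", tz = "UTC"),
                 x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3),
                 key = c("site", "date"))

# Per-site hourly grid, then join the observations onto it
grid <- DT[, .(date = seq(min(date), max(date), by = "hour")), by = site]
padded <- DT[grid, on = c("site", "date")]
setkey(padded, site, date)  # explicit key, as in the Edit 1 discussion

# Number of hours with no observation, per site
gaps <- padded[, .(n_missing = sum(is.na(x))), by = site]
gaps
```

This assumes the original x values contain no NA of their own; otherwise those would be counted as gaps too.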
