R: Spreading time series data based on event time

Question

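I have a large time series dataset, and I currently loop over the data to turn the time series into events separated by time gaps. I'm looking for something smarter than iteration, because my data is large and this gets very slow. My starting dataframe looks similar to this simple one: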

structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), datetime = structure(c(1597203000, 
1597201200, 1597199400, 1597186800, 1597185000, 1597183200, 1597197600, 
1597195800, 1597194000, 1597181400, 1597179600, 1597177800, 1597192200, 
1597190400, 1597188600, 1597176000, 1597174200, 1597172400), class = c("POSIXct", 
"POSIXt"), tzone = ""), percent = c(0, 0, 2, 1, 0, 0, 0, 0, 3, 
4, 0, 0, 0, 0, 0, 5, 0, 0)), class = "data.frame", row.names = c(NA, 
-18L))

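The data is half-hourly, so if a Name has two consecutive half-hourly datetime values, I consider them part of the same event. I also allow some leniency: if the data doesn't show consecutive half-hourly values but does show consecutive hourly values, that's fine too. So the goal is to return a dataframe that looks like this: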

structure(list(Name = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), startdate = structure(c(1597203000, 
1597197600, 1597192200, 1597186800, 1597181400, 1597176000), class = c("POSIXct", 
"POSIXt"), tzone = ""), enddate = structure(c(1597199400, 1597194000, 
1597188600, 1597183200, 1597177800, 1597172400), class = c("POSIXct", 
"POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA, 
-6L))

Thanks in advance for any snazzy solutions, I really appreciate it!

Edit: The datetime values are not necessarily in the order listed.

Answer 1

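I'm not sure what your loop looks like, but if you use the code below you can push the loop back to a late step, which should at least make things run a bit faster.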

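library(dplyr)

# Sort by Name and time, then compute the gap to the next reading within each Name
df = with(df, df[order(Name, datetime),]) %>% 
         group_by(Name) %>%
         mutate(dftime = as.numeric(difftime(lead(datetime), datetime, units = "mins"))) %>%
         ungroup() %>%
         mutate(eventnum = 0)

j = 1
for(i in seq_len(nrow(df))){
  df$eventnum[i] = j
  # a gap over 60 minutes (accounting for your consecutive hours comment),
  # or the last row for a Name (NA gap), closes the current event
  if(is.na(df$dftime[i]) || df$dftime[i] > 60){
    j = j + 1
  }
}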

You can then use a summarise setup, like the one in akrun's answer shared here, as follows:

df_lengths = df %>% group_by(eventnum, Name) %>% 
                     summarise(startdate = first(datetime), enddate = last(datetime)) %>% 
                     ungroup %>% select(-eventnum)

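But this is only a better answer if your loop currently runs earlier in the data organisation, for example if you loop through the time-difference calculation as well as the gap check.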
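If the remaining loop is still too slow, the event numbering can probably be vectorised with a cumulative sum. The following is only a sketch under assumptions: the sample rows are made up, the names `gap` and `events` are introduced here for illustration, and `.groups = "drop"` assumes dplyr 1.0.0 or later. Unlike the target output in the question, it reports the earliest time as startdate and the latest as enddate.

```r
library(dplyr)

# Made-up sample in the same shape as the question's data (assumption)
df <- data.frame(
  Name = c("a", "a", "a", "a", "b", "b", "b"),
  datetime = as.POSIXct(
    c("2020-08-12 01:00:00", "2020-08-12 01:30:00", "2020-08-12 02:30:00",
      "2020-08-12 06:00:00", "2020-08-12 01:00:00", "2020-08-12 04:00:00",
      "2020-08-12 04:30:00"),
    tz = "UTC"
  )
)

events <- df %>%
  arrange(Name, datetime) %>%
  group_by(Name) %>%
  mutate(
    # minutes since the previous reading for the same Name (NA on the first row)
    gap = as.numeric(difftime(datetime, lag(datetime), units = "mins")),
    # a new event starts on the first row of a Name or after a gap over 60 minutes
    eventnum = cumsum(is.na(gap) | gap > 60)
  ) %>%
  group_by(Name, eventnum) %>%
  summarise(startdate = min(datetime), enddate = max(datetime), .groups = "drop") %>%
  select(-eventnum)

print(events)
```

Because the data is sorted inside the pipeline, this does not depend on the original row order, which matters given the edit to the question.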

Answer 2

Create a grouping variable with rleid (from data.table) on the 'Name' column, then summarise the 'datetime' column by returning its first and last elements as two columns.

library(data.table)
library(dplyr)
df1 %>%
   group_by(grp = rleid(Name), Name) %>% 
   summarise(startdate = first(datetime), enddate = last(datetime)) %>%
   ungroup %>%
   select(-grp)
# A tibble: 6 x 3
#  Name  startdate           enddate            
#  <fct> <dttm>              <dttm>             
#1 a     2020-08-11 22:30:00 2020-08-11 21:30:00
#2 b     2020-08-11 21:00:00 2020-08-11 20:00:00
#3 c     2020-08-11 19:30:00 2020-08-11 18:30:00
#4 a     2020-08-11 18:00:00 2020-08-11 17:00:00
#5 b     2020-08-11 16:30:00 2020-08-11 15:30:00
#6 c     2020-08-11 15:00:00 2020-08-11 14:00:00
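
Note that rleid numbers runs of consecutive identical values, so the grouping above depends on the row order and does not look at the time gaps themselves. A small sketch with made-up rows (the names `dt`, `ids`, `run_unsorted` and `run_sorted` are introduced here only for illustration) shows how the ids change with ordering:

```r
library(data.table)

# rleid() gives each run of consecutive identical values its own id
ids <- rleid(c("a", "a", "b", "b", "a"))
print(ids)
# [1] 1 1 2 2 3

# Made-up rows where the Names are interleaved (assumption, not the question's data)
dt <- data.table(
  Name = c("a", "b", "a", "b"),
  datetime = as.POSIXct("2020-08-12 00:00:00", tz = "UTC") + c(0, 0, 1800, 1800)
)

# Interleaved rows: every row is its own run, so every row becomes its own group
run_unsorted <- rleid(dt$Name)

# After sorting by Name and time there is one run per Name,
# so rleid(Name) alone can no longer separate events within a Name
setorder(dt, Name, datetime)
run_sorted <- rleid(dt$Name)
```

So this approach fits data that already arrives in event-sized blocks, as in the sample; for unordered data, a gap-based grouping is likely needed as well.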

Solution 1
1 (accepted) 2020-08-19 14:55:24

Solution 2
-1 2020-08-12 20:19:34
