How to create a bucket of timestamps with a gap of X minutes in R from given timestamp values (example given)?

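I have a table mydf with a time_stamps column and a device column. As long as the difference between 2 consecutive time_stamps is 30 minutes or less, I want to keep merging the time_stamps into one visit. The first time_stamp of a visit is labelled start_timestamp; when the gap exceeds 30 minutes, I end that visit and label its last time_stamp as the end timestamp, as in the example below.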

df<-data.frame(customer=rep("XYZ",4),device=rep("x",4),time_stamps=c("2020-05-13 07:50:06","2020-05-13 07:55:06","2020-05-13 08:05:06","2020-05-13 08:50:06"))
df1<-data.frame(customer=rep("XYZ",3),device=rep("y",3),time_stamps=c("2020-05-14 07:50:06","2020-05-14 08:15:06","2020-05-14 08:25:06"))
df2<-data.frame(customer=rep("XYZ",1),device=rep("z",1),time_stamps=c("2020-05-16 09:50:06"))
df3<-data.frame(customer=rep("XYZ",2),device=rep("a",2),time_stamps=c("2020-05-16 09:50:06","2020-05-16 19:50:06"))
df4<-data.frame(customer=rep("XYZ",2),device=rep("b",2),time_stamps=c("2020-05-17 09:50:06","2020-05-17 10:15:06"))
df5<-data.frame(customer=rep("XYZ",4),device=rep("c",4),time_stamps=c("2020-05-13 07:50:06","2020-05-13 07:55:06","2020-05-13 08:05:06","2020-05-13 08:32:06"))

mydf<-rbind(df,df1,df2,df3,df4,df5)

This is my expected data frame:

expected_df<-data.frame(customer=rep("XYZ",8),device=c("x","x","y","z","a","a","b","c"),
        start_timestamp=c("2020-05-13 07:50:06","2020-05-13 08:50:06","2020-05-14 07:50:06","2020-05-16 09:50:06","2020-05-16 09:50:06","2020-05-16 19:50:06","2020-05-17 09:50:06","2020-05-13 07:50:06"),
        end_startstamp=c("2020-05-13 08:05:06","2020-05-13 08:50:06","2020-05-14 08:25:06","2020-05-16 09:50:06","2020-05-16 09:50:06","2020-05-16 19:50:06","2020-05-17 10:15:06","2020-05-13 08:32:06"))

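The key is to create groups that we can group_by. To do that, we flag the records that fall within 30 * 60 seconds of the previous record for the same customer and device, then use rle to number the runs of flags: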

library(dplyr)

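mydf %>% 
  group_by(customer, device) %>% 
  # Parse the strings, then take the gap to the previous row (0 for the first row).
  mutate(time_stamps = as.POSIXct(time_stamps),
         diff = time_stamps - lag(time_stamps, default = first(time_stamps)),
         # TRUE when the row is within 30 minutes of the previous one.
         same_group_as_lag = diff <= 30*60,
         # Number each run of identical flags within the device.
         group = with(rle(same_group_as_lag), rep(seq_along(lengths), lengths)))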
#> # A tibble: 16 x 6
#> # Groups:   customer, device [6]
#>    customer device time_stamps         diff       same_group_as_lag  group
#>    <fct>    <fct>  <dttm>              <drtn>     <lgl>              <int>
#>  1 XYZ      x      2020-05-13 07:50:06     0 secs TRUE                   1
#>  2 XYZ      x      2020-05-13 07:55:06   300 secs TRUE                   1
#>  3 XYZ      x      2020-05-13 08:05:06   600 secs TRUE                   1
#>  4 XYZ      x      2020-05-13 08:50:06  2700 secs FALSE                  2
#>  5 XYZ      y      2020-05-14 07:50:06     0 secs TRUE                   1
#>  6 XYZ      y      2020-05-14 08:15:06  1500 secs TRUE                   1
#>  7 XYZ      y      2020-05-14 08:25:06   600 secs TRUE                   1
#>  8 XYZ      z      2020-05-16 09:50:06     0 secs TRUE                   1
#>  9 XYZ      a      2020-05-16 09:50:06     0 secs TRUE                   1
#> 10 XYZ      a      2020-05-16 19:50:06 36000 secs FALSE                  2
#> 11 XYZ      b      2020-05-17 09:50:06     0 secs TRUE                   1
#> 12 XYZ      b      2020-05-17 10:15:06  1500 secs TRUE                   1
#> 13 XYZ      c      2020-05-13 07:50:06     0 secs TRUE                   1
#> 14 XYZ      c      2020-05-13 07:55:06   300 secs TRUE                   1
#> 15 XYZ      c      2020-05-13 08:05:06   600 secs TRUE                   1
#> 16 XYZ      c      2020-05-13 08:32:06  1620 secs TRUE                   1
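
One thing to watch with the rle step: it numbers runs of identical flag values, so two isolated readings in a row (two consecutive FALSE flags) end up in the same group, and a reading that follows a FALSE within 30 minutes gets a group of its own. The sample data never hits that case, so the output above is unaffected. Below is a minimal sketch of the difference, using made-up timestamps (not from the question) and a cumsum counter as a swapped-in alternative; the names ts, gap_secs, within_gap, group_rle and group_cumsum are only for illustration, and tz = "UTC" is an assumption.

```r
# Hypothetical readings for one device: two gaps longer than 30 minutes
# in a row, followed by a reading that arrives within 30 minutes.
ts <- as.POSIXct(c("2020-05-13 07:50:06", "2020-05-13 07:55:06",
                   "2020-05-13 09:00:06", "2020-05-13 10:30:06",
                   "2020-05-13 10:40:06"), tz = "UTC")

# Gap to the previous reading in seconds; the first reading gets 0.
gap_secs <- c(0, diff(as.numeric(ts)))
within_gap <- gap_secs <= 30 * 60

# Run-length numbering of the flags, as in the pipeline above.
r <- rle(within_gap)
group_rle <- rep(seq_along(r$lengths), r$lengths)

# Alternative: start a new visit at every gap longer than 30 minutes.
group_cumsum <- cumsum(!within_gap) + 1L

data.frame(ts, gap_secs, within_gap, group_rle, group_cumsum)
```

Here group_rle puts the 09:00:06 and 10:30:06 readings together even though they are 90 minutes apart, while group_cumsum keeps 09:00:06 on its own and joins 10:30:06 with 10:40:06. Inside the mutate call the equivalent would be group = cumsum(!same_group_as_lag) + 1L.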

Then just summarise:

mydf %>% 
  group_by(customer, device) %>% 
  mutate(time_stamps = as.POSIXct(time_stamps),
         diff = time_stamps - lag(time_stamps, default = first(time_stamps)),
         same_group_as_lag = diff <= 30*60,
         group = with(rle(same_group_as_lag), rep(seq_along(lengths), lengths))) %>% 
  # Keep customer/device and add group on top (.add replaces the deprecated add in dplyr >= 1.0.0).
  group_by(group, .add = TRUE) %>% 
  # First and last timestamp of each visit.
  summarise(start_timestamp = min(time_stamps),
            end_startstamp = max(time_stamps))
#> # A tibble: 8 x 5
#> # Groups:   customer, device [6]
#>   customer device group start_timestamp     end_startstamp     
#>   <fct>    <fct>  <int> <dttm>              <dttm>             
#> 1 XYZ      x          1 2020-05-13 07:50:06 2020-05-13 08:05:06
#> 2 XYZ      x          2 2020-05-13 08:50:06 2020-05-13 08:50:06
#> 3 XYZ      y          1 2020-05-14 07:50:06 2020-05-14 08:25:06
#> 4 XYZ      z          1 2020-05-16 09:50:06 2020-05-16 09:50:06
#> 5 XYZ      a          1 2020-05-16 09:50:06 2020-05-16 09:50:06
#> 6 XYZ      a          2 2020-05-16 19:50:06 2020-05-16 19:50:06
#> 7 XYZ      b          1 2020-05-17 09:50:06 2020-05-17 10:15:06
#> 8 XYZ      c          1 2020-05-13 07:50:06 2020-05-13 08:32:06

Created on 2020-06-25 by the reprex package (v0.3.0)
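
Since the title asks for a gap of X minutes, the same idea can be wrapped in a function with the gap as a parameter. This is only a rough base R sketch, not the method from the answer above: bucket_visits, gap_mins and the end_timestamp column name are hypothetical, it uses a cumsum counter instead of rle, and it assumes the timestamps can be read as UTC.

```r
# Hypothetical helper: collapse readings into visits per customer/device,
# starting a new visit whenever the gap to the previous reading is
# longer than `gap_mins` minutes.
bucket_visits <- function(d, gap_mins = 30) {
  # Assumption: timestamps are "YYYY-MM-DD HH:MM:SS" strings read as UTC.
  d$time_stamps <- as.POSIXct(as.character(d$time_stamps), tz = "UTC")
  d <- d[order(d$customer, d$device, d$time_stamps), ]
  key <- interaction(d$customer, d$device, drop = TRUE)

  # Visit counter within each customer/device pair.
  d$group <- ave(as.numeric(d$time_stamps), key, FUN = function(s) {
    cumsum(c(0, diff(s)) > gap_mins * 60) + 1
  })

  # One row per visit with its first and last timestamp.
  pieces <- split(d, list(key, d$group), drop = TRUE)
  out <- do.call(rbind, lapply(pieces, function(p) {
    data.frame(customer = p$customer[1], device = p$device[1],
               start_timestamp = min(p$time_stamps),
               end_timestamp = max(p$time_stamps))
  }))
  out <- out[order(out$customer, out$device, out$start_timestamp), ]
  rownames(out) <- NULL
  out
}

# Usage on the device "x" rows from the question.
sample_df <- data.frame(
  customer = "XYZ", device = "x",
  time_stamps = c("2020-05-13 07:50:06", "2020-05-13 07:55:06",
                  "2020-05-13 08:05:06", "2020-05-13 08:50:06"))
bucket_visits(sample_df, gap_mins = 30)
```

With gap_mins = 30 the four sample rows should collapse into two visits, matching the two device x rows in the summary above; a larger gap such as 60 would merge them into one.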

