How to create a bucket of timestamps with a gap of X minutes in R from given timestamp values (example given)?
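I have a table mydf with a time_stamps column recorded per device. As long as the difference between 2 consecutive time_stamps is 30 minutes or less, I want to keep merging the time_stamps. The first time_stamp is marked as start_timestamp, and once the gap exceeds 30 minutes I end that visit and record its last time stamp as end_timestamps, as in the example given below: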
df<-data.frame(customer=rep("XYZ",4),device=rep("x",4),time_stamps=c("2020-05-13 07:50:06","2020-05-13 07:55:06","2020-05-13 08:05:06","2020-05-13 08:50:06"))
df1<-data.frame(customer=rep("XYZ",3),device=rep("y",3),time_stamps=c("2020-05-14 07:50:06","2020-05-14 08:15:06","2020-05-14 08:25:06"))
df2<-data.frame(customer=rep("XYZ",1),device=rep("z",1),time_stamps=c("2020-05-16 09:50:06"))
df3<-data.frame(customer=rep("XYZ",2),device=rep("a",2),time_stamps=c("2020-05-16 09:50:06","2020-05-16 19:50:06"))
df4<-data.frame(customer=rep("XYZ",2),device=rep("b",2),time_stamps=c("2020-05-17 09:50:06","2020-05-17 10:15:06"))
df5<-data.frame(customer=rep("XYZ",4),device=rep("c",4),time_stamps=c("2020-05-13 07:50:06","2020-05-13 07:55:06","2020-05-13 08:05:06","2020-05-13 08:32:06"))
mydf<-rbind(df,df1,df2,df3,df4,df5)
This is my expected data frame:
expected_df<-data.frame(customer=rep("XYZ",8),device=c("x","x","y","z","a","a","b","c"),
start_timestamp=c("2020-05-13 07:50:06","2020-05-13 08:50:06","2020-05-14 07:50:06","2020-05-16 09:50:06","2020-05-16 09:50:06","2020-05-16 19:50:06","2020-05-17 09:50:06","2020-05-13 07:50:06"),
end_startstamp=c("2020-05-13 08:05:06","2020-05-13 08:50:06","2020-05-14 08:25:06","2020-05-16 09:50:06","2020-05-16 09:50:06","2020-05-16 19:50:06","2020-05-17 10:15:06","2020-05-13 08:32:06"))
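The key is to create a group that we can group_by. To do that, we flag the records that are within 30 * 60 seconds of the previous record, then use rle to merge them: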
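library(dplyr)

mydf %>%
  group_by(customer, device) %>%
  mutate(time_stamps = as.POSIXct(time_stamps),
         # gap to the previous record of the same device (0 for the first one)
         diff = time_stamps - lag(time_stamps, default = first(time_stamps)),
         # TRUE when the record is within 30 minutes of the previous one
         same_group_as_lag = diff <= 30*60,
         # number the runs of identical flags
         group = with(rle(same_group_as_lag), rep(seq_along(lengths), lengths)))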
#> # A tibble: 16 x 6
#> # Groups:   customer, device [6]
#>    customer device time_stamps         diff       same_group_as_lag group
#>    <fct>    <fct>  <dttm>              <drtn>     <lgl>             <int>
#>  1 XYZ      x      2020-05-13 07:50:06     0 secs TRUE                  1
#>  2 XYZ      x      2020-05-13 07:55:06   300 secs TRUE                  1
#>  3 XYZ      x      2020-05-13 08:05:06   600 secs TRUE                  1
#>  4 XYZ      x      2020-05-13 08:50:06  2700 secs FALSE                 2
#>  5 XYZ      y      2020-05-14 07:50:06     0 secs TRUE                  1
#>  6 XYZ      y      2020-05-14 08:15:06  1500 secs TRUE                  1
#>  7 XYZ      y      2020-05-14 08:25:06   600 secs TRUE                  1
#>  8 XYZ      z      2020-05-16 09:50:06     0 secs TRUE                  1
#>  9 XYZ      a      2020-05-16 09:50:06     0 secs TRUE                  1
#> 10 XYZ      a      2020-05-16 19:50:06 36000 secs FALSE                 2
#> 11 XYZ      b      2020-05-17 09:50:06     0 secs TRUE                  1
#> 12 XYZ      b      2020-05-17 10:15:06  1500 secs TRUE                  1
#> 13 XYZ      c      2020-05-13 07:50:06     0 secs TRUE                  1
#> 14 XYZ      c      2020-05-13 07:55:06   300 secs TRUE                  1
#> 15 XYZ      c      2020-05-13 08:05:06   600 secs TRUE                  1
#> 16 XYZ      c      2020-05-13 08:32:06  1620 secs TRUE                  1
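To see what the rle step does on its own, here is a small base R sketch. The vector flags is made up for illustration and mirrors the same_group_as_lag values of device x; it is not part of the original answer.

```r
# Hypothetical flags for one device:
# TRUE = record is within 30 minutes of the previous one
flags <- c(TRUE, TRUE, TRUE, FALSE)

# rle() compresses the vector into runs of identical values
runs <- rle(flags)
runs$lengths  # 3 1
runs$values   # TRUE FALSE

# Give every run its own id and repeat the id once per record in the run
group <- rep(seq_along(runs$lengths), runs$lengths)
group         # 1 1 1 2
```

Each run of identical flags becomes one id, which is what the group column above contains.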
Then just summarise:
mydf %>%
  group_by(customer, device) %>%
  mutate(time_stamps = as.POSIXct(time_stamps),
         diff = time_stamps - lag(time_stamps, default = first(time_stamps)),
         same_group_as_lag = diff <= 30*60,
         group = with(rle(same_group_as_lag), rep(seq_along(lengths), lengths))) %>%
  # keep customer/device and add group on top
  # (.add needs dplyr >= 1.0.0; older versions spell it add = TRUE)
  group_by(group, .add = TRUE) %>%
  summarise(start_timestamp = min(time_stamps),
            end_startstamp = max(time_stamps))
#> # A tibble: 8 x 5
#> # Groups:   customer, device [6]
#>   customer device group start_timestamp     end_startstamp
#>   <fct>    <fct>  <int> <dttm>              <dttm>
#> 1 XYZ      x          1 2020-05-13 07:50:06 2020-05-13 08:05:06
#> 2 XYZ      x          2 2020-05-13 08:50:06 2020-05-13 08:50:06
#> 3 XYZ      y          1 2020-05-14 07:50:06 2020-05-14 08:25:06
#> 4 XYZ      z          1 2020-05-16 09:50:06 2020-05-16 09:50:06
#> 5 XYZ      a          1 2020-05-16 09:50:06 2020-05-16 09:50:06
#> 6 XYZ      a          2 2020-05-16 19:50:06 2020-05-16 19:50:06
#> 7 XYZ      b          1 2020-05-17 09:50:06 2020-05-17 10:15:06
#> 8 XYZ      c          1 2020-05-13 07:50:06 2020-05-13 08:32:06
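One caveat, offered as a sketch and not as part of the original answer: the rle-based ids change whenever the flag changes, so a visit that starts after a long gap and then continues (FALSE followed by TRUE) would be split in two, and two back-to-back long gaps (FALSE, FALSE) would be merged into one visit. Counting the long gaps with cumsum should avoid both. The timestamps below are made up for illustration, and the names ts, gap, within_30, rle_group and cumsum_group are hypothetical.

```r
# Made-up timestamps for a single device: two visits of two records each
ts <- as.POSIXct(c("2020-05-13 07:50:06", "2020-05-13 07:55:06",
                   "2020-05-13 08:50:06", "2020-05-13 08:55:06"), tz = "UTC")

# Gap to the previous record in seconds (0 for the first record)
gap <- c(0, diff(as.numeric(ts)))
within_30 <- gap <= 30 * 60

# rle-based ids, built the same way as the group column in the answer
rle_group <- with(rle(within_30), rep(seq_along(lengths), lengths))

# cumsum-based ids: bump the counter at every gap longer than 30 minutes
cumsum_group <- cumsum(!within_30) + 1L

rle_group     # 1 1 2 3  (second visit is split)
cumsum_group  # 1 1 2 2
```

In the dplyr pipeline this would amount to writing group = cumsum(!same_group_as_lag) + 1L. On the sample data from the question both versions give the same ids, because every long gap there falls on the last record of its device.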
Created on 2020-06-25 by the reprex package (v0.3.0)