(R, dplyr) How to aggregate-window data where rows must be conditionally included?

I've googled around and haven't found anything similar, but I'm hoping someone else has already done what I'm trying to do.

  1. I have a set of data with timestamps.

  2. I need a running cumulative count of transactions per second, calculated as a true rolling one-second window. It would be nice to just truncate or round to the nearest second, but that won't be enough for my use case.

#Timestamp    Current TPS
00:00:00.1    1
00:00:00.2    2
00:00:00.3    3
00:00:00.4    4
00:00:00.5    5
00:00:00.6    6
00:00:00.7    7
00:00:00.8    8
00:00:00.9    9
00:00:01.0    10    (10 TPS here)
00:00:01.1    10
00:00:01.2    10    (still 10 TPS here)
00:00:01.4    9     (only 9 here, because no event at 00:00:01.3)
00:00:01.5    9
00:00:01.5    10
00:00:01.8    8
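
To pin down what the table asks for, here is a minimal brute-force sketch in base R. It assumes the timestamps are stored as whole tenths of a second (the name `ts_tenths` and that encoding are illustrative choices, not part of the question) and that each row counts itself plus earlier rows less than one second older.

```r
# Event times as whole tenths of a second (an assumed encoding; integers
# avoid floating-point surprises at the window edge).
ts_tenths <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 15, 18)
window <- 10L  # one second, expressed in tenths

# For row i, count the rows 1..i whose timestamp is strictly later than
# ts_tenths[i] - window, i.e. the half-open window (t - 1s, t].
tps <- vapply(seq_along(ts_tenths), function(i) {
  sum(ts_tenths[seq_len(i)] > ts_tenths[i] - window)
}, integer(1))

tps
```

Because only rows up to `i` are considered, the two events at 00:00:01.5 get 9 and 10 respectively, as in the table. This is quadratic in the number of rows, so it is meant as a reference definition rather than something to run on large data.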

Initially, I was planning to calculate the time difference between consecutive rows, but that doesn't answer the question of which rows should be included in or excluded from the aggregate window.
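
One way to decide which rows belong to the window, sketched here under the same assumed encoding (timestamps as whole tenths of a second, sorted ascending), is to keep a moving start index instead of looking at row-to-row gaps. The variable names are illustrative.

```r
# Assumed input: sorted event times in whole tenths of a second.
ts_tenths <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 15, 18)
window <- 10L  # one second, expressed in tenths

n <- length(ts_tenths)
tps <- integer(n)
start <- 1L  # index of the oldest row still inside the window

for (i in seq_len(n)) {
  # Advance past rows that are a full window (or more) older than row i.
  # The loop always stops at or before i, since row i is inside its own window.
  while (ts_tenths[start] <= ts_tenths[i] - window) {
    start <- start + 1L
  }
  tps[i] <- i - start + 1L  # rows start..i are inside the window
}

tps
```

Each row is visited by `start` at most once, so this runs in linear time. Rows `start` through `i` are exactly the included rows, which is the piece of information the pairwise differences don't give directly.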

This morning, I thought about mutating a new column that holds just the subsecond portion of the time. Then I would subtract that new column from the time column and cumsum the result inside a second if_else mutate that looks back over the last X rows.

Does that sound reasonable? Have I overlooked some other, better approach?

library(dplyr)

timestamps <- c("00:00:00.1", "00:00:00.2", "00:00:00.3", "00:00:00.4", "00:00:00.5", "00:00:00.6", "00:00:00.7", "00:00:00.8", "00:00:00.9", "00:00:01.0", "00:00:01.1", "00:00:01.2", "00:00:01.4", "00:00:01.5", "00:00:01.5", "00:00:01.8") %>%
  lubridate::hms %>%     # convert to a time period in hours minutes seconds
  as.numeric  # convert that to a number of seconds

slider::slide_index_dbl(timestamps,
                        timestamps,
                        ~length(.x),    # = how many timestamps are in the window
                        .before = .99)  # Note: using 1 here gave me an incorrect result,
                        # presumably due to floating-point arithmetic errors
                        # (the window seems to include its lower bound, so rounding
                        # decides whether an event exactly 1 s earlier is counted)
                        # https://en.wikipedia.org/wiki/Floating-point_error_mitigation
[1]  1  2  3  4  5  6  7  8  9 10 10 10  9 10 10  8
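
Since the question mentions dplyr, here is a hedged alternative sketch that avoids the `.99` workaround by storing the timestamps as whole tenths of a second and counting with base R's `findInterval()` inside `mutate()`. The data frame `events`, the column `ts_tenths`, and the integer encoding are assumptions made for illustration; the timestamps must be sorted ascending.

```r
library(dplyr)

# Assumed input: sorted event times in whole tenths of a second.
events <- tibble(
  ts_tenths = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 15, 18)
)
window <- 10L  # one second, expressed in tenths

events <- events %>%
  mutate(
    # findInterval(x, v) gives, for each x, how many elements of the sorted
    # vector v are <= x. The difference is the count in (t - window, t].
    tps = findInterval(ts_tenths, ts_tenths) -
      findInterval(ts_tenths - window, ts_tenths)
  )

events$tps
```

Like the slider call above, this counts both events at 00:00:01.5 for each of the two rows, so both get 10, whereas the table in the question shows 9 and then 10 for that pair.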
