R：计算时间间隔内的行数

Question

let's assume the following dataframe:让我们假设以下数据框：

df <- tibble(ID = c(12, 12, 12, 13, 13, 13),
         times = c(as.POSIXct("2021-01-02 10:00:00"),
                   as.POSIXct("2021-01-02 11:00:00"),
                   as.POSIXct("2021-01-02 13:00:00"),
                   as.POSIXct("2021-01-02 13:00:00"),
                   as.POSIXct("2021-01-02 14:00:00"),
                   as.POSIXct("2021-01-02 15:00:00")))
        ID times              
  <dbl> <dttm>             
1    12 2021-01-02 10:00:00
2    12 2021-01-02 11:00:00
3    12 2021-01-02 13:00:00
4    13 2021-01-02 13:00:00
5    13 2021-01-02 14:00:00
6    13 2021-01-02 15:00:00

What I want is a column, that considers each timestamp of an ID as start value and computes the number of subsequent observation with the next 2h.我想要的是一个列，它将 ID 的每个时间戳视为起始值，并计算接下来 2 小时的后续观察次数。 So this is my goal:所以这是我的目标：

     ID times               n_obs_within_2h
  <dbl> <dttm>                        <dbl>
1    12 2021-01-02 10:00:00               2
2    12 2021-01-02 11:00:00               2
3    12 2021-01-02 13:00:00               1
4    13 2021-01-02 13:00:00               3
5    13 2021-01-02 14:00:00               2
6    13 2021-01-02 15:00:00               1

I know that this could be easily done with purrr::map by iterating over each row.我知道这可以通过purrr::map通过遍历每一行轻松完成。 However, my original dataset it quite big which makes it fairly unefficient to do so.但是，我的原始数据集非常大，因此这样做效率很低。 Can you think of another way than iterating over each row to achieve the computing n_obs_within_2h除了遍历每一行来实现计算n_obs_within_2h之外，你能想出另一种方法吗

EDIT: my current attempt:编辑：我目前的尝试：

df %>% group_by(ID) %>% 
  mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times), 
                                         .f = function(i, t, data) {
                                           n <- data %>%
                                             filter(ID == i) %>%
                                             filter(between(as.double.difftime(times-t, units = "hours"),
                                                            0, 2)) %>%
                                             nrow()
                                           return(n)
                                         }, data = .))

Answer 1

Maybe a vectorised approach using a sliding window to count subsequent observations within the next 2 hours?也许是使用滑动窗口计算未来 2 小时内的后续观察的矢量化方法？

library(tidyverse)
library(lubridate)
library(slider)

df <- tibble(
  ID = c(12, 12, 12, 13, 13, 13),
  times = c(
    as.POSIXct("2021-01-02 10:00:00"),
    as.POSIXct("2021-01-02 11:00:00"),
    as.POSIXct("2021-01-02 13:00:00"),
    as.POSIXct("2021-01-02 13:00:00"),
    as.POSIXct("2021-01-02 14:00:00"),
    as.POSIXct("2021-01-02 15:00:00")
  )
)

df |>
  group_by(ID) |>
  mutate(
    diff = difftime(times, min(times), units = "hours"),
    within_2 = if_else(diff <= 2, 1, 0),
    n_obs_within_2h = slide_dbl(within_2, sum, .after = Inf)
  ) |> 
  ungroup()

#> # A tibble: 6 × 5
#>      ID times               diff    within_2 n_obs_within_2h
#>   <dbl> <dttm>              <drtn>     <dbl>           <dbl>
#> 1    12 2021-01-02 10:00:00 0 hours        1               2
#> 2    12 2021-01-02 11:00:00 1 hours        1               1
#> 3    12 2021-01-02 13:00:00 3 hours        0               0
#> 4    13 2021-01-02 13:00:00 0 hours        1               3
#> 5    13 2021-01-02 14:00:00 1 hours        1               2
#> 6    13 2021-01-02 15:00:00 2 hours        1               1

^{Created on 2022-06-30 by the reprex package (v2.0.1)}^{由reprex 包于 2022-06-30 创建 (v2.0.1)}

Answer 2

Using another approach within the mapping is probably more likely to be the key to better performance in this case.在这种情况下，在映射中使用另一种方法可能更可能是提高性能的关键。 Instead of using filtering on the full data, we could utilize the grouped structure itself like this :我们可以像这样利用分组结构本身，而不是对完整数据使用过滤：

df |>
  group_by(ID) %>%
  mutate(n_obs_with_2h = purrr::map_dbl(times, ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 & difftime(times[ID == ID], ., units = "hours") >= 0))) %>%
  ungroup() 

# A tibble: 6 × 3
#     ID times               n_obs_with_2h
#  <dbl> <dttm>                      <dbl>
#     12 2021-01-02 10:00:00             2
#     12 2021-01-02 11:00:00             2
#     12 2021-01-02 13:00:00             1
#     13 2021-01-02 13:00:00             3
#     13 2021-01-02 14:00:00             2
#     13 2021-01-02 15:00:00             1

See benchmark: (Even if the provided data is too small for this to be reliable. That being said I would expect it to be even faster on a bigger set)查看基准：（即使提供的数据太小而无法可靠。话虽如此，我希望它在更大的集合上更快）

fun_original <- function(df) {
  
  df %>% group_by(ID) %>% 
    mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times), 
                                           .f = function(i, t, data) {
                                             n <- data %>%
                                               filter(ID == i) %>%
                                               filter(between(as.double.difftime(times-t, units = "hours"),
                                                              0, 2)) %>%
                                               nrow()
                                             return(n)
                                           }, data = .)) %>% ungroup()
  
}

fun_new <- function(df) {
  
  df |>
    group_by(ID) |>
    mutate(n_obs_with_2h = purrr::map_dbl(times, ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 & difftime(times[ID == ID], ., units = "hours") >= 0))) |>
    ungroup() 
  
}

bench::mark(fun_original(df), fun_new(df))

# A tibble: 2 × 13
#  expression            min   median `itr/sec` mem_alloc #`gc/sec` n_itr  n_gc total_time result  
#  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>  
#  fun_original(df)  15.53ms  16.38ms      59.9   45.77KB     15.6    23     6      384ms <tibble>
#  fun_new(df)        1.74ms   1.95ms     486.     6.02KB     10.8   224     5      461ms <tibble>

R：计算时间间隔内的行数

问题描述

2 个解决方案

解决方案1
0 2022-06-30 14:51:16

解决方案2
0 2022-06-30 16:16:08

R：计算时间间隔内的行数

问题描述

2 个解决方案

解决方案1 0 2022-06-30 14:51:16

解决方案2 0 2022-06-30 16:16:08

解决方案1
0 2022-06-30 14:51:16

解决方案2
0 2022-06-30 16:16:08