Removing rows if they occur within a certain time of the first instance, by a group value in R

Question

This is a follow-up question to "Removing rows if they occur within a certain time of each other by a group value in R".

I have modified the df presented there to include a particular case I am trying to filter.

This is the code, with a quick way to filter out rows that are less than 5 minutes apart using tidyverse (similar to the solution posted by PaulS in the original question).

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

df %>% 
  group_by(ID) %>% 
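  # units = "secs" keeps the 5*60 threshold in seconds for every group;
  # with the default units = "auto", difftime() may pick minutes in some groups
  mutate(d = abs(difftime(lag(Timestamp), Timestamp, units = "secs")),
         keep = is.na(d) | d > 5*60) %>% 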
  filter(keep) %>% 
  select(-d, -keep) %>% 
  arrange(Row)

This is the result:

# A tibble: 8 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B

This is not what I want, because the time difference is always computed against the previous row (using lag). This means that Row 6 and Row 7 are removed because they are less than 5 minutes away from each other and also from Row 5. In fact, though, Row 6 is more than five minutes away from the first instance in group A (Row 2). It should be kept and become the first instance of a new interval, and Row 7 would then be removed based on its distance to the new first instance in Row 6.

What I have been unable to work out (without a loop) is how to define these 5-minute interval groups, measured not between successive rows but from the first instance that falls more than 5 minutes after the previous first instance.

EDIT 1: Possible solution using a hybrid tidyverse-loop approach:
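To make the difference between the two rules concrete, here is a minimal base R sketch on the group A timestamps only, written as seconds after 12:27:00. The variable names (`secs`, `window`, `keep_lag`, `keep_anchor`) are illustrative and not part of the original code, and the explicit `for` loop is there only to spell out the intended rule, not as the loop-free solution being asked for.

```r
# Group A timestamps from the example, in seconds after 12:27:00
# (rows 2, 4, 5, 6, 7, 11)
secs <- c(0, 90, 270, 390, 405, 1680)
window <- 300  # 5 minutes, in seconds

# lag-style rule: compare each row with the row directly before it
gap_to_previous <- c(NA, diff(secs))
keep_lag <- is.na(gap_to_previous) | gap_to_previous > window

# anchor-style rule: compare each row with the most recently kept row
keep_anchor <- logical(length(secs))
anchor <- -Inf
for (i in seq_along(secs)) {
  if (secs[i] - anchor > window) {
    keep_anchor[i] <- TRUE   # this row starts a new interval
    anchor <- secs[i]
  }
}

keep_lag     # TRUE FALSE FALSE FALSE FALSE TRUE  -> rows 2 and 11
keep_anchor  # TRUE FALSE FALSE  TRUE FALSE TRUE  -> rows 2, 6 and 11
```

The lag-style rule drops Row 6 because it sits only 120 seconds after Row 5, while the anchor-style rule keeps it because it sits 390 seconds after Row 2, the last kept row.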

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

loop <- TRUE

while(loop) {
  
  df <- df %>% 
    group_by(ID) %>% 
    mutate(d = abs(difftime(lag(Timestamp), Timestamp, units="secs")),
           first = is.na(d) | d > 300,
           prev_first = lag(first),
           keep = first | is.na(d) | (prev_first & d > 300) | (!first & !prev_first))
  
  if (all(df$keep)) loop <- FALSE
  
  df <- df %>% filter(keep)
  
}

df <- df %>% 
  select(Row, Timestamp, ID) %>% 
  arrange(Row)

This is the real expected result:

# A tibble: 9 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    6 0020-06-29 12:33:30 A    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B

EDIT 2: More clarifications. This is the df ordered by ID, with a comment on each row stating whether it should be kept or removed:

  Row Timestamp           ID   
<dbl> <chr>               <chr>
    2 0020-06-29 12:27:00 A  <-- Keep (first in interval)  
    4 0020-06-29 12:28:30 A  <-- Remove (<5 mins from Row 2)   
    5 0020-06-29 12:31:30 A  <-- Remove (<5 mins from Row 2)  
    6 0020-06-29 12:33:30 A  <-- Keep (first in interval, >5 mins from previous first Row 2)  
    7 0020-06-29 12:33:45 A  <-- Remove (<5 mins from Row 6)    
   11 0020-06-29 12:55:00 A  <-- Keep (first in interval, >5 mins from previous first Row 6)  
    1 0020-06-29 12:14:00 B  <-- Keep (first in interval)
    3 0020-06-29 12:27:22 B  <-- Keep (first in interval, >5 mins from previous first Row 1)
    8 0020-06-29 12:43:00 B  <-- Keep (first in interval, >5 mins from previous first Row 3)
   10 0020-06-29 12:45:00 B  <-- Remove (<5 mins from Row 8)
   13 0020-06-29 13:04:00 B  <-- Keep (first in interval, >5 mins from previous first Row 8)  
    9 0020-06-29 12:44:00 C  <-- Keep (first in interval)  
   12 0020-06-29 12:57:00 C  <-- Keep (first in interval, >5 mins from previous first Row 9)

Even though this solution is not terrible, do you know whether the while statement can be removed in any way?

Answer 1

You can create a recursive function, such as this one, based on this answer:

f <- function(d, ind = 1) {
  ind.next <- first(which(difftime(d, d[ind], units = "mins") > 5))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}

Then, for each ID, use slice to select rows based on Timestamp with the custom function:

library(tidyverse)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  group_by(ID) %>%
  slice(f(Timestamp))

Output

    Row Timestamp           ID   
  <dbl> <dttm>              <chr>
1     2 0020-06-29 12:27:00 A    
2     6 0020-06-29 12:33:30 A    
3    11 0020-06-29 12:55:00 A    
4     1 0020-06-29 12:14:00 B    
5     3 0020-06-29 12:27:22 B    
6     8 0020-06-29 12:43:00 B    
7    13 0020-06-29 13:04:00 B    
8     9 0020-06-29 12:44:00 C    
9    12 0020-06-29 12:57:00 C
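The recursive function calls itself once per kept row, so very long groups may run into R's nesting limits. As a hedged alternative (not part of the original answer), the same "distance to the last kept row" rule can be sketched with base R's `Reduce(..., accumulate = TRUE)`, which carries the current anchor along the vector. The helper name `keep_first_in_window` and its `window_secs` parameter are hypothetical, and the sketch assumes the timestamps are already sorted within each group.

```r
# Hypothetical helper: TRUE for rows that start a new interval.
# Assumes `timestamps` is sorted in ascending order.
keep_first_in_window <- function(timestamps, window_secs = 300) {
  secs <- as.numeric(timestamps)
  # Carry the current anchor forward; replace it whenever a value
  # lies more than window_secs after it
  anchor <- Reduce(
    function(current_anchor, x) {
      if (x - current_anchor > window_secs) x else current_anchor
    },
    secs, accumulate = TRUE
  )
  # A row is kept when it is the first row equal to its own anchor
  secs == anchor & !duplicated(anchor)
}

# The example data from the question, as plain vectors
ts <- as.POSIXct(c(
  "0020-06-29 12:14:00", "0020-06-29 12:27:00", "0020-06-29 12:27:22",
  "0020-06-29 12:28:30", "0020-06-29 12:31:30", "0020-06-29 12:33:30",
  "0020-06-29 12:33:45", "0020-06-29 12:43:00", "0020-06-29 12:44:00",
  "0020-06-29 12:45:00", "0020-06-29 12:55:00", "0020-06-29 12:57:00",
  "0020-06-29 13:04:00"
), tz = "UTC")
id <- c("B", "A", "B", "A", "A", "A", "A", "B", "C", "B", "A", "C", "B")
rows <- 1:13

# Apply the helper within each ID; ave() returns 0/1, so convert back to logical
keep <- as.logical(ave(as.numeric(ts), id, FUN = keep_first_in_window))
kept_rows <- rows[keep]
kept_rows  # 1 2 3 6 8 9 11 12 13
```

In a dplyr pipeline, the same helper could presumably be used as `filter(keep_first_in_window(Timestamp))` after converting Timestamp with `as.POSIXct` and calling `group_by(ID)`, which keeps the rows in their original order instead of reordering them by group.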
