Removing rows if they occur within a certain time of first instance by a group value in R

This is a follow-up question to "Removing rows if they occur within a certain time of each other by a group value in R".

I have modified the df presented there to include a particular case I am trying to filter.

This is a quick way to filter out rows that are less than 5 minutes from the previous row of the same ID, using tidyverse (similar to the solution posted by PaulS in the original question).

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

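df %>% 
  group_by(ID) %>% 
  # Seconds since the previous row of the same ID. The units are fixed explicitly
  # so the comparison with 5*60 does not depend on difftime's automatic units.
  mutate(d = abs(difftime(lag(Timestamp), Timestamp, units = "secs")),
         keep = is.na(d) | d > 5*60) %>% 
  filter(keep) %>% 
  select(-d, -keep) %>% 
  arrange(Row)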

This is the result:

# A tibble: 8 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B   

This is not what I want, because the time difference is always computed against the previous row (using lag). This means that Row 6 and Row 7 are removed because they are less than 5 minutes away from each other and also from Row 5. However, Row 6 is more than five minutes away from the first instance in group A. It should be kept and become the first instance of a new interval, and Row 7 should then be removed because of its distance to the new first instance in Row 6.

What I have been unable to work out (without a loop) is how to define these 5-minute interval groups, not from successive rows, but from the first instance that occurs more than 5 minutes after the previous first instance.
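To make the difference between the two measures concrete, here is a small base R sketch on the group A timestamps from the example. The variable names (a, rows, since_prev, since_first) are illustrative only, and the UTC time zone is an assumption made to keep the arithmetic unambiguous.

```r
# Timestamps of group A from the example, parsed as date-times (UTC assumed).
a <- as.POSIXct(c("0020-06-29 12:27:00", "0020-06-29 12:28:30",
                  "0020-06-29 12:31:30", "0020-06-29 12:33:30",
                  "0020-06-29 12:33:45", "0020-06-29 12:55:00"), tz = "UTC")
rows <- c(2, 4, 5, 6, 7, 11)

# Seconds since the previous row: this is what the lag() approach measures.
since_prev <- c(NA, diff(as.numeric(a)))

# Seconds since the first row of the group (Row 2).
since_first <- as.numeric(a) - as.numeric(a[1])

data.frame(Row = rows, since_prev, since_first)
```

Row 6 is only 120 seconds after Row 5, but 390 seconds after Row 2, so it passes the 5-minute threshold only when measured from the first instance.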

EDIT 1: Possible solution using a hybrid tidyverse-loop approach:

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

loop <- TRUE

while(loop) {
  
  df <- df %>% 
    group_by(ID) %>% 
    mutate(d = abs(difftime(lag(Timestamp), Timestamp, units="secs")),
           first = is.na(d) | d > 300,
           prev_first = lag(first),
           keep = first | is.na(d) | (prev_first & d > 300) | (!first & !prev_first))
  
  if (all(df$keep)) loop <- FALSE
  
  df <- df %>% filter(keep)
  
}

df <- df %>% 
  select(Row, Timestamp, ID) %>% 
  arrange(Row)

This is the actual expected result:

# A tibble: 9 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    6 0020-06-29 12:33:30 A    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B 

EDIT 2: More clarifications. This is the df ordered by ID, with a comment on each row stating whether it should be kept or removed:

  Row Timestamp           ID   
<dbl> <chr>               <chr>
    2 0020-06-29 12:27:00 A  <-- Keep (first in interval)  
    4 0020-06-29 12:28:30 A  <-- Remove (<5 mins from Row 2)   
    5 0020-06-29 12:31:30 A  <-- Remove (<5 mins from Row 2)  
    6 0020-06-29 12:33:30 A  <-- Keep (first in interval, >5 mins from previous first Row 2)  
    7 0020-06-29 12:33:45 A  <-- Remove (<5 mins from Row 6)    
   11 0020-06-29 12:55:00 A  <-- Keep (first in interval, >5 mins from previous first Row 6)  
    1 0020-06-29 12:14:00 B  <-- Keep (first in interval)
    3 0020-06-29 12:27:22 B  <-- Keep (first in interval, >5 mins from previous first Row 1)
    8 0020-06-29 12:43:00 B  <-- Keep (first in interval, >5 mins from previous first Row 3)
   10 0020-06-29 12:45:00 B  <-- Remove (<5 mins from Row 8)
   13 0020-06-29 13:04:00 B  <-- Keep (first in interval, >5 mins from previous first Row 8)  
    9 0020-06-29 12:44:00 C  <-- Keep (first in interval)  
   12 0020-06-29 12:57:00 C  <-- Keep (first in interval, >5 mins from previous first Row 9)

Although this solution is not terrible, is there any way to get rid of the while loop?

You can create a recursive function, such as this one, based on this answer:

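# d: vector of date-times for one group; ind: index of the current first instance.
# Returns the indices of all first instances. Uses dplyr::first().
f <- function(d, ind = 1) {
  # Index of the first element more than 5 minutes after d[ind] (NA if none)
  ind.next <- first(which(difftime(d, d[ind], units = "mins") > 5))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}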

Then, for each ID, use slice to select the rows returned by the custom function applied to Timestamp:

library(tidyverse)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  group_by(ID) %>%
  slice(f(Timestamp))

Output

    Row Timestamp           ID   
  <dbl> <dttm>              <chr>
1     2 0020-06-29 12:27:00 A    
2     6 0020-06-29 12:33:30 A    
3    11 0020-06-29 12:55:00 A    
4     1 0020-06-29 12:14:00 B    
5     3 0020-06-29 12:27:22 B    
6     8 0020-06-29 12:43:00 B    
7    13 0020-06-29 13:04:00 B    
8     9 0020-06-29 12:44:00 C    
9    12 0020-06-29 12:57:00 C 
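As a possible alternative to the recursive function f above, the same "carry the current first instance forward" idea can be sketched without recursion using base R's Reduce with accumulate = TRUE. The helper name keep_anchors and its gap_secs parameter are hypothetical (they are not part of the answer), and the sketch assumes the timestamps are POSIXct and sorted in increasing order within each group.

```r
# Hypothetical helper (not from the answer): returns the indices to keep.
# Assumes `d` is POSIXct and sorted in increasing order within the group.
keep_anchors <- function(d, gap_secs = 300) {
  if (length(d) == 0) return(integer(0))
  secs <- as.numeric(d)
  # Carry the current anchor forward; switch to a new anchor as soon as
  # a timestamp is more than gap_secs after the current one.
  anchor <- Reduce(function(a, t) if (t - a > gap_secs) t else a,
                   secs, accumulate = TRUE)
  # A row is kept when it starts a new anchor (the first row always does).
  which(c(TRUE, diff(anchor) != 0))
}

# Group A from the example (UTC assumed), original rows 2, 4, 5, 6, 7, 11.
a <- as.POSIXct(c("0020-06-29 12:27:00", "0020-06-29 12:28:30",
                  "0020-06-29 12:31:30", "0020-06-29 12:33:30",
                  "0020-06-29 12:33:45", "0020-06-29 12:55:00"), tz = "UTC")
kept <- keep_anchors(a)
kept
```

For group A this selects positions 1, 4 and 6, which correspond to Rows 2, 6 and 11. Under the same assumptions, it could be used in the pipeline above as slice(keep_anchors(Timestamp)) in place of slice(f(Timestamp)).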
