简体   繁体   中英

Remove rows based on first instance to meet a condition

In the following dataset, I want to remove all rows starting at the first instance, sorted by Time and grouped by ID , that Var is TRUE. Put differently, I want to subset all rows for each ID by those which are FALSE up until the first TRUE, sorted by Time .

ID <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data = data.frame(ID, Time, Var)

data
   ID Time   Var
1   A    3 FALSE
2   B    3 FALSE
3   C    3 FALSE
4   A    6  TRUE
5   B    6  TRUE
6   C    6 FALSE
7   A    9  TRUE
8   B    9  TRUE
9   C    9 FALSE
10  A   12  TRUE
11  B   12 FALSE
12  C   12  TRUE

The desired result for this data frame should be:

 ID Time   Var
  A    3 FALSE
  B    3 FALSE
  C    3 FALSE
  C    6 FALSE
  C    9 FALSE

Note that the solution should not only remove rows where Var == TRUE, but should also remove rows where Var == FALSE but this follows (in Time ) another instance where Var == TRUE for that ID .

I've tried many different things but can't seem to figure this out. Any help is much appreciated!

Here's how to do that with dplyr using group_by and cumsum .

The rationale is that Var is a logical vector where FALSE is equal to 0 and TRUE is equal to 1. cumsum will remain at 0 until it hits the first TRUE.

library(dplyr)
data%>%
  group_by(ID)%>%
  filter(cumsum(Var)<1)

      ID  Time   Var
  <fctr> <dbl> <lgl>
1      A     3 FALSE
2      B     3 FALSE
3      C     3 FALSE
4      C     6 FALSE
5      C     9 FALSE

Here's the equivalent code with data.table :

library(data.table)
data[data[, .I[cumsum(Var) <1], by = ID]$V1]
   ID Time   Var
1:  A    3 FALSE
2:  B    3 FALSE
3:  C    3 FALSE
4:  C    6 FALSE
5:  C    9 FALSE

This data.table solution should work.

library(data.table)
> setDT(data)[, .SD[1:(which.max(Var)-1)], by=ID]
   ID Time   Var
1:  A    3 FALSE
2:  B    3 FALSE
3:  C    3 FALSE
4:  C    6 FALSE
5:  C    9 FALSE

Given that you want all the values up to the first TRUE value, which.max is the way to go.

You can do this with the cumall verb as well:

library(dplyr)

data %>% 
  dplyr::group_by(ID) %>% 
  dplyr::filter(dplyr::cumall(!Var))

  ID     Time Var  
  <chr> <dbl> <lgl>
1 A         3 FALSE
2 B         3 FALSE
3 C         3 FALSE
4 C         6 FALSE
5 C         9 FALSE

cumall(!x): all cases until the first TRUE

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM