
Drop all rows from data frame that follow a filter threshold using dplyr

This feels like a common enough task that I assume there's an established function/method for accomplishing it. I'm imagining a function like dplyr::filter_after() but there doesn't seem to be one.

Here's the method I'm using as a starting point:

#Setup:
library(dplyr)
threshold <- 3
test.df <- data.frame("num"=c(1:5,1:5),"let"=letters[1:10])

#Drop the first row where num reaches the threshold, and every row after it:
out.df <- test.df %>%
  mutate(pastThreshold = cumsum(num>=threshold)) %>%
  filter(pastThreshold==0) %>%
  dplyr::select(-pastThreshold)

This produces the desired output:

> out.df
  num let
1   1   a
2   2   b

Is there another solution that's less verbose?

You can do:

test.df %>%
 slice(1:which.max(num == threshold)-1)

  num let
1   1   a
2   2   b
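The index expression above depends on operator precedence, and on how which.max() behaves when nothing matches. The base R sketch below shows both; the variable names idx_a, idx_b and no_hit are only illustrative.

```r
# `:` binds more tightly than binary minus, so 1:n - 1 is (1:n) - 1.
n <- 3
idx_a <- 1:n - 1     # 0 1 2 (slice() drops the 0, leaving rows 1 and 2)
idx_b <- 1:(n - 1)   # 1 2, the range most readers expect
print(idx_a)
print(idx_b)

# which.max() on an all-FALSE vector returns 1, not NA,
# so when no value equals the threshold the index collapses to 0.
no_hit <- which.max(c(1, 2) == 3)
print(no_hit)        # 1
```

So when the threshold value never occurs, this slice call would be expected to return no rows, whereas the cumsum-based filter keeps every row.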

We can use the same cumsum condition directly in filter, without creating an extra column and removing it afterwards:

library(dplyr)
test.df %>% 
     filter(cumsum(num>=threshold) == 0)
#   num let
#1   1   a
#2   2   b
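To see why this keeps only the first two rows, and not the later rows where num drops back to 1 and 2, it can help to look at the intermediate vectors. This is a base R sketch on the same num values; hits, running and keep are illustrative names.

```r
num <- c(1:5, 1:5)
threshold <- 3

hits <- num >= threshold   # TRUE wherever the threshold is met
running <- cumsum(hits)    # count of hits seen so far
keep <- running == 0       # TRUE only before the first hit

print(running)    # 0 0 1 2 3 3 3 4 5 6
print(num[keep])  # 1 2
```

Because the running count never returns to zero, rows 6 and 7 are dropped even though their values are below the threshold.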

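Another option is match with slice:

# match() returns the position of the first num equal to threshold - 1,
# so this assumes that value occurs right before the threshold is reached.
test.df  %>%
    slice(seq_len(match(threshold-1, num)))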

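Another option is rleid from data.table:

# rleid() numbers consecutive runs of equal values; run 1 is the rows before the threshold
# (assuming the first row is below the threshold).
# Calling it as data.table::rleid avoids attaching data.table, which masks some dplyr functions.
test.df %>%
     filter(data.table::rleid(num >= threshold) == 1)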

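dplyr provides the window functions cumany() and cumall(), which let filter keep the rows before or after a condition first changes: cumall(x) stays TRUE until the first FALSE in x, and cumany(x) becomes TRUE at the first TRUE in x and stays TRUE from there on. See the dplyr documentation for cumall.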

test.df %>% 
  filter(cumall(num<threshold)) #all rows until condition violated for first time
#   num let
# 1   1   a
# 2   2   b
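As a rough illustration of both functions, and of what a filter_after()-style wrapper might look like, here is a minimal sketch. filter_until() is a hypothetical helper written for this example and is not part of dplyr; it assumes the condition contains no NA values.

```r
library(dplyr)

cumall(c(TRUE, TRUE, FALSE, TRUE))    # TRUE TRUE FALSE FALSE
cumany(c(FALSE, FALSE, TRUE, FALSE))  # FALSE FALSE TRUE TRUE

# Hypothetical helper (not a dplyr function): keep rows until `cond` is first TRUE.
filter_until <- function(.data, cond) {
  filter(.data, cumall(!({{ cond }})))
}

test.df <- data.frame(num = c(1:5, 1:5), let = letters[1:10])
threshold <- 3

before <- test.df %>% filter_until(num >= threshold)          # rows 1 and 2
after  <- test.df %>% filter(cumany(num >= threshold))        # row 3 onward
```

Here before holds the two rows ahead of the first threshold hit and after holds the remaining eight, so the two results partition the original data frame.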
