Different filter rules for groups using dplyr

Sample data:

df <- data.frame(loc.id = rep(1:2, each = 11), 
             x = c(35,51,68,79,86,90,92,93,95,98,100,35,51,68,79,86,90,92,92,93,94,94))

For each loc.id, I want to filter out x <= 95.

df %>% group_by(loc.id) %>% filter(row_number() <= which.max(x >= 95))

          loc.id   x
          <int> <dbl>
       1      1    35
       2      1    51
       3      1    68
       4      1    79
       5      1    86
       6      1    90
       7      1    92
       8      1    93
       9      1    95
      10      2    35

However, in group 2 all the values are less than 95, so I want to keep all values of x for that group. The line above does not do that: it keeps only the first row of group 2.
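
The likely cause is how which.max handles a logical vector with no TRUE in it: it returns the index of the first maximum, and when every element is FALSE the first element is that maximum. A minimal sketch in base R, using two made-up vectors (x_hit and x_miss are illustrative names, not part of the data above):

```r
# Hypothetical vectors for illustration only
x_hit  <- c(35, 93, 95, 98)   # contains a value >= 95
x_miss <- c(35, 51, 94)       # no value >= 95

# which.max() returns the position of the first maximum.
# With a logical vector, TRUE counts as 1 and FALSE as 0.
idx_hit  <- which.max(x_hit >= 95)   # 3: position of the first TRUE
idx_miss <- which.max(x_miss >= 95)  # 1: all FALSE, so the first FALSE wins

print(idx_hit)
print(idx_miss)
```

So for a group with no x >= 95, row_number() <= 1 keeps only the first row.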

Perhaps something like this?

df %>%
    group_by(loc.id) %>%
    mutate(n = sum(x > 95)) %>%
    filter(n == 0 | x > 95) %>%
    ungroup() %>%
    select(-n)
## A tibble: 13 x 2
#   loc.id     x
#    <int> <dbl>
# 1      1   98.
# 2      1  100.
# 3      2   35.
# 4      2   51.
# 5      2   68.
# 6      2   79.
# 7      2   86.
# 8      2   90.
# 9      2   92.
#10      2   92.
#11      2   93.
#12      2   94.
#13      2   94.

Note that removing entries where x <= 95 corresponds to retaining entries where x > 95 (not x >= 95).

You can use match to get the index of the first TRUE, and have it return the length of the group via the nomatch parameter when no match is found:

df %>% 
    group_by(loc.id) %>% 
    filter(row_number() <= match(TRUE, x >= 95, nomatch=n()))

# A tibble: 20 x 2
# Groups:   loc.id [2]
#   loc.id     x
#    <int> <dbl>
# 1      1    35
# 2      1    51
# 3      1    68
# 4      1    79
# 5      1    86
# 6      1    90
# 7      1    92
# 8      1    93
# 9      1    95
#10      2    35
#11      2    51
#12      2    68
#13      2    79
#14      2    86
#15      2    90
#16      2    92
#17      2    92
#18      2    93
#19      2    94
#20      2    94
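
To see why this handles the all-below-95 group, here is a small base R sketch of match with a nomatch fallback. The vectors x_hit and x_miss are made up for illustration, and length() stands in for n(), which is only available inside dplyr verbs:

```r
# Hypothetical vectors for illustration only
x_hit  <- c(35, 93, 95, 98)      # first value >= 95 is at position 3
x_miss <- c(35, 51, 90, 94)      # no value >= 95

# match(TRUE, <logical>) gives the position of the first TRUE,
# or the nomatch value when there is none.
cut_hit  <- match(TRUE, x_hit >= 95,  nomatch = length(x_hit))   # 3
cut_miss <- match(TRUE, x_miss >= 95, nomatch = length(x_miss))  # 4 (fallback)

# Keep everything up to and including the cut-off position
kept_hit  <- x_hit[seq_along(x_hit) <= cut_hit]     # 35 93 95
kept_miss <- x_miss[seq_along(x_miss) <= cut_miss]  # 35 51 90 94

print(kept_hit)
print(kept_miss)
```

Unlike which.max, the fallback here is the group length rather than 1, so a group without any match is kept whole.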

Or use the negated, lagged cumsum as the filter condition:

df %>% group_by(loc.id) %>% filter(!lag(cumsum(x >= 95), default=FALSE))
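
A step-by-step sketch of what that condition computes for a single group may help. It uses base R only, with c(0, head(..., -1)) as a stand-in for dplyr's lag(..., default = 0), and a made-up vector x_demo:

```r
# Hypothetical single-group vector for illustration only
x_demo <- c(90, 93, 95, 98, 100)

hit         <- x_demo >= 95               # FALSE FALSE TRUE TRUE TRUE
seen        <- cumsum(hit)                # 0 0 1 2 3: hits so far, inclusive
seen_before <- c(0, head(seen, -1))       # 0 0 0 1 2: hits strictly before this row
keep        <- seen_before == 0           # TRUE TRUE TRUE FALSE FALSE

kept <- x_demo[keep]                      # 90 93 95
print(kept)
```

Rows are kept until one hit has already been seen, so the first x >= 95 is retained. In a group with no hit, the running count stays at 0 and every row is kept.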

A solution using all together with the dplyr package:

library(dplyr)
df %>% group_by(loc.id) %>%
  filter((x > 95) | all(x<=95))  # All x in group are <= 95 OR x > 95

# # Groups: loc.id [2]
# loc.id     x
# <int> <dbl>
# 1      1  98.0
# 2      1 100  
# 3      2  35.0
# 4      2  51.0
# 5      2  68.0
# 6      2  79.0
# 7      2  86.0
# 8      2  90.0
# 9      2  92.0
# 10      2  92.0
# 11      2  93.0
# 12      2  94.0
# 13      2  94.0
