Filter years with observations of all categories in a variable using dplyr

I have a dataset in which each row is an individual observation of year and stage, and where there can be zero to multiple observations of a given stage in a given year:

df <- data.frame(year = c(2000, 2000, 2000, 2000, 2001, 2001, 
                          2001, 2002, 2002, 2003, 2003, 2003),
                 stage = c("a", "a", "a", "b", "b", "b",
                           "b", "a", "b", "a", "a", "a")) 
df
##    year stage
## 1  2000     a
## 2  2000     a
## 3  2000     a
## 4  2000     b
## 5  2001     b
## 6  2001     b
## 7  2001     b
## 8  2002     a
## 9  2002     b
## 10 2003     a
## 11 2003     a
## 12 2003     a

I want to filter the data to select only the years for which there are observations of both stages a and b (in this case years 2000 and 2002). I have figured out the following way to do this with dplyr and tidyr:

library(dplyr) 
library(tidyr) 

yrs <- df %>% 
  group_by(year, stage) %>%
  summarise(n = n()) %>%
  spread(stage, -year) %>% 
  na.omit %>% 
  pull(year) 

yrs
## [1] 2000 2002

filter(df, year %in% yrs)
##   year stage
## 1 2000     a
## 2 2000     a
## 3 2000     a
## 4 2000     b
## 5 2002     a
## 6 2002     b

This seems a bit clunky and might not scale up well for very large datasets. Is there any simpler, more straightforward way to subset these years using dplyr without calling tidyr::spread?

You can use group_by %>% filter. For each group, use all(c('a', 'b') %in% stage) to check whether both a and b occur in the stage column, and filter the group based on the result:

df %>% group_by(year) %>% filter(all(c('a', 'b') %in% stage))

# A tibble: 6 x 2
# Groups:   year [2]
#   year  stage
#  <dbl> <fctr>
#1  2000      a
#2  2000      a
#3  2000      a
#4  2000      b
#5  2002      a
#6  2002      b
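
If the set of stages is not known in advance, the same check can be written against whatever stages occur in the data. This is only a minimal sketch, assuming df as defined in the question; all_stages and res are names introduced here for illustration:

```r
library(dplyr)

# Example data from the question
df <- data.frame(year = c(2000, 2000, 2000, 2000, 2001, 2001,
                          2001, 2002, 2002, 2003, 2003, 2003),
                 stage = c("a", "a", "a", "b", "b", "b",
                           "b", "a", "b", "a", "a", "a"))

# Every stage that occurs anywhere in the data (illustrative name)
all_stages <- unique(df$stage)

res <- df %>%
  group_by(year) %>%
  # keep a year only if each stage appears at least once in it
  filter(all(all_stages %in% stage)) %>%
  ungroup()

unique(res$year)
## [1] 2000 2002
```

Computing all_stages once, outside the pipeline, avoids hardcoding c('a', 'b') and keeps the per-group check unchanged.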

Maybe this will work for you:

df %>% group_by(year) %>% 
       filter(length(unique(stage)) == 2)
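
Note that comparing against the literal 2 assumes stage has exactly two categories. A hedged variant, assuming df as defined in the question, counts the categories from the data instead; n_stages and res are names introduced here for illustration:

```r
library(dplyr)

# Example data from the question
df <- data.frame(year = c(2000, 2000, 2000, 2000, 2001, 2001,
                          2001, 2002, 2002, 2003, 2003, 2003),
                 stage = c("a", "a", "a", "b", "b", "b",
                           "b", "a", "b", "a", "a", "a"))

# Number of distinct stages in the whole data set (illustrative name)
n_stages <- n_distinct(df$stage)

res <- df %>%
  group_by(year) %>%
  # keep a year only if it contains as many distinct stages as the full data
  filter(n_distinct(stage) == n_stages) %>%
  ungroup()

unique(res$year)
## [1] 2000 2002
```

Here n_distinct(stage) is dplyr's shorthand for length(unique(stage)), so the logic is the same as above with the count taken from the data.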
