I have a dataset in which each row is an individual observation with a year and a stage. A given stage can have zero, one, or several observations in a given year:
df <- data.frame(year  = c(2000, 2000, 2000, 2000, 2001, 2001,
                           2001, 2002, 2002, 2003, 2003, 2003),
                 stage = c("a", "a", "a", "b", "b", "b",
                           "b", "a", "b", "a", "a", "a"))
df
## year stage
## 1 2000 a
## 2 2000 a
## 3 2000 a
## 4 2000 b
## 5 2001 b
## 6 2001 b
## 7 2001 b
## 8 2002 a
## 9 2002 b
## 10 2003 a
## 11 2003 a
## 12 2003 a
I want to filter the data to select only the years for which there are observations of both stages a and b (in this case, years 2000 and 2002). I have figured out the following way to do this with dplyr and tidyr:
library(dplyr)
library(tidyr)
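yrs <- df %>%
  group_by(year, stage) %>%
  summarise(n = n()) %>%   # count observations per year and stage
  spread(stage, n) %>%     # one column per stage; NA where a stage is absent
  na.omit() %>%            # keep only years that have both stages
  pull(year)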
yrs
## [1] 2000 2002
filter(df, year %in% yrs)
## year stage
## 1 2000 a
## 2 2000 a
## 3 2000 a
## 4 2000 b
## 5 2002 a
## 6 2002 b
This seems a bit clunky and might not scale well to very large datasets. Is there a simpler, more straightforward way to subset these years using dplyr without calling tidyr::spread?
You can use group_by %>% filter. For each group, all(c('a', 'b') %in% stage) checks whether both a and b occur in the stage column, and the group is kept only if that check is TRUE:
df %>% group_by(year) %>% filter(all(c('a', 'b') %in% stage))
# A tibble: 6 x 2
# Groups: year [2]
# year stage
# <dbl> <fctr>
#1 2000 a
#2 2000 a
#3 2000 a
#4 2000 b
#5 2002 a
#6 2002 b
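Since the question raises scaling, here is a minimal base R sketch of the same selection that avoids grouping altogether: it intersects the years in which a occurs with the years in which b occurs. Whether this is actually faster on a large dataset is an assumption that has not been benchmarked here; the names yrs and res are only illustrative.

```r
# Same example data as in the question.
df <- data.frame(year  = c(2000, 2000, 2000, 2000, 2001, 2001,
                           2001, 2002, 2002, 2003, 2003, 2003),
                 stage = c("a", "a", "a", "b", "b", "b",
                           "b", "a", "b", "a", "a", "a"))

# Years that have at least one "a" AND at least one "b".
yrs <- intersect(df$year[df$stage == "a"],
                 df$year[df$stage == "b"])

# Keep only the rows belonging to those years.
res <- df[df$year %in% yrs, ]
res
```

For more than two required stages, the same idea could be extended by intersecting one vector of years per stage, for example with Reduce(intersect, ...).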
If a and b are the only stages in the data, counting the distinct stages per year may also work for you:
df %>%
  group_by(year) %>%
  filter(length(unique(stage)) == 2)
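One caveat with counting: it only tests how many stages a year has, not which ones. The sketch below uses hypothetical data (df2, with an invented third stage "c" that is not in the question) to show where the two approaches would diverge; n_distinct(stage) is dplyr's equivalent of length(unique(stage)).

```r
library(dplyr)

# Hypothetical data: year 2001 has stages "a" and "c", but no "b".
df2 <- data.frame(year  = c(2000, 2000, 2001, 2001),
                  stage = c("a", "b", "a", "c"))

# Counting distinct stages keeps 2001 too, because it has two stages.
by_count <- df2 %>%
  group_by(year) %>%
  filter(n_distinct(stage) == 2)

# Checking membership by name keeps only 2000.
by_name <- df2 %>%
  group_by(year) %>%
  filter(all(c("a", "b") %in% stage))
```

With the question's data, where only a and b occur, both filters select the same years.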