R data.table remove rows conditionally among groups

I have the example dataset below. The actual data has millions of rows, so I'd prefer a data.table solution, though a tidyverse solution would also be fine:

dat1 = data.frame(name = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"), 
              year = c(2015,2016,2017,2015,2016,2016,2017,2017, 2018),
              choice = c("o","o","o","o","o","r","r","o","o")
)
dat1

The logic I need to apply is:

If a name and year combination contains only choice "o", retain the row with "o".

If a name and year combination contains both "o" and "r", retain the row with "r" and drop the row with "o". I don't want to list the name and year combinations by hand.

Does this work:

library(dplyr)
dat1 %>% group_by(name ,year) %>% filter(all(choice == 'o' )|choice == 'r')
# A tibble: 7 x 3
# Groups:   name, year [7]
#   name   year choice
#   <chr> <dbl> <chr> 
# 1 X1     2015 o     
# 2 X1     2016 o     
# 3 X1     2017 o     
# 4 X2     2015 o     
# 5 X2     2016 r     
# 6 X2     2017 r     
# 7 X2     2018 o     

library(data.table)
setDT(dat1)
dat1[, .SD[all(choice == "o") | choice == "r",], by = .(name, year)]
#    name year choice
# 1:   X1 2015      o
# 2:   X1 2016      o
# 3:   X1 2017      o
# 4:   X2 2015      o
# 5:   X2 2016      r
# 6:   X2 2017      r
# 7:   X2 2018      o

(I generated this before looking at KarthikS's answer, but the logic and the results are identical.)
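
On a table with millions of rows, subsetting `.SD` inside every group may be slower than necessary. The following is a minimal sketch of the same filter using the `.I` row-index idiom, which collects the row numbers to keep and subsets the table once. The names `idx` and `res` are only illustrative, and no timing has been measured here.

```r
library(data.table)

dat1 <- data.table(
  name   = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"),
  year   = c(2015, 2016, 2017, 2015, 2016, 2016, 2017, 2017, 2018),
  choice = c("o", "o", "o", "o", "o", "r", "r", "o", "o")
)

# Within each (name, year) group, collect the global row numbers (.I) of the
# rows that pass the condition: the group is all "o", or the row itself is "r".
idx <- dat1[, .I[all(choice == "o") | choice == "r"], by = .(name, year)]$V1

# Subset the original table once with the collected row numbers.
res <- dat1[idx]
res
```

The condition is unchanged from the `.SD` version; only the way the matching rows are gathered differs.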

Another option is to convert the column to a factor with the levels given in priority order ('r' before 'o'), drop the unused levels with droplevels, and keep the rows that match the first remaining level:

library(dplyr)
dat1 %>%
     group_by(name, year) %>%
     filter(choice %in% levels(droplevels(factor(choice, 
           levels = c('r', 'o'))))[1])
# A tibble: 7 x 3
# Groups:   name, year [7]
#  name   year choice
#  <chr> <dbl> <chr> 
#1 X1     2015 o     
#2 X1     2016 o     
#3 X1     2017 o     
#4 X2     2015 o     
#5 X2     2016 r     
#6 X2     2017 r     
#7 X2     2018 o     

An equivalent option with data.table is:

library(data.table)
setDT(dat1)[dat1[, .I[choice %in% 
       levels(droplevels(factor(choice, 
           levels = c('r', 'o'))))[1]], .(name, year)]$V1]
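
To see why the factor approach picks the right rows, it may help to run the inner expression on the `choice` values of a single group. This is a small sketch; `first_level` is a hypothetical helper introduced only for illustration and is not part of the answer.

```r
# Hypothetical helper: returns the highest-priority choice present in a
# group, with "r" ranked ahead of "o" by the order of the factor levels.
first_level <- function(choice) {
  levels(droplevels(factor(choice, levels = c("r", "o"))))[1]
}

first_level(c("o", "r"))  # "r": both levels are present, and "r" comes first
first_level(c("o", "o"))  # "o": the unused level "r" is dropped
```

Rows whose `choice` equals this value are kept, so a group containing both values keeps only its "r" rows, while an all-"o" group keeps everything.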
