
R data.table remove rows conditionally among groups

I have this example dataset. The actual data has millions of rows, so I'd appreciate a data.table solution, but a tidyverse solution would also be fine:

dat1 = data.frame(name = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"), 
              year = c(2015,2016,2017,2015,2016,2016,2017,2017, 2018),
              choice = c("o","o","o","o","o","r","r","o","o")
)
dat1

The logic I need to apply is:

If only choice "o" exists for a given name and year combination, retain the row with "o".

If both "o" and "r" exist for a given name and year combination, retain the row with "r" and drop the row with "o". I don't want to spell out the name and year combinations by hand.

Does this work?

library(dplyr)
dat1 %>% group_by(name, year) %>% filter(all(choice == 'o') | choice == 'r')
# A tibble: 7 x 3
# Groups:   name, year [7]
#   name   year choice
#   <chr> <dbl> <chr> 
# 1 X1     2015 o     
# 2 X1     2016 o     
# 3 X1     2017 o     
# 4 X2     2015 o     
# 5 X2     2016 r     
# 6 X2     2017 r     
# 7 X2     2018 o     

library(data.table)
setDT(dat1)
dat1[, .SD[all(choice == "o") | choice == "r",], by = .(name, year)]
#    name year choice
# 1:   X1 2015      o
# 2:   X1 2016      o
# 3:   X1 2017      o
# 4:   X2 2015      o
# 5:   X2 2016      r
# 6:   X2 2017      r
# 7:   X2 2018      o

(I wrote this before looking at KarthikS's answer, but the logic and the results are identical.)
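
Since the real data has millions of rows, a variant that collects row numbers with .I and subsets the table once may be worth benchmarking against the .SD version. This is only a sketch on the example data; the names keep and res are introduced here for illustration, and whether it is actually faster on the real table is an assumption to verify.

```r
library(data.table)

# Same example data as in the question
dat1 <- data.table(
  name   = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"),
  year   = c(2015, 2016, 2017, 2015, 2016, 2016, 2017, 2017, 2018),
  choice = c("o", "o", "o", "o", "o", "r", "r", "o", "o")
)

# Within each group, .I holds the row numbers in the full table.
# Keep a row if its whole group is "o", or if the row itself is "r".
keep <- dat1[, .I[all(choice == "o") | choice == "r"], by = .(name, year)]$V1

# Subset the original table once, using the collected row numbers
res <- dat1[keep]
res
```

The filtering rule is the same as above; only the way the rows are extracted differs.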

Another option is to convert the column to a factor with levels given in a custom order ('r' before 'o'), drop the unused levels with droplevels, and then keep the rows whose choice matches the first remaining level.

library(dplyr)
dat1 %>%
     group_by(name, year) %>%
     filter(choice %in% levels(droplevels(factor(choice, 
           levels = c('r', 'o'))))[1])
# A tibble: 7 x 3
# Groups:   name, year [7]
#  name   year choice
#  <chr> <dbl> <chr> 
#1 X1     2015 o     
#2 X1     2016 o     
#3 X1     2017 o     
#4 X2     2015 o     
#5 X2     2016 r     
#6 X2     2017 r     
#7 X2     2018 o     

An equivalent option with data.table is:

library(data.table)
setDT(dat1)[dat1[, .I[choice %in% 
       levels(droplevels(factor(choice, 
           levels = c('r', 'o'))))[1]], .(name, year)]$V1]
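
To see why the factor approach picks the right value, the per-group step can be tried in isolation. This is a minimal sketch; preferred is a hypothetical helper name used only for illustration and is not part of either answer.

```r
# Hypothetical helper: returns the value to keep for one group's choices
preferred <- function(choice) {
  f <- factor(choice, levels = c("r", "o"))  # custom order: "r" comes before "o"
  levels(droplevels(f))[1]                   # first level that actually occurs
}

only_o <- preferred(c("o", "o"))  # "r" is unused and gets dropped, leaving "o"
both   <- preferred(c("o", "r"))  # both levels occur, so "r" comes first

only_o
both
```

Rows are then kept when their choice equals this value, which is what the `choice %in% ...` filter does in each group.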
