Get all rows fulfilling a condition by group with R data.table

Say we have this toy data.table:

library(data.table)
prueba <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)

id kk rrr
1 FA   1
1  N   2
1  N   3
1  N   4
2 NA   5
2 FA   6
3  N   7
3 FA   8
4  N   9

We want to retrieve all rows belonging to a given id if that id has at least one "FA" value in the kk column.

So far I have only managed to do it this way:

prueba[id %in% prueba[, any(kk == "FA", na.rm = TRUE),
   by = id]$id[prueba[, any(kk == "FA", na.rm = TRUE), by = id]$V1], ]

id kk rrr
1 FA   1
1  N   2
1  N   3
1  N   4
2 NA   5
2 FA   6
3  N   7
3 FA   8

(We get all rows with id = 1, 2 and 3.)

But I think this is too long and not optimized.

How would you do it more simply with data.table?

I'm not sure it is any more optimized, but here is a cleaned-up version using dplyr:

library(dplyr)
prueba %>% 
    group_by(id) %>% 
    filter('FA'%in%kk)

# A tibble: 8 x 3
# Groups:   id [3]
     id    kk   rrr
  <dbl> <chr> <int>
1     1    FA     1
2     1     N     2
3     1     N     3
4     1     N     4
5     2  <NA>     5
6     2    FA     6
7     3     N     7
8     3    FA     8

For the data.table case, I would simplify your code to:

prueba  <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)  

prueba[id %in% unique(prueba[kk=="FA",id])]

The output is:

   id kk rrr
1:  1 FA   1
2:  1  N   2
3:  1  N   3
4:  1  N   4
5:  2 NA   5
6:  2 FA   6
7:  3  N   7
8:  3 FA   8 
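
This works despite the NA in kk because data.table treats an NA in a logical i expression as FALSE, so the row with a missing kk is simply not selected and no na.rm handling is needed. A step-by-step sketch of the same idea on the toy data (the name ids_fa is only for illustration):

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# Step 1: ids that have at least one "FA"; the NA row is dropped by i
ids_fa <- prueba[kk == "FA", unique(id)]

# Step 2: keep every row whose id is in that set
res <- prueba[id %in% ids_fa]
```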

I've been comparing the different solutions with microbenchmark:

prueba <- data.table(id=rep(c(1,1,1,1,2,2,3,3,4),1000000), kk=rep(c("FA", "N","N","N",NA,"FA","N", "FA", "N"),1000000), rrr=rep(1:9,1000000))

prueba[, if(any(kk == "FA")) .SD, by= id]               # docendo
prueba[id %in% unique(prueba[kk == "FA", id])]          # lmo
prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id,]   # eddi
prueba[id %in% prueba[,any(kk=="FA", na.rm=T),by=id]
   $id[prueba[,any(kk=="FA", na.rm=T),by=id]$V1],]      # skan
prueba %>%   group_by(id) %>%   filter('FA'%in%kk)      # Andrew
prueba[prueba[kk == "FA", .(id)], on="id"]              # lmo

     min       lq     mean   median       uq      max  name
2.206436 2.211022 2.258038 2.215607 2.283839 2.352071  docendo
1.456590 1.472334 1.596654 1.488077 1.666687 1.845296  lmo
2.767113 2.869260 2.953024 2.971408 3.045980 3.120552  eddi
3.431671 3.437914 3.451760 3.444157 3.461804 3.479451  skan
2.088516 2.247807 2.313196 2.407098 2.425535 2.443973  Andrew
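
Before reading much into the timings, it may be worth confirming that the candidates return the same rows. A small sanity check on the toy data, covering three of the data.table variants (the res_* and same_rows names are only for illustration, and nothing is timed here):

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# Grouped filter: return the group's rows only if it contains an "FA"
res_group <- prueba[, if (any(kk == "FA", na.rm = TRUE)) .SD, by = id]

# Two-step filter: collect the matching ids, then subset with %in%
res_in <- prueba[id %in% unique(prueba[kk == "FA", id])]

# Row-index variant: groups without an "FA" yield no rows, so their id is absent
res_idx <- prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id]

# rrr is unique per row, so comparing it compares the selected rows
same_rows <- identical(res_group$rrr, res_in$rrr) &&
  identical(res_in$rrr, res_idx$rrr)
```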

The last solution by lmo (the join) doesn't work on this data; it fails with:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.
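
The message points at duplicate key values in i: prueba[kk == "FA", .(id)] returns one row per "FA" occurrence, so in the large table each id is repeated many times in i and every copy joins to all rows of that id in x. A minimal sketch of a likely fix, assuming the intent was to keep each matching row once, is to de-duplicate i before joining. It is shown on the toy data and has not been timed:

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# One row per id that has an "FA"; unique() removes repeated ids
ids <- unique(prueba[kk == "FA", .(id)])

# Join prueba to the de-duplicated ids, so each row is matched at most once
res_join <- prueba[ids, on = "id"]
```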

I expected to see a much bigger difference between the methods; maybe a different dataset would show one. The fastest method so far seems to be:

prueba[id %in% unique(prueba[kk == "FA", id])] 

I guess there must be better options using .I, .GRP or similar special symbols.
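
One possible variant along those lines, offered as an untimed sketch rather than a proven improvement: return the row numbers (.I) of every group that contains an "FA", then subset the table once with them. The names idx and res_i are only for illustration.

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# For each id, return its row numbers only if the group contains an "FA";
# groups that fail the test return NULL and are dropped from the result
idx <- prueba[, if (any(kk == "FA", na.rm = TRUE)) .I, by = id]$V1

# Subset once by row number
res_i <- prueba[idx]
```

Whether this beats the %in% approach would need to be measured on the large table.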
