R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.

For instance, we can use this data:

library(data.table)
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))

If looking for that property in field f2, we'd get the following (note the absence of the (3,2) tuple):

    f1 f2
1:  1  1
2:  2  1
3:  4  3
4:  5  3  

My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.

dt[.N>2,list(.N),by=f2]
   f2 N
1:  1 2
2:  2 1
3:  3 2

The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick either, as it leaves the first occurrence of each duplicated value out of the results.

dt[duplicated(dt$f2)]
   f1 f2
1:  2  1
2:  5  3

So how can I get this done?

Edited to add example

The question is not clear. Based on the title, it looks like we want to extract all groups whose number of rows (.N) is greater than 1.

DT[, if(.N>1) .SD, by=f]
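
As a rough sketch, applying that idiom to the question's sample data (grouping on f2) might look like the following. The name res is only illustrative, and the setcolorder() call is an optional extra, added here because the by column comes first in a grouped result.

```r
library(data.table)

dt <- data.table(f1 = c(1, 2, 3, 4, 5), f2 = c(1, 1, 2, 3, 3))

# Return .SD (the rows of the current group) only when the group has
# more than one row; groups that return NULL are dropped from the result.
res <- dt[, if (.N > 1) .SD, by = f2]

# The grouping column f2 comes first in the result; restore the original order.
setcolorder(res, c("f1", "f2"))
res
```

This should leave the four rows with f2 equal to 1 or 3, matching the output the question asks for.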

But the phrase "value v in field f" in the question makes it confusing.

If I understand what you're after correctly, you'll need to do some compound queries:

library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
#    v1 f
# 1:  1 1
# 2:  2 2
# 3:  3 3
# 4:  4 1
# 5:  5 2
# 6:  6 3
# 7:  7 1
# 8:  8 2
# 9:  9 3
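
One thing to keep in mind with the chain above: N := .N adds the helper column to DT by reference, while the later N := NULL only removes it from the filtered copy, so DT itself keeps the N column. A possible alternative that avoids the helper column, sketched on the same sample data, collects row numbers with .I. The threshold of 1 here follows the question's "more than one row" wording rather than the N > 2 used above, and idx and res are illustrative names.

```r
library(data.table)

DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))

# .I holds the row numbers (in DT) of the current group; the unnamed
# result column gets the default name V1.
idx <- DT[, if (.N > 1) .I, by = f]$V1

# Subset by the collected row numbers, sorted to keep DT's original row order.
res <- DT[sort(idx)]
res
```

With this data, both thresholds select the same nine rows, since the groups f = 1, 2, 3 have three rows each and f = 4 has only one.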
