Keep only the x largest groups using data.table

Question

I have recently started using the data.table package in R, and I have run into a problem that I do not know how to solve with it.

Sample data:

set.seed(1)
library(data.table)
dt = data.table(group=c("A","A","A","B","B","B","C","C"),value = runif(8))

I can add a group count with the statement

dt[,groupcount := .N ,group]

but now I only want to keep the x groups with the largest value for groupcount. Let's assume x=1 for the example.

I tried chaining as follows:

dt[,groupcount := .N ,group][groupcount %in% head(sort(unique(groupcount),decreasing=TRUE),1)]

But since groups A and B both have three elements, they both remain in the data.table. I only want the x largest groups, where x=1, so only one of the groups (A or B) should remain. I assume this can be done in a single line with data.table. Is this true, and if so, how?

To clarify: x is an arbitrarily chosen number here. The solution should also work with x=3, where it would return the 3 largest groups.
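
To make the problem with the chained attempt concrete, here is a minimal sketch that splits it into two steps. The variable name top_counts is introduced only for illustration; everything else comes from the sample data above.

```r
library(data.table)

set.seed(1)
dt <- data.table(group = c("A","A","A","B","B","B","C","C"), value = runif(8))
dt[, groupcount := .N, by = group]

# The filter compares counts, not groups: this is the single largest count.
top_counts <- head(sort(unique(dt$groupcount), decreasing = TRUE), 1)
top_counts

# Every group that has this count passes the %in% test, so ties all survive.
dt[groupcount %in% top_counts, unique(group)]
```

The filter therefore selects the x largest distinct counts rather than x groups, which is why tied groups are kept together.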

Answer 1

Here is a method that uses a join.

x <- 1

dt[dt[, .N, by=group][order(-N)[1:x]], on="group"]
#    group     value N
# 1:     A 0.2655087 3
# 2:     A 0.3721239 3
# 3:     A 0.5728534 3

The inner data.table is aggregated to count the observations per group, and the positions of the x largest groups are retrieved with order(), subset to the first x entries. The resulting table is then joined onto the original by group.
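
The join approach can also be wrapped in a small function so that it works for any x. The sketch below is one possible way to do that, not the answerer's own code: keep_largest_groups is a hypothetical helper name, and head() is swapped in for the [1:x] indexing on the assumption that x may exceed the number of groups.

```r
library(data.table)

set.seed(1)
dt <- data.table(group = c("A","A","A","B","B","B","C","C"), value = runif(8))

# keep_largest_groups() is a hypothetical helper, not part of data.table.
keep_largest_groups <- function(d, x) {
  counts <- d[, .N, by = group]       # one row per group with its size N
  top <- head(counts[order(-N)], x)   # at most x rows, largest groups first
  d[top, on = "group"]                # join: keep rows whose group is in top
}

keep_largest_groups(dt, 1)   # rows of one group of size 3
keep_largest_groups(dt, 3)   # rows of all three groups
keep_largest_groups(dt, 5)   # x exceeds the number of groups: all rows, no NA rows
```

Because head() returns at most as many rows as exist, a large x simply returns every group instead of producing NA indices.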

Answer 2

We can do

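x <- 1
dt[dt[, {
  tbl <- table(group)                 # number of rows per group
  nm <- names(tbl)[tbl == max(tbl)]   # groups tied for the largest count
  # Fewer tied groups than x: every row is kept.
  # Otherwise: x of the tied groups are picked at random.
  if (length(nm) < x) rep(TRUE, .N)
  else group %in% sample(nm, x)
}]]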
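
The code above only ever samples among the groups tied for the maximum count. If random tie-breaking is wanted for an arbitrary x, one possible variant (a sketch under that assumption, not the answerer's code) is to shuffle the per-group counts first and then sort them by size; the names counts, keep and res are introduced here for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(group = c("A","A","A","B","B","B","C","C"), value = runif(8))

x <- 2
counts <- dt[, .N, by = group]   # one row per group with its size N

# Shuffle the groups, then sort by size. data.table's ordering is stable,
# so groups with equal N stay in their shuffled (random) order.
counts <- counts[sample(.N)][order(-N)]

keep <- head(counts$group, x)    # names of the x largest groups
res <- dt[group %in% keep]
```

With x <- 2 on the sample data this keeps A and B; with x <- 1 it keeps either A or B, chosen at random.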

Answer 3

How about making use of the order of the groupcount

setorder(dt, -groupcount)

x <- 1   
dt[group %in% dt[ , unique(group)][1:x] ]

#   group     value groupcount
# 1:     A 0.2655087          3
# 2:     A 0.3721239          3
# 3:     A 0.5728534          3


x <- 3
dt[group %in% dt[ , unique(group)][1:x] ]


#     group     value groupcount
# 1:     A 0.2655087          3
# 2:     A 0.3721239          3
# 3:     A 0.5728534          3
# 4:     B 0.9082078          3
# 5:     B 0.2016819          3
# 6:     B 0.8983897          3
# 7:     C 0.9446753          2
# 8:     C 0.6607978          2

## alternative syntax
# dt[group %in% unique(dt$group)[1:x] ]
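
If reordering dt by reference or adding a groupcount column is not wanted, the same idea can be written as a single chained expression. This is a sketch under that assumption rather than part of the answer above; res is a name introduced for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(group = c("A","A","A","B","B","B","C","C"), value = runif(8))

x <- 1
# Count rows per group, sort by size, take the first x group names, then filter.
res <- dt[group %in% dt[, .N, by = group][order(-N), head(group, x)]]
res
```

Here dt itself keeps its original row order and columns, because the counts live only in the temporary table built inside the filter.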

Answer 1
3 2017-07-28 11:51:11

Answer 2
2 2017-07-28 07:56:00

Answer 3
2 Accepted 2017-07-28 08:25:21
