Keeping only the x largest groups with data.table
I recently started using the data.table package in R, and I have run into a problem that I do not know how to tackle with data.table.
Sample data:
set.seed(1)
library(data.table)
dt = data.table(group=c("A","A","A","B","B","B","C","C"),value = runif(8))
I can add a group count with the statement:
dt[,groupcount := .N ,group]
But now I only want to keep the x groups with the largest value for groupcount. Let's assume x=1 for the example.
I tried chaining as follows:
dt[,groupcount := .N ,group][groupcount %in% head(sort(unique(groupcount),decreasing=TRUE),1)]
But since groups A and B both have three elements, they both remain in the data.table. I only want the x largest groups where x=1, so I only want one of the groups (A or B) to remain. I assume this can be done in a single line with data.table. Is this true, and if so, how?
To clarify: x is an arbitrarily chosen number here. The function should also work with x=3, where it would return the 3 largest groups.
Here is a method that uses a join.
x <- 1
dt[dt[, .N, by=group][order(-N)[1:x]], on="group"]
group value N
1: A 0.2655087 3
2: A 0.3721239 3
3: A 0.5728534 3
The inner data.table is aggregated to count the observations in each group, and the positions of the x largest groups are retrieved by subsetting the result of order with 1:x. The resulting table is then joined onto the original by group.
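One thing to watch with the join is that 1:x indexes past the last group when x is larger than the number of groups, which produces NA positions. A minimal sketch of a wrapper that avoids this by using head() is below; the function name keep_largest_groups is hypothetical and not part of data.table.

```r
library(data.table)

set.seed(1)
dt <- data.table(group = c("A","A","A","B","B","B","C","C"), value = runif(8))

# Hypothetical helper (name is an assumption, not from the original answer):
# keep the rows belonging to the x largest groups.
keep_largest_groups <- function(dt, x) {
  counts <- dt[, .N, by = group]      # one row per group with its size N
  top <- head(counts[order(-N)], x)   # head() returns at most x rows, never NA rows
  dt[top, on = "group"]               # join keeps only rows of the selected groups
}

res1 <- keep_largest_groups(dt, 1)    # rows of a single group
res5 <- keep_largest_groups(dt, 5)    # only 3 groups exist, so all rows are kept
```

Ties are resolved by the order in which groups first appear in the data, so this sketch is deterministic rather than random.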
We can do the following, which uses sample to pick among the groups tied for the largest count:
x <- 1
dt[dt[, {tbl <- table(group)
nm <- names(tbl)[tbl==max(tbl)]
if(length(nm) < x) rep(TRUE, .N)
else group %in% sample(names(tbl)[tbl==max(tbl)], x)}]]
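If random tie-breaking is wanted for any x, one possible variation is to rank the group sizes with base R's rank() and ties.method = "random". This is only a sketch under that assumption, not the approach of the answer above; the names counts, rk and keep are illustrative.

```r
library(data.table)

set.seed(1)
dt <- data.table(group = c("A","A","A","B","B","B","C","C"), value = runif(8))

x <- 1
counts <- dt[, .N, by = group]
# Rank 1 is the largest group; tied groups receive their ranks in random order.
counts[, rk := rank(-N, ties.method = "random")]
keep <- counts[rk <= x, group]   # names of the x top-ranked groups
res <- dt[group %in% keep]       # rows of those groups only
```

With x <- 1 this keeps either A or B, and which one is chosen depends on the random seed.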
How about making use of the order of groupcount?
setorder(dt, -groupcount)
x <- 1
dt[group %in% dt[ , unique(group)][1:x] ]
# group value groupcount
# 1: A 0.2655087 3
# 2: A 0.3721239 3
# 3: A 0.5728534 3
x <- 3
dt[group %in% dt[ , unique(group)][1:x] ]
# group value groupcount
# 1: A 0.2655087 3
# 2: A 0.3721239 3
# 3: A 0.5728534 3
# 4: B 0.9082078 3
# 5: B 0.2016819 3
# 6: B 0.8983897 3
# 7: C 0.9446753 2
# 8: C 0.6607978 2
## alternative syntax
# dt[group %in% unique(dt$group)[1:x] ]
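Note that setorder reorders dt by reference, so the original row order is lost. If that matters, a sketch along the same lines that leaves dt untouched could compute the top groups from an aggregated copy instead; the variable name top_groups is illustrative.

```r
library(data.table)

set.seed(1)
dt <- data.table(group = c("A","A","A","B","B","B","C","C"), value = runif(8))

x <- 1
# Count rows per group, sort the counts (not dt itself), take the first x group names.
top_groups <- dt[, .N, by = group][order(-N), head(group, x)]
res <- dt[group %in% top_groups]   # dt keeps its original row order
```

Because head() is used instead of [1:x], a value of x larger than the number of groups simply returns all groups.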