
Subset in data.table when groups are empty

For these data

library(data.table)
set.seed(42)
dat <- data.table(id=1:12, group=rep(1:3, each=4), x=rnorm(12))

> dat
    id group           x
 1:  1     1  1.37095845
 2:  2     1 -0.56469817
 3:  3     1  0.36312841
 4:  4     1  0.63286260
 5:  5     2  0.40426832
 6:  6     2 -0.10612452
 7:  7     2  1.51152200
 8:  8     2 -0.09465904
 9:  9     3  2.01842371
10: 10     3 -0.06271410
11: 11     3  1.30486965
12: 12     3  2.28664539

My goal is to get, from each group, the first id for which x is larger than some threshold, say x > 1.5.

> dat[x>1.5, .SD[1], by=group]
   group id        x
1:     2  7 1.511522
2:     3  9 2.018424

is indeed correct, but I am unhappy about the fact that it silently yields no result for group 1. Instead, for each group in which no id fulfills the condition, I would like it to yield the last id of that group. I see that I could achieve this in two steps

> tmp <- dat[x>1.5, .SD[1], by=group]
> rbind(tmp,dat[!group%in%tmp$group,.SD[.N], by=group])
   group id         x
1:     2  7 1.5115220
2:     3  9 2.0184237
3:     1  4 0.6328626

but I am sure I am not making full use of the data.table capabilities here, which must permit a more elegant solution.

Using data.table, we can check for the condition within each group and pick the matching row, falling back to the last row of the group when nothing matches.

library(data.table)
dat[dat[, if(any(x>1.5)) .I[which.max(x > 1.5)] else .I[.N], by=group]$V1]

#   id group         x
#1:  4     1 0.6328626
#2:  7     2 1.5115220
#3:  9     3 2.0184237
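To see how the call above works, it can help to run the inner step on its own. The following is only a sketch: the x values are typed out from the printed table in the question rather than regenerated, and the names x_vals, idx and res are introduced here for illustration.

```r
library(data.table)

# x values copied from the table printed in the question
x_vals <- c(1.37095845, -0.56469817, 0.36312841, 0.63286260,
            0.40426832, -0.10612452, 1.51152200, -0.09465904,
            2.01842371, -0.06271410, 1.30486965, 2.28664539)
dat <- data.table(id = 1:12, group = rep(1:3, each = 4), x = x_vals)

# Inner step: one row number per group.
# .I holds the row numbers of the current group within dat;
# which.max() on a logical vector returns the position of the first TRUE.
idx <- dat[, if (any(x > 1.5)) .I[which.max(x > 1.5)] else .I[.N], by = group]

# Outer step: use the collected row numbers (column V1) to subset dat.
res <- dat[idx$V1]
print(idx)
print(res)
```

The inner call returns a small table with the columns group and V1, and the outer call simply indexes dat by V1.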

The dplyr translation of the data.table call above would be

library(dplyr)
dat %>%
  group_by(group) %>%
  slice(if(any(x > 1.5)) which.max(x > 1.5) else n())

Or, more efficiently, in data.table

dat[, .SD[{temp = x > 1.5; if (any(temp)) which.max(temp) else .N}], by = group]

Thanks to @IceCreamTouCan, @sindri_baldur and @jangorecki for their valuable suggestions to improve this answer.

You could subset both ways (both of which are optimized by GForce) and then combine them:

D1 = dat[x>1.5, lapply(.SD, first), by=group]
D2 = dat[, lapply(.SD, last), by=group]
rbind(D1, D2[!D1, on=.(group)])

   group id         x
1:     2  7 1.5115220
2:     3  9 2.0184237
3:     1  4 0.6328626

There is some inefficiency here since we are grouping by group three times. I'm not sure whether that will be outweighed by the more efficient calculations in j thanks to GForce. @jangorecki points out that the inefficiency of grouping three times might be mitigated by setting the key first.
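A minimal sketch of what setting the key first might look like, using the x values printed in the question. Whether this actually speeds things up is not measured here; the names x_vals and res are introduced only for this example.

```r
library(data.table)

# x values copied from the table printed in the question
x_vals <- c(1.37095845, -0.56469817, 0.36312841, 0.63286260,
            0.40426832, -0.10612452, 1.51152200, -0.09465904,
            2.01842371, -0.06271410, 1.30486965, 2.28664539)
dat <- data.table(id = 1:12, group = rep(1:3, each = 4), x = x_vals)

# Sort dat by group once and mark it as keyed, so that later
# grouping operations can rely on the existing order.
setkey(dat, group)

D1 <- dat[x > 1.5, lapply(.SD, first), by = group]  # first match per group
D2 <- dat[, lapply(.SD, last), by = group]          # last row per group

# Keep every D1 row, plus the D2 rows for groups missing from D1 (anti-join).
res <- rbind(D1, D2[!D1, on = .(group)])
print(res)
```

On real data it would be worth timing this against the unkeyed version before drawing any conclusion.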

Comment: I used lapply(.SD, last) since .SD[.N] is not yet optimized and last(.SD) throws an error. I changed the OP's code to use lapply(.SD, first) for the sake of symmetry.

Another option is:

dat[x>1.5 | group!=shift(group, -1L), .SD[1L], .(group)]
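One caveat, as far as I can tell: the shift() version relies on dat being sorted by group, and for the very last row shift(group, -1L) is NA. If the last group has no x above the threshold, that row evaluates to NA in i, which data.table treats as FALSE, so the group is dropped. A possible variant, offered only as a sketch on made-up toy data, marks the last row of each group with duplicated(group, fromLast = TRUE) instead:

```r
library(data.table)

# Toy data (not from the question): group 3 has no x > 1.5
toy <- data.table(id = 1:6,
                  group = rep(1:3, each = 2),
                  x = c(0.1, 2, 1.6, 0.2, 0.3, 0.4))

# shift() version: the last row of toy gets NA in i and is not selected
res_shift <- toy[x > 1.5 | group != shift(group, -1L), .SD[1L], .(group)]

# Variant: !duplicated(group, fromLast = TRUE) is TRUE on the last row
# of every group, including the final one, and is never NA
res_dup <- toy[x > 1.5 | !duplicated(group, fromLast = TRUE), .SD[1L], .(group)]

print(res_shift)
print(res_dup)
```

Here res_shift contains only groups 1 and 2, while res_dup also returns the last row of group 3.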
