简体   繁体   English

对 data.table 中每个组中的随机行进行采样

[英]Sample random rows within each group in a data.table

How would you use data.table to efficiently take a sample of rows within each group in a data frame?您将如何使用 data.table 有效地对数据框中每个组中的行进行采样?

DT = data.table(a = sample(1:2), b = sample(1:1000,20))
DT
    a   b
 1: 2 562
 2: 1 183
 3: 2 180
 4: 1 874
 5: 2 533
 6: 1  21
 7: 2  57
 8: 1  20
 9: 2  39
10: 1 948
11: 2 799
12: 1 893
13: 2 993
14: 1  69
15: 2 906
16: 1 347
17: 2 969
18: 1 130
19: 2 118
20: 1 732

I was thinking of something like: DT[ , sample(??, 3), by = a] that would return a sample of three rows for each "a" (the order of the returned rows isn't significant):我在想这样的事情: DT[ , sample(??, 3), by = a]它将为每个“a”返回一个三行的样本(返回的行的顺序不重要):

    a   b
 1: 2 180
 2: 2  57
 3: 2 799
 4: 1  69
 5: 1 347
 6: 1 732

Maybe something like this?也许是这样的?

> DT[,.SD[sample(.N, min(3,.N))],by = a]
   a   b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343

(Thanks to Josh for the correction, below.) (感谢乔希的更正,如下。)

I believe joran's answer can be further generalized.我相信乔兰的回答可以进一步概括。 The details are here ( How do you sample groups in a data.table with a caveat ) but I believe this solution accounts for cases where there aren't "3" rows to sample from.详细信息在这里( How do you sample groups in a data.table with a警告)但我相信这个解决方案解释了没有“3”行可供采样的情况。

The current solution will error out when it tries to sample "x" times from rows that have less than "x" common values.当前解决方案在尝试从公共值小于“x”的行中采样“x”次时会出错。 In the below case, x=3.在下面的情况下,x=3。 And it takes into consideration this caveat.它考虑到了这个警告。 (Solution done by nrussell) (由 nrussell 完成的解决方案)

set.seed(123)
##
DT <- data.table(
  a=c(1,1,1,1:15,1,1), 
  b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
     a   b
 1:  1 288
 2:  1 881
 3:  1 409
 4:  2 937
 5:  3  46
 6:  4 525
 7:  5 887
 8:  6 548
 9:  7 453
10:  8 948
11:  9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15  42

Inspired by this answer by David Arenburg , another method to avoid the .SD allocation would be to sample the groups, then join back onto the original data using .EACHI受到David Arenburg 的回答的启发,另一种避免.SD分配的方法是对组进行采样,然后使用.EACHI连接回原始数据

DT[ DT[, sample(.N, 3), by=a], b[i.V1], on="a", by=.EACHI]

#    a  V1
# 1: 2  42
# 2: 2 498
# 3: 2 179
# 4: 1 469
# 5: 1  93
# 6: 1 898

where the DT[, sample(.N, 3), by=a] line gives us a sample for each group其中DT[, sample(.N, 3), by=a]行为我们提供了每个组的样本

#         a V1
# 1:      1  9
# 2:      1  3
# 3:      1  2
# 4:      2  4
# 5:      2  9
# ---          

so we can then use V1 to give us the b it corresponds to.所以我们可以使用V1给我们它对应的b

Stratified sampling > oversampling分层抽样>过抽样

size=don[y==1,.(strata=length(iden)),by=.(y,x)] # count of iden by strata   
table(don$x,don$y) 

don<-merge(don,size[,.(y,strata)],by="x") #merge strata values  
don_strata=don[,.SD[sample(.N,strata)],by=.(y,x)]

There are two subtle considerations that impact the answer to this question, and these are mentioned by Josh O'Brien and Valentin in comments.有两个微妙的考虑会影响这个问题的答案,Josh O'Brien 和 Valentin 在评论中提到了这些。 The first is that subsetting via .SD is very inefficient, and it is better to sample .I directly (see the benchmark below).首先是通过.SD进行子集.SD非常低效,最好直接对.I进行采样(参见下面的基准)。

The second consideration, if we do sample from .I , is that calling sample(.I, size = 1) leads to unexpected behavior when .I > 1 and length(.I) = 1 .第二个注意事项,如果我们从.I进行采样,则调用sample(.I, size = 1)会导致当.I > 1length(.I) = 1时出现意外行为。 In this case, sample() behaves as if we called sample(1:.I, size = 1) , which is surely not what we want.在这种情况下, sample()行为就像我们调用了sample(1:.I, size = 1) ,这肯定不是我们想要的。 As Valentin notes, it's better to use the construct .I[sample(.N, size = 1)] in this case.正如 Valentin 所指出的,在这种情况下最好使用构造.I[sample(.N, size = 1)]

As a benchmark, we build a simple 1,000 x 1 data.table and sample randomly per group.作为基准,我们构建了一个简单的 1,000 x 1 数据表并每组随机抽样。 Even with such a small data.table the .I method is roughly 20x faster.即使有这么小的 data.table, .I方法也快了大约 20 倍。

library(microbenchmark)
library(data.table)

set.seed(1L)
DT <- data.table(id = sample(1e3, 1e3, replace = TRUE))

microbenchmark(
  `.I` = DT[DT[, .I[sample(.N, 1)], by = id][[2]]],
  `.SD` = DT[, .SD[sample(.N, 1)], by = id]
)
#> Unit: milliseconds
#>  expr       min        lq     mean    median        uq       max neval
#>    .I  2.396166  2.588275  3.22504  2.794152  3.118135  19.73236   100
#>   .SD 55.798177 59.152000 63.72131 61.213650 64.205399 102.26781   100

Created on 2020-12-02 by the reprex package (v0.3.0)reprex 包(v0.3.0) 于 2020 年 12 月 2 日创建

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM