More efficient block resampling of R dataframe

I'm trying to resample an R dataframe in a clustered/blocked way. I'm doing so with the code snippet below, but it's quite slow:

    index_sample <- sample(unique(data[[cluster_var]]), 
                           size=length(unique(data[[cluster_var]])), replace=T)

    indices <- unlist(sapply(index_sample, FUN=function(x) {which(data[[cluster_var]] == x)}))

Is there a more efficient way to do this? The unlist/sapply step in particular seems very slow.

Example of desired behavior:

set.seed(1919)
data <- data.frame(x=sample(seq(1,5,1), 20, replace=TRUE),
                   y = runif(20))
index_sample <- sample(unique(data[['x']]), 
                       size=length(unique(data[['x']])), replace=T)
indices <- unlist(sapply(index_sample, FUN=function(x) {which(data[['x']] == x)}))
print(indices)
#[1]  7  8  9 10 14 17 20  7  8  9 10 14 17 20  1 12  2 18 19  6 11 13 16

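We can use outer to compare every row's cluster value against every sampled cluster in one vectorized call; which(..., arr.ind = TRUE) then returns the matching row numbers, column by column: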

indices2 <- which(outer(data$x, index_sample, FUN = `==`), arr.ind = TRUE)[,1]

Testing against the OP's solution:

identical(indices, indices2)
#[1] TRUE
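
Note that outer builds a logical matrix with one row per data row and one column per sampled cluster, so it may get memory-hungry on large data. A possible alternative, sketched here and not part of the original answer, is to group the row numbers by cluster once with split and then look up each sampled cluster by name. The names rows_by_cluster, indices3 and indices_ref are made up for this illustration, and the lookup assumes the cluster values convert cleanly to character keys (true for the integer-like values in the example):

```r
set.seed(1919)
data <- data.frame(x = sample(seq(1, 5, 1), 20, replace = TRUE),
                   y = runif(20))
clusters <- unique(data[["x"]])
index_sample <- sample(clusters, size = length(clusters), replace = TRUE)

# Group row numbers by cluster value once (hypothetical helper object).
rows_by_cluster <- split(seq_len(nrow(data)), data[["x"]])

# Look up each sampled cluster by name; a cluster drawn twice contributes
# its rows twice, in the order the clusters were sampled.
indices3 <- unlist(rows_by_cluster[as.character(index_sample)],
                   use.names = FALSE)

# Reference result using the approach from the question.
indices_ref <- unlist(lapply(index_sample,
                             function(v) which(data[["x"]] == v)))
stopifnot(identical(indices3, indices_ref))
```

The resampled dataframe would then be data[indices3, ]. Whether this is actually faster than outer depends on the number of rows and clusters, so it is worth benchmarking on the real data.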
