Resample groups of rows based on a grouping variable in R

I am relatively new to R, so apologies if this is a silly or obvious question! I am interested in creating a new data set composed of collections of rows resampled with replacement from a larger data set.

The data set I have looks something like this, with multiple rows per value of the grouping variable.

> df <- data.frame(value=c(1:5,1:4,1:3),ID=c(rep(1,5),rep(2,4),rep(3,3)))
> df
   value ID
1      1  1
2      2  1
3      3  1
4      4  1
5      5  1
6      1  2
7      2  2
8      3  2
9      4  2
10     1  3
11     2  3
12     3  3

What I'd like to do is create a new data set that is resampled (with replacement) based on the grouping variable. So a resampled data set might look something like this:

   value ID
1      1  1
2      2  1
3      3  1
4      4  1
5      5  1
6      1  3
7      2  3
8      3  3
9      1  1
10     2  1
11     3  1
12     4  1
13     5  1

Thanks for any suggestions!

To sample a different number of rows per ID value, you can try something like this (assuming ID has a small number of unique values):

# Sample row names within each ID (with replacement) and stack the picks.
result <- NULL
result <- rbind(result, df[sample(row.names(df[df$ID == 1, ]), 10, replace = TRUE), ])
result <- rbind(result, df[sample(row.names(df[df$ID == 2, ]), 5, replace = TRUE), ])
result <- rbind(result, df[sample(row.names(df[df$ID == 3, ]), 3, replace = TRUE), ])
# Renumber the rows 1..n (duplicated picks otherwise get names like "1.1").
row.names(result) <- seq_len(nrow(result))

If there are many ID values, you may want to use a loop over a vector holding the number of samples you want for each ID value. For example, if there are six ID values (numbered 1 to 6) and the corresponding numbers of samples are 10, 5, 3, 7, 8 and 2, you can do something like this:

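# nsamples[i] is the number of rows to draw for ID i.
# Assumes df contains every ID from 1 to length(nsamples).
nsamples <- c(10, 5, 3, 7, 8, 2)
result <- NULL
for (i in seq_along(nsamples)) {
  result <- rbind(result, df[sample(row.names(df[df$ID == i, ]), nsamples[i], replace = TRUE), ])
}
row.names(result) <- seq_len(nrow(result))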

With sample sizes of 10, 5 and 3 for IDs 1 to 3, either approach produces output like this:

   value ID
1      1  1
2      4  1
3      1  1
4      4  1
5      2  1
6      3  1
7      1  1
8      1  1
9      4  1
10     2  1
11     2  2
12     3  2
13     1  2
14     3  2
15     1  2
16     3  3
17     2  3
18     1  3
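
The same per-ID sampling can also be written without growing `result` inside a loop. What follows is a minimal sketch, not part of the answer above: it assumes the sample sizes are kept in a named vector, and the names `sizes`, `sample_rows`, `groups` and `picked` are illustrative. Naming the vector by ID means the IDs do not have to be the consecutive integers 1, 2, 3.

```r
df <- data.frame(value = c(1:5, 1:4, 1:3),
                 ID = c(rep(1, 5), rep(2, 4), rep(3, 3)))

# Hypothetical named vector: names are ID values, entries are sample sizes.
sizes <- c("1" = 10, "2" = 5, "3" = 3)

# Draw n row positions (with replacement) from one group's rows.
sample_rows <- function(rows, n) {
  rows[sample.int(length(rows), n, replace = TRUE)]
}

set.seed(1)  # only to make the draw reproducible
groups <- split(seq_len(nrow(df)), df$ID)   # row positions per ID
picked <- Map(sample_rows, groups[names(sizes)], sizes)
result <- df[unlist(picked, use.names = FALSE), ]
row.names(result) <- NULL                   # renumber rows 1..n
```

Indexing through `sample.int(length(rows), ...)` rather than calling `sample(rows, ...)` directly sidesteps the case where a group has a single row, since `sample()` treats a single number `x` as `1:x`.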

Building on the dplyr solution suggested above, you can also do something like this for a variable number of samples per ID value (it likewise requires pre-specifying the number of samples for each ID in a vector):

library(dplyr)
# nsamples[k] = rows to draw for ID k; ID[1] gives sample() a scalar size.
nsamples <- c(10, 5, 3)
df %>%
  group_by(ID) %>%
  slice(sample(n(), nsamples[ID[1]], replace = TRUE))
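
The question itself asks for whole groups to be resampled, rather than rows within a group. A minimal base R sketch of that reading follows, assuming each draw should copy every row of the chosen ID and that as many IDs are drawn as there are groups. The names `ids`, `drawn` and `resampled` are illustrative.

```r
df <- data.frame(value = c(1:5, 1:4, 1:3),
                 ID = c(rep(1, 5), rep(2, 4), rep(3, 3)))

set.seed(42)  # only to make the draw reproducible
ids <- unique(df$ID)

# Draw as many IDs as there are groups, with replacement.
# Indexing via sample.int avoids sample()'s 1:x behaviour when
# there is only a single numeric ID.
drawn <- ids[sample.int(length(ids), length(ids), replace = TRUE)]

# Stack all rows of each drawn ID, in the order drawn.
resampled <- do.call(rbind, lapply(drawn, function(id) df[df$ID == id, ]))
row.names(resampled) <- NULL  # renumber rows 1..n
```

Because the groups have different sizes, the number of rows in `resampled` varies from draw to draw, as in the 13-row example in the question.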
