R: Is there a clean way to obtain a single data frame of samples obtained in a loop?
I have a huge dataset containing observations about 1000 different entities. Each entity has an ID between 1 and 1000 and there are no missing IDs. Since the dataset has more than 1 million rows, I want to obtain a subset with 10 random observations for each entity so I can run some analysis.
The following code does the trick, but it looks cumbersome and its performance is poor.
library(dplyr) # sample_n is a dplyr function
samples <- sample_n(dataset[dataset$Entity == 1, ], 10)
for (x in 2:1000) {
  samples <- rbind(samples, sample_n(dataset[dataset$Entity == x, ], 10))
}
Could you please share some ideas for doing the same in a better fashion?
Thanks in advance!
I don't think you need a for loop when you are already using dplyr. The group_by command exists to do all the work of your for loop in a more efficient way.
A simple example is this:
library(dplyr)
dt = data.frame(mtcars)
dt %>% group_by(cyl) %>% sample_n(3)
This samples 3 rows for each cyl value.
So, think of cyl here as your ID. Something like
your_dataset %>% group_by(ID) %>% sample_n(10)
will do the job.
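For reference, here is a minimal end-to-end sketch of the grouped approach above. The `dataset` below is a made-up stand-in (1000 entities with 50 rows each), and the grouping column is called `Entity` to match the code in the question; both are assumptions for illustration. It uses `slice_sample()`, which superseded `sample_n()` in dplyr 1.0.0, so it assumes a reasonably recent dplyr.

```r
library(dplyr)

set.seed(42)  # only so the sketch is reproducible

# Hypothetical stand-in for the real data: 1000 entities, 50 rows each
dataset <- data.frame(
  Entity = rep(1:1000, each = 50),
  value  = rnorm(1000 * 50)
)

# Group by the entity column, then draw 10 random rows within each group
samples <- dataset %>%
  group_by(Entity) %>%
  slice_sample(n = 10) %>%
  ungroup()  # drop the grouping so later steps see a plain data frame
```

The result is one data frame with 10 rows per entity, built in a single pass instead of growing `samples` with `rbind` on every iteration.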
As an alternative to @AntoniosK's answer, why not consider using data.table, now that you have a large dataset? If your data is stored as a data.table in DT and you want to sample 10 observations for each ID, then
library(data.table)
DT[, .SD[sample(.N,10)], by = ID]
should give you a substantial speedup.
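A commonly used variant of the same idea collects the sampled row numbers with `.I` and then subsets `DT` once, rather than building a small table from `.SD` for every group. The sketch below assumes a made-up `DT` with an `ID` column and at least 10 rows per ID; whether it is actually faster on your data is something to measure rather than take for granted.

```r
library(data.table)

set.seed(42)  # only so the sketch is reproducible

# Hypothetical stand-in for the real data: 1000 IDs, 50 rows each
DT <- data.table(
  ID    = rep(1:1000, each = 50),
  value = rnorm(1000 * 50)
)

# Within each ID, .I holds that group's row numbers in DT;
# pick 10 of them at random and collect them in the V1 column
idx <- DT[, .I[sample(.N, 10)], by = ID]$V1

# Subset the original table once with all sampled row numbers
samples <- DT[idx]
```

Note that `sample(.N, 10)` raises an error for any group with fewer than 10 rows, so if that can happen in your data, `sample(.N, min(.N, 10))` is a safer choice.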