
R: Is there a clean way to obtain a single data frame of samples obtained in a loop?

I have a huge dataset containing observations about 1000 different entities. Each entity has an ID between 1 and 1000, and there are no missing IDs. Since the dataset has more than 1 million rows, I want to obtain a subset with 10 random observations for each entity to run some analysis.

The following code does the trick, but it looks cumbersome and its performance is poor.

library(dplyr) # sample_n is a dplyr function
samples <- sample_n(dataset[dataset$Entity == 1, ], 10)
for (x in 2:1000) {
  samples <- rbind(samples, sample_n(dataset[dataset$Entity == x, ], 10))
}

Could you please share some ideas for doing the same thing in a better way?

Thanks in advance!

I don't think you need a for loop when you are already using dplyr. The group_by command exists to do everything your for loop does, more efficiently.

A simple example would be this:

library(dplyr)

dt = data.frame(mtcars)

dt %>% group_by(cyl) %>% sample_n(3)

This samples 3 rows for each cyl value.

So, consider that cyl here plays the role of your ID. Something like

your_dataset %>% group_by(ID) %>% sample_n(10)

will do the job.
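To make the grouped approach concrete, here is a minimal sketch on made-up data. The `dataset` data frame, its `value` column, and the 50 rows per entity are assumptions for illustration only; the grouping column is called `Entity`, as in the question's code.

```r
library(dplyr)

set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the real data: 1000 entities, 50 rows each
dataset <- data.frame(
  Entity = rep(1:1000, each = 50),
  value  = rnorm(1000 * 50)
)

# One grouped call replaces the loop and the repeated rbind()
samples <- dataset %>%
  group_by(Entity) %>%
  sample_n(10) %>%
  ungroup()

# Every entity should contribute exactly 10 rows
all(count(samples, Entity)$n == 10)
```

In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), so `slice_sample(n = 10)` can be swapped in for `sample_n(10)`. One difference worth checking against your data: sample_n() raises an error when a group has fewer than 10 rows, while slice_sample() is documented to return the whole group in that case.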

As an alternative to @AntoniosK's answer, why not consider data.table, given that you have a large dataset? If your data is stored as a data.table named DT and you want to sample 10 observations for each ID, then

library(data.table)

DT[, .SD[sample(.N,10)], by = ID]

should give you a substantial speedup.
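One caveat with the data.table one-liner: sample(.N, 10) raises an error for any ID with fewer than 10 rows. The sketch below caps the sample size with min(.N, 10); the toy `DT`, its `value` column, and the varying group sizes are assumptions chosen only to exercise that case.

```r
library(data.table)

set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical data: ID k has k rows, so group sizes range from 1 to 20
DT <- data.table(ID = rep(1:20, times = 1:20))
DT[, value := rnorm(.N)]

# min(.N, 10) caps the sample size, so small groups are kept whole
# instead of triggering "cannot take a sample larger than the population"
samples <- DT[, .SD[sample(.N, min(.N, 10))], by = ID]

# Rows kept per ID: the group size for small groups, 10 otherwise
samples[, .N, by = ID]
```

If every ID is guaranteed to have at least 10 rows, as the question suggests, the original expression is sufficient and the cap is unnecessary.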
