R: Is there a clean way to get a single data frame from samples obtained in a loop?

Question

I have a huge dataset containing observations about 1000 different entities. Each entity has an ID between 1 and 1000, and there are no missing IDs. Since the dataset has more than 1 million rows, I want to obtain a subset with 10 random observations for each entity so I can run some analyses.

The following code does the trick, but it looks cumbersome and performs poorly.

library(dplyr) # sample_n is a dplyr function
samples <- sample_n(dataset[dataset$Entity == 1, ], 10)
for (x in 2:1000) {
  samples <- rbind(samples, sample_n(dataset[dataset$Entity == x, ], 10))
}

Could you please share some ideas for doing the same thing in a better way?

Thanks in advance!

Answer 1

You don't need a for loop when you are already using dplyr. The group_by function exists to do all the work of your for loop more efficiently.

A simple example:

library(dplyr)

dt = data.frame(mtcars)

dt %>% group_by(cyl) %>% sample_n(3)

This samples 3 rows for each cyl value.

So, treat cyl here as your ID. Something like

your_dataset %>% group_by(ID) %>% sample_n(10)

will do the job.
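To make this concrete for the question's data, here is a minimal sketch. The `dataset` built below is a made-up stand-in (1000 entities with 50 rows each, and a hypothetical `value` column); only the `Entity` column name comes from the question. The second call uses `slice_sample()`, which supersedes `sample_n()` as of dplyr 1.0.0, so it assumes a reasonably recent dplyr.

```r
library(dplyr)

set.seed(42)  # make the random sampling reproducible

# Hypothetical stand-in for the real dataset: 1000 entities, 50 rows each
dataset <- data.frame(
  Entity = rep(1:1000, each = 50),
  value  = rnorm(1000 * 50)
)

# One grouped call replaces the whole for loop and the repeated rbind()
samples <- dataset %>%
  group_by(Entity) %>%
  sample_n(10) %>%
  ungroup()

# Same idea with slice_sample(), assuming dplyr >= 1.0.0
samples2 <- dataset %>%
  group_by(Entity) %>%
  slice_sample(n = 10) %>%
  ungroup()
```

Note that `sample_n(10)` raises an error if any group has fewer than 10 rows, whereas `slice_sample(n = 10)` returns all the rows of such a group instead.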

Answer 2

As an alternative to @AntoniosK's answer, consider using data.table, since you have a large dataset. If your data is stored as a data.table named DT and you want to sample 10 observations for each ID, then

library(data.table)

DT[, .SD[sample(.N,10)], by = ID]

should give you a substantial speedup.
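A self-contained sketch of the same idea follows. The `DT` built here is invented test data (1000 IDs with 50 rows each, and a hypothetical `value` column), and the `min(.N, 10)` guard is an addition not in the original one-liner: it keeps the call from failing if some ID has fewer than 10 rows. The second variant, which samples row indices with `.I` and then subsets once, is a commonly used idiom that may be faster on wide tables; treat any speed difference as something to measure on your own data.

```r
library(data.table)

set.seed(42)  # make the random sampling reproducible

# Hypothetical stand-in data: 1000 IDs, 50 rows each
DT <- data.table(
  ID    = rep(1:1000, each = 50),
  value = rnorm(1000 * 50)
)

# Sample up to 10 rows per ID; min() guards against small groups
samples <- DT[, .SD[sample(.N, min(.N, 10))], by = ID]

# Variant: collect the sampled row numbers per ID, then subset DT once
idx <- DT[, .I[sample(.N, min(.N, 10))], by = ID]$V1
samples_idx <- DT[idx]
```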


Answer 1 (2, accepted): 2015-10-03 21:08:41

Answer 2 (2): 2015-10-03 21:15:47
