Sample n random rows per group in a dataframe with dplyr when some observations have fewer than n rows
I have a data frame with two categorical variables.
samples<-c("A","A","A","A","B","B")
groups<-c(1,1,1,2,1,1)
df<- data.frame(samples,groups)
df
samples groups
1 A 1
2 A 1
3 A 1
4 A 2
5 B 1
6 B 1
The result I would like is, for each observation (sample-group combination), to randomly downsample the data frame to a maximum of X rows (the randomness is important) and to keep all rows of any observation that appears fewer than X times. In this example X = 2. Is there an easy way to do this? The issue is that observation 4 (A, 2) appears only once, so dplyr's sample_n() would not work.
Desired output:
samples groups
1 A 1
2 A 1
3 A 2
4 B 1
5 B 1
For each group, you can sample the smaller of the group's row count and x:
library(dplyr)
x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))
# samples groups
# <chr> <dbl>
#1 A 1
#2 A 1
#3 A 2
#4 B 1
#5 B 1
However, note that sample_n() has been superseded in favor of slice_sample(), but n() doesn't work with slice_sample(). There is an open issue for it.
However, as @tmfmnk mentioned, we don't need to call n() here. Try:
df %>% group_by(samples, groups) %>% slice_sample(n = x)
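The reason n() is unnecessary is that slice_sample(), when sampling without replacement (the default), silently truncates n to the group size if the group has fewer rows. A minimal self-contained sketch to check this on the question's data, assuming dplyr 1.0.0 or later; the names res and the seed value are only illustrative:

```r
library(dplyr)

# Same toy data as in the question
df <- data.frame(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)
x <- 2

set.seed(42)  # arbitrary seed, only to make the random draw reproducible

res <- df %>%
  group_by(samples, groups) %>%
  slice_sample(n = x) %>%   # groups with fewer than x rows are kept whole
  ungroup()

# Rows per (samples, groups) combination after sampling:
# no group should exceed x, and the single (A, 2) row should survive
count(res, samples, groups)
```

Which two of the three (A, 1) rows are kept depends on the seed; the per-group counts do not.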
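One option with data.table (this assumes df has already been converted to a data.table, e.g. with setDT(df), and that X is defined, e.g. X <- 2):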
df[df[, .I[sample(.N, min(.N, X))], by = .(samples, groups)]$V1]
samples groups
1: A 1
2: A 1
3: A 2
4: B 1
5: B 1
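For readers who want to run the data.table approach from scratch, here is a self-contained sketch of the same idea. The names dt, idx and res_dt, and the seed, are illustrative and not part of the original answer:

```r
library(data.table)

# Same toy data as in the question, built directly as a data.table
dt <- data.table(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)
X <- 2

set.seed(42)  # arbitrary seed, only to make the random draw reproducible

# Within each group: draw up to X positions out of 1..N, then use .I to
# translate those within-group positions into row numbers of the full table.
idx <- dt[, .I[sample(.N, min(.N, X))], by = .(samples, groups)]$V1

# Subset the table by the collected row numbers
res_dt <- dt[idx]

# Rows per group after sampling: none should exceed X
res_dt[, .N, by = .(samples, groups)]
```

Sampling positions with sample(.N, ...) and then indexing .I, rather than calling sample(.I, ...) directly, sidesteps base R's behaviour where sample() on a single number k draws from 1:k instead of returning k itself, which would matter for one-row groups such as (A, 2).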