Sample n random rows per group in a data frame with dplyr when some observations have fewer than n rows

Question

I have a data frame with two categorical variables.

samples<-c("A","A","A","A","B","B")
groups<-c(1,1,1,2,1,1)
df<- data.frame(samples,groups)
df
  samples groups
1       A      1
2       A      1
3       A      1
4       A      2
5       B      1
6       B      1

For each observation (sample-group combination), I would like to downsample the data frame randomly (this is important) to a maximum of X rows, while keeping all rows of any combination that appears fewer than X times. In the example here X = 2. Is there an easy way to do this? The issue is that observation 4 (A, 2) appears only once, so dplyr's sample_n would not work.

Desired output

  samples groups
1       A      1
2       A      1
3       A      2
4       B      1
5       B      1

Answer 1

For each group, you can sample the smaller of the group's row count and x:

library(dplyr)

x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))

#  samples groups
#  <chr>    <dbl>
#1 A            1
#2 A            1
#3 A            2
#4 B            1
#5 B            1

However, note that sample_n() has been superseded in favor of slice_sample(), and n() doesn't work with slice_sample(). There is an open issue for it.

However, as @tmfmnk mentioned, we don't need to call n() here, because slice_sample() silently truncates n to the group size when a group has fewer rows. Try:

df %>% group_by(samples, groups) %>% slice_sample(n = x)
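For anyone who wants to check the truncation behaviour, here is a minimal self-contained sketch. It assumes dplyr 1.0.0 or later (where slice_sample() is available); the seed value and the name out are arbitrary choices for illustration, not part of the original answer.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)

x <- 2        # maximum number of rows to keep per (samples, groups) combination
set.seed(42)  # arbitrary seed, only so the random draw is repeatable

out <- df %>%
  group_by(samples, groups) %>%
  slice_sample(n = x) %>%  # groups with fewer than x rows are kept whole
  ungroup()

# Row counts per combination after sampling; none should exceed x
count(out, samples, groups)
```

Whichever rows are drawn from (A, 1), the counts should come out as 2 for (A, 1), 1 for (A, 2) and 2 for (B, 1), so the single (A, 2) row is always retained.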

Answer 2

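One option with data.table, assuming library(data.table) is loaded, df has been converted with setDT(df), and X holds the per-group cap (2 here):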

df[df[, .I[sample(.N, min(.N, X))], by = .(samples, groups)]$V1]

   samples groups
1:       A      1
2:       A      1
3:       A      2
4:       B      1
5:       B      1
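The same idea as a self-contained sketch that can be pasted into a fresh session. The names dt, idx and out and the seed are assumptions made for illustration; the sampling expression itself is the one from the answer.

```r
library(data.table)

# Example data from the question, built directly as a data.table
dt <- data.table(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)

X <- 2        # maximum number of rows to keep per (samples, groups) combination
set.seed(42)  # arbitrary seed, only so the random draw is repeatable

# .I holds each row's position in dt and .N is the group size.
# Within each group, draw min(.N, X) positions out of 1..N and use
# them to index .I, which yields the row numbers to keep.
idx <- dt[, .I[sample(.N, min(.N, X))], by = .(samples, groups)]$V1
out <- dt[idx]

# Row counts per combination after sampling; none should exceed X
out[, .N, by = .(samples, groups)]
```

Indexing .I by sample(.N, ...) rather than calling sample(.I, ...) directly looks deliberate: when a group has a single row, sample() given one number k draws from 1:k instead of returning k, which could pick a row from a different group.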
