Sampling by group using dplyr's sample_n function

Question

According to the dplyr help file, the sample_n function samples a fixed number of rows per group.

When I run the following code I expect two samples per tobgp and alcgp combination, so 32 (4*4*2) lines in total. However, only two lines are returned.

by_tobgp_alcgp <- esoph %>% group_by(tobgp,alcgp)

sample_n(by_tobgp_alcgp , 2)


# Source: local data frame [2 x 5]
# Groups: tobgp, alcgp
# 
#    agegp     alcgp tobgp ncases ncontrols
# 10 25-34    80-119 10-19      0         1
# 50 55-64 0-39g/day   30+      4         6

Is this correct? Is there an alternative way to achieve this using dplyr?

Answer 1

The issue that @Henrik described has been closed.

sample_n works on grouped data frames in dplyr versions >= 0.3:

library(dplyr)
data(esoph)

set.seed(123)

esoph %>%
  group_by(tobgp, alcgp) %>%
  sample_n(2)

#Source: local data frame [32 x 5]
#Groups: tobgp, alcgp [16]
#
#agegp     alcgp    tobgp ncases ncontrols
#(fctr)    (fctr)   (fctr)  (dbl)     (dbl)
#1   65-74 0-39g/day 0-9g/day      5        48
#2   25-34 0-39g/day 0-9g/day      0        40
#3   45-54     40-79 0-9g/day      6        38
#4     75+     40-79 0-9g/day      2         5
#5   55-64    80-119 0-9g/day      9        18
#6   35-44    80-119 0-9g/day      0        11
#7   45-54      120+ 0-9g/day      4         4
#8   65-74      120+ 0-9g/day      3         4
#9   45-54 0-39g/day    10-19      0        18
#10  65-74 0-39g/day    10-19      4        14
#11    75+     40-79    10-19      1         3
#12  55-64     40-79    10-19      6        21
#13  45-54    80-119    10-19      6        14
#14  25-34    80-119    10-19      0         1
#15    75+      120+    10-19      1         1
#16  35-44      120+    10-19      0         3
#17  25-34 0-39g/day    20-29      0         6
#18  55-64 0-39g/day    20-29      3        12
#19  65-74     40-79    20-29      5         9
#20  25-34     40-79    20-29      0         4
#21  55-64    80-119    20-29      3         6
#22  65-74    80-119    20-29      2         3
#23  45-54      120+    20-29      2         3
#24  35-44      120+    20-29      2         4
#25  55-64 0-39g/day      30+      4         6
#26  35-44 0-39g/day      30+      0         8
#27  35-44     40-79      30+      0         8
#28  25-34     40-79      30+      0         7
#29  35-44    80-119      30+      0         1
#30  55-64    80-119      30+      4         4
#31  25-34      120+      30+      0         2
#32  65-74      120+      30+      1         1
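
The per-group count in the output above can also be checked programmatically. The following is a minimal sketch, not part of the original answer, and it assumes dplyr >= 1.0.0, where sample_n is documented as superseded by slice_sample. The seed value and the names `sampled` and `per_group` are arbitrary choices made for illustration.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, only so the draw is reproducible

# Draw 2 rows from each tobgp/alcgp combination.
# Assumption: dplyr >= 1.0.0, which provides slice_sample().
sampled <- esoph %>%
  group_by(tobgp, alcgp) %>%
  slice_sample(n = 2) %>%
  ungroup()

# One row per (tobgp, alcgp) combination; n is the number of sampled rows in it
per_group <- count(sampled, tobgp, alcgp)

nrow(sampled)          # total number of sampled rows
all(per_group$n == 2)  # TRUE if every combination contributed exactly 2 rows
```

If every combination contributes two rows, `nrow(sampled)` equals the 32 rows shown above and `per_group` has 16 rows, one per group.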

Answer 2

You can work around this issue with do:

library(dplyr)
data(esoph)

esoph %>%
  group_by(tobgp, alcgp) %>%
  do(sample_n(., 2))

#   agegp     alcgp    tobgp ncases ncontrols
#1    75+ 0-39g/day 0-9g/day      1        18
#2  35-44 0-39g/day 0-9g/day      0        60
#3  55-64     40-79 0-9g/day      9        40
#4    75+     40-79 0-9g/day      2         5
#5  65-74    80-119 0-9g/day      6        13
#6  55-64    80-119 0-9g/day      9        18
#7  65-74      120+ 0-9g/day      3         4
#8  25-34      120+ 0-9g/day      0         1
#9  25-34 0-39g/day    10-19      0        10
#10 35-44 0-39g/day    10-19      1        14
#11 65-74     40-79    10-19      3        10
#12 55-64     40-79    10-19      6        21
#13 55-64    80-119    10-19      8        15
#14 35-44    80-119    10-19      0         6
#15 25-34      120+    10-19      1         1
#16 35-44      120+    10-19      0         3
#17 25-34 0-39g/day    20-29      0         6
#18 35-44 0-39g/day    20-29      0         7
#19 45-54     40-79    20-29      5        15
#20   75+     40-79    20-29      0         3
#21 65-74    80-119    20-29      2         3
#22 45-54    80-119    20-29      1         5
#23 55-64      120+    20-29      2         3
#24 45-54      120+    20-29      2         3
#25 25-34 0-39g/day      30+      0         5
#26 55-64 0-39g/day      30+      4         6
#27 25-34     40-79      30+      0         7
#28   75+     40-79      30+      1         1
#29 55-64    80-119      30+      4         4
#30 35-44    80-119      30+      0         1
#31 55-64      120+      30+      5         6
#32 25-34      120+      30+      0         2

Edit after comment:

For grouped data, do applies a function to each group, which is what is wanted in this case (draw a sample of size 2 from each group). So while sample_n doesn't work on grouped data (I'm not sure whether it is supposed to, but I would guess it should), using do(sample_n(data, n)) does the job as desired.
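
To make the "apply a function to each group" idea concrete, here is a rough base R sketch of the same split, sample, and recombine pattern. It only illustrates the concept and is not how do is implemented internally; the names `groups`, `picked`, and `result` are hypothetical.

```r
set.seed(123)  # arbitrary seed, only so the draw is reproducible

# Split the row indices of esoph by tobgp/alcgp combination,
# dropping combinations that do not occur in the data
groups <- split(seq_len(nrow(esoph)),
                list(esoph$tobgp, esoph$alcgp),
                drop = TRUE)

# For each group, pick 2 of its row indices at random (without replacement).
# sample.int() on the group length avoids the sample() pitfall for length-1 vectors.
picked <- lapply(groups, function(idx) idx[sample.int(length(idx), 2)])

# Recombine the picked rows into a single data frame
result <- esoph[unlist(picked, use.names = FALSE), ]

nrow(result)  # 2 rows for each of the groups
```

Unlike the dplyr version, this returns a plain data frame with no grouping attached, and it would raise an error for any group with fewer than 2 rows.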

Answer 1
6, accepted, 2016-06-21 23:28:22

Answer 2
5, 2014-06-14 19:58:16
