Stratified sampling of two groups
I have this table:
+-------+---------+
| group | n_purch |
+-------+---------+
| A | 39 |
| B | 30 |
| B | 39 |
| B | 56 |
| A | 38 |
| B | 19 |
| A | 55 |
| B | 11 |
......
The table has 7 million rows.
A -> 20% of 7 million
B -> 80% of 7 million
I would like to draw a stratified (proportional) sample, but I don't know how to do it.
I work with R and SQL.
Here is a base R option using Map + split:
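n_small <- 20  # total sample size
dfout <- do.call(
  rbind,
  Map(
    # draw ns rows from one stratum; replace = TRUE samples with
    # replacement, so the same row can be drawn more than once
    function(dfs, ns) dfs[sample(seq_len(nrow(dfs)), ns, replace = TRUE), ],
    split(df, df$group),      # one data frame per group, in the order A, B
    c(0.2, 0.8) * n_small     # rows per stratum: 20% from A, 80% from B
  )
)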
which gives
> dfout
group n_purch
A.24 A 371
A.1 A 582
A.21 A 718
A.33 A 843
B.17 B 642
B.41 B 110
B.27 B 326
B.18 B 45
B.48 B 733
B.44 B 29
B.47 B 918
B.42 B 84
B.40 B 791
B.31 B 616
B.30 B 838
B.18.1 B 45
B.44.1 B 29
B.5 B 537
B.30.1 B 838
B.2 B 121
Dummy Data
set.seed(1)
df <- data.frame(
group = sample(c("A", "B"), 50, replace = TRUE),
n_purch = sample(1000, 50)
)
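If duplicated rows in the sample are not wanted, the same split-and-sample idea can be sketched without replacement, taking the same fraction from every group so that the group shares carry over from the data instead of being hard-coded. This is only a sketch: `stratified_sample`, `strata_col` and `frac` are hypothetical names that do not appear in the answer above, the dummy data here is generated separately with a 20/80 split, and the approach has not been benchmarked on 7 million rows.

```r
# Hypothetical helper (not part of the original answer): proportional
# stratified sampling without replacement.
stratified_sample <- function(data, strata_col, frac) {
  # one data frame per stratum
  groups <- split(data, data[[strata_col]])
  sampled <- lapply(groups, function(g) {
    k <- round(nrow(g) * frac)  # number of rows to draw from this stratum
    # sample.int() draws without replacement by default
    g[sample.int(nrow(g), k), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

# Illustrative data with roughly 20% A and 80% B (an assumption for the demo)
set.seed(1)
df <- data.frame(
  group   = sample(c("A", "B"), 1000, replace = TRUE, prob = c(0.2, 0.8)),
  n_purch = sample(1000, 1000, replace = TRUE)
)

out <- stratified_sample(df, "group", frac = 0.1)

# The group shares in the sample should be close to those in the full data
prop.table(table(df$group))
prop.table(table(out$group))
```

Because every stratum is sampled at the same rate, the sample is proportional by construction; the only deviation comes from rounding the per-group counts.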