
Stratified sampling of two groups

I have this table:

+-------+---------+
| group | n_purch |
+-------+---------+
| A     | 39      |
| B     | 30      |
| B     | 39      |
| B     | 56      |
| A     | 38      |
| B     | 19      |
| A     | 55      |
| B     | 11      |
......

The table has 7 million rows.

A -> 20% of 7 million
B -> 80% of 7 million

I would like to draw a stratified (proportional) sample, but I don't know how to do it.

I work with R and SQL.

Here is a base R option using Map + split:

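n_small <- 20  # total sample size
dfout <- do.call(
  rbind,
  Map(
    # draw ns rows from each group; replace = TRUE allows duplicate rows
    function(dfs, ns) dfs[sample(seq_len(nrow(dfs)), ns, replace = TRUE), ],
    split(df, df$group),   # list of per-group data frames, ordered A, B
    c(0.2, 0.8) * n_small  # per-group sample sizes, in the same A, B order
  )
)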

which gives

> dfout
       group n_purch
A.24       A     371
A.1        A     582
A.21       A     718
A.33       A     843
B.17       B     642
B.41       B     110
B.27       B     326
B.18       B      45
B.48       B     733
B.44       B      29
B.47       B     918
B.42       B      84
B.40       B     791
B.31       B     616
B.30       B     838
B.18.1     B      45
B.44.1     B      29
B.5        B     537
B.30.1     B     838
B.2        B     121
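
Rows such as B.18.1 and B.44.1 are duplicates, because the code above samples with replacement and hard-codes the 20%/80% split. If duplicates are not wanted, a variant along the following lines may help. It is only a sketch: `stratified_sample` is a hypothetical helper name, the group shares are computed from the data instead of being fixed, and because each group size is rounded, the total can differ slightly from the requested size.

```r
# Hypothetical helper: proportional stratified sample WITHOUT replacement.
stratified_sample <- function(df, strata_col, n_total) {
  # Row indices of df, grouped by stratum
  idx_by_group <- split(seq_len(nrow(df)), df[[strata_col]])
  # Proportional allocation: each stratum's share of n_total, rounded
  sizes <- round(n_total * lengths(idx_by_group) / nrow(df))
  # Draw the allocated number of distinct row indices from each stratum
  picked <- unlist(Map(
    function(idx, k) idx[sample.int(length(idx), min(k, length(idx)))],
    idx_by_group, sizes
  ), use.names = FALSE)
  df[picked, , drop = FALSE]
}

# Same dummy data as in the answer
set.seed(1)
df <- data.frame(
  group = sample(c("A", "B"), 50, replace = TRUE),
  n_purch = sample(1000, 50)
)

out <- stratified_sample(df, "group", 20)
table(out$group)  # per-group counts follow the group shares in df
```

Working on row indices rather than on split copies of the data frame also avoids the `rbind` step, which may matter for a table with millions of rows.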

Dummy data

set.seed(1)
df <- data.frame(
  group = sample(c("A", "B"), 50, replace = TRUE),
  n_purch = sample(1000, 50)
)
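
For a table closer to the 7 million rows in the question, sampling the same fraction from every group gives a proportional sample without splitting the data frame at all. The sketch below is one possible approach, not a benchmarked one: the 1000-row data set, the 10% fraction `frac` and the variable names are assumptions made for illustration.

```r
# Illustrative data: roughly 20% A and 80% B (assumed, not from the question)
set.seed(42)
big <- data.frame(
  group = sample(c("A", "B"), 1000, replace = TRUE, prob = c(0.2, 0.8)),
  n_purch = sample(100, 1000, replace = TRUE)
)

frac <- 0.1  # fraction to keep from every group

# One uniform random number per row, then its rank within the row's group
u <- runif(nrow(big))
rank_in_group <- ave(u, big$group, FUN = rank)
# Size of the group each row belongs to
group_size <- ave(u, big$group, FUN = length)

# Keep the rows with the smallest random numbers in each group
keep <- rank_in_group <= round(frac * group_size)
big_sample <- big[keep, ]

table(big_sample$group)  # each count is round(frac * group size)
```

Because every group contributes the same fraction, the group shares in `big_sample` match those of the full table up to rounding.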
