Sampling with weights using the sample_n() function

Question

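All,

I have a dplyr sample_n() question. I'm trying to sample with replacement while using the weight option, but I seem to be hitting a snag. Namely, sampling with replacement consistently oversamples one group. I don't run into this when sampling without replacement, but I'd really like to sample with replacement if I can.

Here's a minimal working example that uses the familiar apistrat and apipop data from the survey package. Survey researchers who use R know these data well. In the population data (apipop), elementary schools (stype == "E") are about 71.4% of all schools, middle schools (stype == "M") are about 16.4%, and high schools (stype == "H") are about 12.2%. apistrat has a deliberate imbalance: elementary schools are 50% of the data, while middle schools and high schools each make up 25% of the 200-row sample.

What I'd like to do is sample the apistrat data with replacement using the sample_n() function. However, I seem to be consistently oversampling elementary schools and undersampling middle and high schools. Here's a minimal working example in R code. Please forgive my cornball loop code. I know I need to get better at purrr, but I'm not there yet. :P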

library(survey)
library(tidyverse)

data(api)  # loads apistrat and apipop from the survey package

apistrat %>% as_tibble() -> strat
apipop %>% as_tibble() -> pop

pop %>%
  group_by(stype) %>% 
  summarize(prop = n()/6194) -> Census

Census
# p(E) = ~.714
# p(H) = ~.122
# p(M) = ~.164

strat %>%
  left_join(., Census) -> strat

# Sampling with replacement seems to consistently oversample E and undersample H and M.
with_replace <- tibble()
set.seed(8675309) # Jenny, I got your number...

for (i in 1:1000) {
  strat %>%
    sample_n(100, replace = TRUE, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  with_replace <- bind_rows(with_replace, hold_this)
}

# group_by means with 95% intervals
with_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))

# ^ consistently oversampled E.
# meanprop of E = ~.835.
# meanprop of H = ~.070 and meanprop of M = ~.095
# 95% intervals don't include true probability for either E, H, or M.

# Sampling without replacement doesn't seem to have this same kind of sampling problem.
wo_replace <- tibble()
set.seed(8675309)  # Jenny, I got your number...

for (i in 1:1000) {
  strat %>%
    sample_n(100, replace = FALSE, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  wo_replace <- bind_rows(wo_replace, hold_this)

}

# group_by means with 95% intervals
wo_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))


# ^ much closer to the true probabilities
# meanprop of E = ~.757. meanprop of H = ~.106. meanprop of M = ~.137
# 95% intervals include the true probabilities as well.

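I'm not sure whether this is a dplyr (v. 0.8.3) issue. The 95% intervals for sampling with replacement don't include the true probabilities, and every sample (should you peek at them) consistently lands in the mid-.80s for elementary schools. Of the 1,000 samples drawn with replacement, only 3 have an elementary school share below 72% of the 100-row sample. It's that consistent. I'm curious whether anyone here has insight into what's happening, what I might be doing wrong, or whether I'm misunderstanding what sample_n() does.

Thanks in advance.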

Answer 1

sample_n() in dplyr is a wrapper for base::sample.int(). Looking at base::sample.int(), the actual function is implemented in C. We can see that the behaviour comes from the source:

rows <- sample(nrow(strat), size = 100, replace = FALSE, prob = strat$prop)
strat[rows, ] %>% count(stype)
# A tibble: 3 x 2
#   stype     n
#   <fct> <int>
# 1 E        74
# 2 H        14
# 3 M        12

rows <- sample(nrow(strat), size = 100, replace = TRUE, prob = strat$prop)
strat[rows, ] %>% count(stype)
# A tibble: 3 x 2
#   stype     n
#   <fct> <int>
# 1 E        85
# 2 H         8
# 3 M         7

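Honestly, I'm not entirely sure why this happens, but if you make the probabilities sum to 1 and keep them uniform within each group, you get the expected sample sizes: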

library(tidyverse)
library(survey)

data(api)

apistrat %>% tbl_df() -> strat
apipop %>% tbl_df() -> pop

pop %>%
  group_by(stype) %>% 
  summarize(prop = n()/6194) -> Census


strat %>%
  left_join(., Census) -> strat
#> Joining, by = "stype"

set.seed(8675309) # Jenny, I got your number...
with_replace <- tibble()

for (i in 1:1000) {
  strat %>%
    group_by(stype) %>%
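    # Spread each group's share evenly over its rows so the group's weights sum to prop.
    # prop/n() is constant within a group, so sample() only shuffles identical values.
    mutate(per_prob = sample(prop/n())) %>% 
    ungroup() %>% 
    sample_n(100, replace = TRUE, weight = per_prob) %>%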
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  with_replace <- bind_rows(with_replace, hold_this)

}

with_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))
#> # A tibble: 3 x 4
#>   stype meanprop   lwr   upr
#>   <fct>    <dbl> <dbl> <dbl>
#> 1 E        0.713  0.63  0.79
#> 2 H        0.123  0.06  0.19
#> 3 M        0.164  0.09  0.24

Created on 2020-04-17 by the reprex package (v0.3.0)

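My guess is that this has to do with the entries in the probability vector p not being reduced when replace = TRUE, but I really don't know what's happening under the hood. Someone with C knowledge should take a look!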
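One possible explanation, offered as a sketch rather than a confirmed reading of the C source: the weights are applied per row, not per group. With weight = prop, each of the 100 elementary rows carries the full 0.714, so the group's total weight is its share multiplied by its row count. The base R snippet below works through that arithmetic with the rounded shares quoted in the question and the 100/50/50 row counts of apistrat. The helper name group_share and all variable names are made up for illustration.

```r
# Assumed inputs: row counts of apistrat and the rounded population shares
# quoted in the question (not recomputed from apipop here).
n_rows    <- c(E = 100, H = 50, M = 50)
pop_share <- c(E = 0.714, H = 0.122, M = 0.164)
group     <- rep(names(n_rows), times = n_rows)

# Hypothetical helper: fraction of the total weight held by each group.
group_share <- function(w, g) {
  totals <- vapply(split(w, g), sum, numeric(1))
  totals / sum(totals)
}

# weight = prop: every row carries its group's full share.
w_naive <- rep(pop_share, times = n_rows)
# weight = prop / n(): each group's share is split evenly across its rows.
w_fixed <- rep(pop_share / n_rows, times = n_rows)

expected_naive <- group_share(w_naive, group)
expected_fixed <- group_share(w_fixed, group)

# Check the arithmetic against sample.int() with replacement.
set.seed(8675309)
draws <- sample.int(length(group), size = 1e5, replace = TRUE, prob = w_naive)
simulated_naive <- group_share(rep(1, length(draws)), group[draws])

print(round(rbind(expected_naive, simulated_naive, expected_fixed), 3))
```

Under these assumptions the per-row weights imply about 0.833 for E, 0.071 for H, and 0.096 for M, which lines up with the ~.835 / ~.070 / ~.095 reported in the question, while the per-row-normalized weights recover 0.714 / 0.122 / 0.164. As for sampling without replacement, ?sample notes that the probabilities are then applied sequentially among the remaining items, so heavily weighted E rows get used up as the draw proceeds. That may be why those results sat closer to the population shares, though the match looks incidental rather than guaranteed.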

Answer 1
1 · Accepted · 2020-04-18 00:36:17