Using weights for sampling with replacement with the sample_n() function
All,
I have a dplyr sample_n() question. I'm trying to sample with replacement while using the weight option, and I seem to be hitting a snag. Namely, sampling with replacement consistently oversamples one group. It's not a problem I get when sampling without replacement, but I'd really like to sample with replacement if I could.
Here's a minimal working example that uses the familiar apistrat and apipop data from the survey package. Survey researchers in R know these data well. In the population data (apipop), elementary schools (stype == E) account for about 71.4% of all schools. High schools (stype == H) are about 12.2% of all schools and middle schools (stype == M) are about 16.4%. The apistrat data has a deliberate imbalance: elementary schools are 50% of the 200-row sample, while middle schools and high schools are 25% each.
What I'd like to do is sample the apistrat data, with replacement, using the sample_n() function. However, I seem to be consistently oversampling the elementary schools and undersampling the middle schools and high schools. Here's a minimal working example in R code. Please forgive my cornball looping code. I know I need to get better at purrr but I'm not quite there yet. :P
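library(survey)
library(tidyverse)
data(api) # loads apistrat and apipop from the survey package
apistrat %>% as_tibble() -> strat
apipop %>% as_tibble() -> pop
# Share of each school type in the population (6,194 schools in apipop)
pop %>%
  group_by(stype) %>%
  summarize(prop = n()/6194) -> Census
Census
# p(E) = ~.714
# p(H) = ~.122
# p(M) = ~.164
# Attach the population share to every row of the stratified sample
strat %>%
  left_join(., Census) -> strat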
# Sampling with replacement seems to consistently oversample E and undersample H and M.
with_replace <- tibble()
set.seed(8675309) # Jenny, I got your number...
for (i in 1:1000) {
  strat %>%
    sample_n(100, replace = TRUE, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  with_replace <- bind_rows(with_replace, hold_this)
}
# group_by means with 95% intervals
with_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))
# ^ consistently oversampled E.
# meanprop of E = ~.835.
# meanprop of H = ~.070 and meanprop of M = ~.095
# 95% intervals don't include true probability for either E, H, or M.
# Sampling without replacement doesn't seem to have this same kind of sampling problem.
wo_replace <- tibble()
set.seed(8675309) # Jenny, I got your number...
for (i in 1:1000) {
  strat %>%
    sample_n(100, replace = FALSE, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  wo_replace <- bind_rows(wo_replace, hold_this)
}
# group_by means with 95% intervals
wo_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))
# ^ better in orbit of the true probability
# meanprop of E = ~.757. meanprop of H = ~.106. meanprop of M = ~.137
# 95% intervals include true probability as well.
I'm not sure if this is a dplyr (v. 0.8.3) problem. The 95% intervals for sampling with replacement don't include the true probability, and each sample (were you to peek at them) is consistently in that mid-.80s range for the elementary schools. Only three of the 1,000 samples (with replacement) had a composition where elementary schools were fewer than 72% of the 100-row sample. It's that consistent. I'm curious if anyone here has any insight into what's happening, or what I might be doing wrong, or whether I'm misinterpreting the functionality of sample_n().
Thanks in advance.
The sample_n() function in dplyr is a wrapper for base::sample.int(). Looking at base::sample.int(), the actual function is implemented in C. And we can see that the problem comes from the source, since base sample() reproduces it:
rows <- sample(nrow(strat), size = 100, replace = FALSE, prob = strat$prop)
strat[rows, ] %>% count(stype)
# A tibble: 3 x 2
  stype     n
  <fct> <int>
1 E        74
2 H        14
3 M        12
rows <- sample(nrow(strat), size = 100, replace = TRUE, prob = strat$prop)
strat[rows, ] %>% count(stype)
# A tibble: 3 x 2
  stype     n
  <fct> <int>
1 E        85
2 H         8
3 M         7
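To check that these two draws are not flukes, here is a sketch that repeats them 1,000 times with base R only. The stype vector is a synthetic stand-in for apistrat (100 E, 50 H, 50 M rows), and the names n_rows, pop_share, and mean_share are mine, not from the original code; the population counts are the apipop stratum sizes.

```r
# Synthetic stand-in for apistrat: 100 E, 50 H, and 50 M rows (an assumption
# so the sketch runs without the survey package).
n_rows <- c(E = 100, H = 50, M = 50)
stype  <- rep(names(n_rows), times = n_rows)

# Population shares from the apipop stratum counts (6,194 schools in total).
pop_share <- c(E = 4421, H = 755, M = 1018) / 6194
prop      <- unname(pop_share[stype])  # same weight on every row of a stratum

# Mean share of each stratum over `reps` weighted draws of `size` rows.
mean_share <- function(replace, reps = 1000, size = 100) {
  draws <- replicate(reps, {
    rows <- sample.int(length(stype), size = size, replace = replace, prob = prop)
    tabulate(match(stype[rows], names(n_rows)), nbins = length(n_rows)) / size
  })
  setNames(rowMeans(draws), names(n_rows))
}

set.seed(8675309)
with_repl    <- mean_share(replace = TRUE)
without_repl <- mean_share(replace = FALSE)
round(rbind(with_repl, without_repl), 3)
```

With replacement, the elementary share should land in the low-to-mid 0.80s, and without replacement somewhat lower, in line with the single draws above.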
I'm honestly not totally sure why this is the case, but if you spread each group's probability evenly over its rows, so that the per-row probabilities sum to 1, then you get the expected sample proportions:
library(tidyverse)
library(survey)
data(api)
apistrat %>% as_tibble() -> strat
apipop %>% as_tibble() -> pop
pop %>%
  group_by(stype) %>%
  summarize(prop = n()/6194) -> Census
strat %>%
  left_join(., Census) -> strat
#> Joining, by = "stype"
set.seed(8675309) # Jenny, I got your number...
with_replace <- tibble()
for (i in 1:1000) {
  strat %>%
    group_by(stype) %>%
    # prop/n() is constant within a group, so sample() only shuffles identical values
    mutate(per_prob = sample(prop/n())) %>%
    ungroup() %>%
    sample_n(100, replace = TRUE, weight = per_prob) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  with_replace <- bind_rows(with_replace, hold_this)
}
with_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))
#> # A tibble: 3 x 4
#>   stype meanprop   lwr   upr
#>   <fct>    <dbl> <dbl> <dbl>
#> 1 E        0.713  0.63  0.79
#> 2 H        0.123  0.06  0.19
#> 3 M        0.164  0.09  0.24
Created on 2020-04-17 by the reprex package (v0.3.0)
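The same check can be run without the survey data. This is a minimal sketch on a synthetic stand-in for apistrat (100 E, 50 H, 50 M rows), using base sample.int() in place of sample_n(); the names n_rows, pop_share, per_prob, and mean_share are mine, and the population counts are the apipop stratum sizes.

```r
# Synthetic stand-in for apistrat (an assumption so this runs without survey).
n_rows <- c(E = 100, H = 50, M = 50)
stype  <- rep(names(n_rows), times = n_rows)

# Population shares from the apipop stratum counts.
pop_share <- c(E = 4421, H = 755, M = 1018) / 6194

# Per-row weight: the stratum's population share split evenly over its rows.
per_prob <- unname(pop_share[stype] / n_rows[stype])

set.seed(8675309)
draws <- replicate(1000, {
  rows <- sample.int(length(stype), size = 100, replace = TRUE, prob = per_prob)
  tabulate(match(stype[rows], names(n_rows)), nbins = length(n_rows)) / 100
})
mean_share <- setNames(rowMeans(draws), names(n_rows))
round(mean_share, 3)
```

The mean shares should sit close to the population shares, as in the reprex output above.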
I'm guessing that this has something to do with the entries in the probability vector not being diminished when replace = TRUE, but really I have no idea what's going on under the hood. Someone with C knowledge should take a look!
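One reading that needs no C knowledge, offered as a sketch rather than a confirmed diagnosis: the weights in sample() and sample_n() apply to individual rows, not to groups. With replacement, each draw picks a row with probability weight / sum(weights), so a stratum's expected share is its row count times its per-row weight, renormalized. Elementary schools carry the largest weight and also have twice as many rows in apistrat, which would explain a share near 0.83. Without replacement, drawn rows leave the pool, so the elementary rows deplete and their share is pulled back down, which may be why those results looked closer to the truth. The row counts and population counts below are the apistrat and apipop stratum sizes; the variable names are mine.

```r
# Rows per stratum in apistrat and population shares from apipop counts.
n_rows    <- c(E = 100, H = 50, M = 50)
pop_share <- c(E = 4421, H = 755, M = 1018) / 6194

# Expected share per stratum when sampling rows with replacement:
# (rows in stratum * per-row weight), renormalized to sum to 1.
expected_share <- function(per_row_weight) {
  mass <- n_rows * per_row_weight
  mass / sum(mass)
}

share_raw   <- expected_share(pop_share)           # weight = prop on every row
share_fixed <- expected_share(pop_share / n_rows)  # weight = prop / rows in stratum

round(share_raw, 3)    # E about 0.833, H about 0.071, M about 0.096
round(share_fixed, 3)  # E about 0.714, H about 0.122, M about 0.164
```

The first line of output is close to the oversampled proportions reported in the question, and the second matches the population shares, which is consistent with dividing prop by the group size being the step that matters.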