purrr; sample from multiple columns with probability list
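Say I want to draw samples of variable length from an arbitrary number of different probability distributions, using a separate set of sampling weights for each draw.
It seems like I should be able to do this with purrr's map functions, but I'm struggling...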
library(tidyverse)
set.seed(20171127)
# sample from 5 different probability distributions
dists <- tibble(
samp_distA = round(rnorm(n=1000, mean=17, sd=4)),
samp_distB = round(rnorm(n=1000, mean=13, sd=4)),
samp_distC = round(rnorm(n=1000, mean=13, sd=4)),
samp_distD = round(rbeta(n=1000, 2,8)*10),
samp_distE = round(rnorm(n=1000, mean=8, sd=3))
)
# define number of samples to be drawn for each group
n.times <- c(20,15,35,8,6)
# define weights to be used for sampling from dists
probs <- tibble(A = c(0.80, 0.05, 0.05, 0.05, 0.05),
B = c(0.05, 0.80, 0.05, 0.05, 0.05),
C = c(0.05, 0.05, 0.80, 0.05, 0.05),
D = c(0.05, 0.05, 0.05, 0.80, 0.80),
E = c(0.05, 0.05, 0.05, 0.05, 0.80)
)
# sample from dists, n.times, and using probs as weights...
output <- map2(sample, size=n.times, weight=probs, tbl=dists)
#...doesn't work
Any suggestions would be much appreciated.
set.seed(123)
map2(
n.times,
map(probs, rep, each = nrow(dists)),
sample, x = flatten_dbl(dists), replace = TRUE
)
# [[1]]
# [1] 15 13 18 6 15 15 12 8 9 12 7 17 14 12 15 10 18 19 24 24
#
# [[2]]
# [1] 12 2 15 16 14 17 11 11 10 12 6 19 13 12 13
#
# [[3]]
# [1] 10 9 16 12 13 11 10 18 14 19 16 16 12 19 4 15 19 19 13 14 15 10 14 12 10
# [26] 8 18 19 7 8 21 8 19 10 9
#
# [[4]]
# [1] 3 3 2 15 1 4 14 2
#
# [[5]]
# [1] 9 14 10 6 12 8
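To see why rep(..., each = nrow(dists)) lines the weights up with the pooled values, here is a minimal sketch on a toy data frame. The names toy, w, pooled and weights are made up for illustration and are not part of the question's data; base R's unlist() stands in for flatten_dbl().

```r
# Toy stand-in for `dists`: two "distributions" with three values each
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# One weight per column, like a single column of `probs`
w <- c(0.9, 0.1)

# Pool the columns into one vector, column by column
pooled <- unlist(toy, use.names = FALSE)

# Repeat each column's weight once per row, so weights[i] belongs to pooled[i]
weights <- rep(w, each = nrow(toy))

# Show the value/weight pairing side by side
data.frame(pooled, weights)

set.seed(1)
draws <- sample(pooled, size = 1000, replace = TRUE, prob = weights)

# Share of draws that came from column `a`; should be close to 0.9
mean(draws %in% toy$a)
```

Because every column has the same number of rows, giving each value its column's weight means that each individual draw lands in a given column with that column's probability.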
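Note: I have doubts about your reply to MrFlick's comment: "there is an 80% chance that all values are selected from samp_distA". To me it is more intuitive to go the other way: "each value has an 80% chance of coming from samp_distA"... and that is what I did. Can you confirm that you want the former?
Base R equivalent: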
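For comparison, the other reading (a single distribution is chosen per group with the given weights, and then every value in that group comes from it) could be sketched roughly as below. This is only an illustration under that assumption: sample_one_dist is a hypothetical helper, and toy_dists, toy_probs and toy_n are small stand-ins for dists, probs and n.times.

```r
set.seed(42)

# Small stand-ins for `dists`, `probs` and `n.times` (illustrative values only)
toy_dists <- data.frame(
  samp_distA = round(rnorm(100, mean = 17, sd = 4)),
  samp_distB = round(rnorm(100, mean = 13, sd = 4))
)
toy_probs <- data.frame(A = c(0.8, 0.2), B = c(0.2, 0.8))
toy_n     <- c(5, 3)

# Hypothetical helper: choose ONE column using the group's weights,
# then draw all `n` values from that column only
sample_one_dist <- function(n, p) {
  chosen <- sample(names(toy_dists), size = 1, prob = p)
  sample(toy_dists[[chosen]], size = n, replace = TRUE)
}

# One element per group; Map() walks toy_n and the columns of toy_probs in parallel
output_one <- Map(sample_one_dist, toy_n, toy_probs)
lengths(output_one)
```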
set.seed(123)
mapply(
sample,
n.times,
lapply(probs, rep, each = nrow(dists)),
MoreArgs = list(x = unlist(dists, use.names = FALSE), replace = TRUE)
)
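Edit
Regarding your follow-up question in the comments ("run the function multiple times for each person, e.g. so that in the output person A has 10 lists of randomly sampled values, each of length 20 (and similarly for persons B, C, D and E, perhaps with a different predefined number of lists for each person)"):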
n.reps <- c(A = 10, B = 1, C = 3, D = 2, E = 1)
set.seed(123)
pmap(
list(n.reps, n.times, map(probs, rep, each = nrow(dists))),
function(.x, .y, .z) replicate(
.x,
sample(flatten_dbl(dists), .y, replace = TRUE, .z),
simplify = FALSE
)
)
# $A
# $A[[1]]
# [1] 15 20 16 20 16 14 17 20 21 22 18 19 15 14 18 19 16 20 9 16
#
# $A[[2]]
# [1] 13 9 11 19 25 19 11 18 16 19 16 21 15 12 11 11 9 13 20 1
#
# $A[[3]]
# [1] 15 20 13 20 13 11 16 16 14 19 18 10 21 11 12 16 18 10 20 14
#
# $A[[4]]
# [1] 16 19 14 11 17 9 20 11 19 13 11 16 8 11 10 18 27 22 20 4
#
# $A[[5]]
# [1] 12 18 16 19 13 13 23 19 21 14 22 8 9 19 16 19 9 14 13 20
#
# $A[[6]]
# [1] 18 26 16 15 21 17 15 19 14 18 19 25 5 16 7 19 21 15 23 16
#
# $A[[7]]
# [1] 12 26 20 12 7 5 13 14 19 7 16 12 11 27 22 18 11 17 11 16
#
# $A[[8]]
# [1] 21 18 24 22 18 0 15 3 9 16 16 11 16 20 22 18 18 20 16 21
#
# $A[[9]]
# [1] 15 20 11 16 16 21 12 20 17 9 18 10 22 17 12 0 18 16 23 20
#
# $A[[10]]
# [1] 16 22 15 4 7 19 18 13 15 1 7 18 21 1 20 21 15 12 20 15
#
#
# $B
# $B[[1]]
# [1] 9 5 8 17 9 10 7 13 12 11 9 21 10 15 12
#
#
# $C
# $C[[1]]
# [1] 15 15 16 13 19 14 16 15 11 15 19 16 19 12 6 12 10 12 1 18 9 10 18 11 19
# [26] 9 6 19 18 12 9 18 14 12 7
#
# $C[[2]]
# [1] 5 14 16 10 8 13 8 18 22 18 14 12 13 10 19 12 15 10 16 13 16 9 15 6 15
# [26] 14 4 9 11 11 3 15 18 10 14
#
# $C[[3]]
# [1] 13 8 12 9 6 9 2 7 8 12 2 11 20 10 1 14 14 11 11 1 13 13 18 14 12
# [26] 21 11 3 7 7 13 13 11 7 14
#
#
# $D
# $D[[1]]
# [1] 11 1 1 7 12 6 0 8
#
# $D[[2]]
# [1] 4 1 7 15 2 2 8 9
#
#
# $E
# $E[[1]]
# [1] 7 8 6 11 10 6
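This seems doable with purrr, but it takes some setup, especially since there is no sample2 function (that I know of) that picks a distribution according to a probability vector and then draws a random sample from that subset.
To do this with purrr we have to loop twice: the outer loop goes through each person using a simple numeric index; inside that loop, we iterate n.times to draw a random sample from the appropriate distribution each time.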
# prep data ---------------------------------------------------------------
# pull all the controls into a single data frame
controldf <- tibble(
cols = c(1:5), n.times
) %>%
bind_cols(probs %>%
t %>%
as_tibble %>%
setNames(c("distA", "distB", "distC", "distD", "distE"))
)
# turn the distributions into long form
longdists <- dists %>%
gather(dist, val)
distnames <- c("A", "B", "C", "D", "E")
# function to do the work ---------------------------------------------------------------
getdist <- function(i) {
# get the probabilities as a numeric vector
myprobs <- controldf[i,3:7] %>% as.numeric
# how many samples do we need
myn <- controldf[[i,2]]
# use our probabilities to decide what distribution to grab from
samplestoget <- sample(distnames, myn, prob = myprobs, replace = TRUE) %>%
paste0("samp_dist", .)
# loop through our list of distributions to grab from
map_dbl(samplestoget, ~filter(
# filter on distribution key
longdists, dist == .x
) %>%
# from that distribution, select a single value at random
sample_n(1) %>%
# extract the numeric value
pluck('val') )
}
# get the values by running the function over our indexes -------------------------
results <- map(controldf$cols, ~ getdist(.x))
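The same two-step idea (pick a distribution for each draw, then pick a value from it) can also be sketched without filtering the long table once per draw. The version below is an assumption-laden illustration in base R, not the code above: two_step_sample is a hypothetical helper, and toy_dists, toy_probs and toy_n are small stand-ins for dists, probs and n.times.

```r
set.seed(42)

# Small stand-ins for `dists`, `probs` and `n.times` (illustrative values only)
toy_dists <- data.frame(
  samp_distA = round(rnorm(100, mean = 17, sd = 4)),
  samp_distB = round(rnorm(100, mean = 13, sd = 4))
)
toy_probs <- data.frame(A = c(0.8, 0.2), B = c(0.2, 0.8))
toy_n     <- c(5, 3)

# Hypothetical helper: for each of the `n` draws, pick a column using the
# weights `p`, then pick a row uniformly at random from that column
two_step_sample <- function(n, p, tbl = toy_dists) {
  m    <- as.matrix(tbl)
  cols <- sample(ncol(m), size = n, replace = TRUE, prob = p)
  rows <- sample(nrow(m), size = n, replace = TRUE)
  m[cbind(rows, cols)]  # matrix indexing: one (row, col) pair per draw
}

# One numeric vector per group
results2 <- Map(two_step_sample, toy_n, toy_probs)
lengths(results2)
```

When all columns have the same number of rows, this should give the same sampling distribution as weighting the pooled values with rep(..., each = nrow(dists)).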