purrr; sample from multiple columns with probability list
Say I want to take a sample of values of variable length from an arbitrary number of different probability distributions, and with a weighted probability of sampling from each distribution.
It seems like I should be able to do this using purrr's map functions, but I am struggling...
library(tidyverse)
set.seed(20171127)
# sample from 5 different probability distributions
dists <- tibble(
samp_distA = round(rnorm(n=1000, mean=17, sd=4)),
samp_distB = round(rnorm(n=1000, mean=13, sd=4)),
samp_distC = round(rnorm(n=1000, mean=13, sd=4)),
samp_distD = round(rbeta(n=1000, 2,8)*10),
samp_distE = round(rnorm(n=1000, mean=8, sd=3))
)
# define number of samples to be drawn for each group
n.times <- c(20,15,35,8,6)
# define weights to be used for sampling from dists
probs <- tibble(A = c(0.80, 0.05, 0.05, 0.05, 0.05),
B = c(0.05, 0.80, 0.05, 0.05, 0.05),
C = c(0.05, 0.05, 0.80, 0.05, 0.05),
D = c(0.05, 0.05, 0.05, 0.80, 0.80), # sums to 1.75; sample() rescales prob, but the last 0.80 may be a typo for 0.05
E = c(0.05, 0.05, 0.05, 0.05, 0.80)
)
# sample from dists, n.times, and using probs as weights...
output <- map2(sample, size=n.times, weight=probs, tbl=dists)
#...doesn't work
Any suggestions gratefully received.
set.seed(123)
map2(
n.times,
map(probs, rep, each = nrow(dists)),
sample, x = flatten_dbl(dists), replace = TRUE
)
# [[1]]
# [1] 15 13 18 6 15 15 12 8 9 12 7 17 14 12 15 10 18 19 24 24
#
# [[2]]
# [1] 12 2 15 16 14 17 11 11 10 12 6 19 13 12 13
#
# [[3]]
# [1] 10 9 16 12 13 11 10 18 14 19 16 16 12 19 4 15 19 19 13 14 15 10 14 12 10
# [26] 8 18 19 7 8 21 8 19 10 9
#
# [[4]]
# [1] 3 3 2 15 1 4 14 2
#
# [[5]]
# [1] 9 14 10 6 12 8
NB: I'm dubious about your answer to MrFlick's comment: "an 80% chance of selecting all values from samp_distA". To me it is much more intuitive to go the other route: "an 80% chance for each of the 10 values to come from samp_distA"... so that's what I did. Do you confirm you want the former?
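For comparison, the former reading (one distribution is picked per group according to that group's weights, then every value for the group is drawn from it) could be sketched in base R roughly as follows. This is only an illustration: sample_whole_group is a hypothetical helper name, the inputs are rebuilt so the snippet runs on its own, and the last weight of D is assumed to be 0.05.

```r
set.seed(123)

# Stand-in inputs with the same shape as in the question (rebuilt so this runs alone)
dists <- data.frame(
  samp_distA = round(rnorm(1000, mean = 17, sd = 4)),
  samp_distB = round(rnorm(1000, mean = 13, sd = 4)),
  samp_distC = round(rnorm(1000, mean = 13, sd = 4)),
  samp_distD = round(rbeta(1000, 2, 8) * 10),
  samp_distE = round(rnorm(1000, mean = 8, sd = 3))
)
n.times <- c(20, 15, 35, 8, 6)
# One weight vector per group; D's last weight is an assumption (0.05)
probs <- list(
  A = c(0.80, 0.05, 0.05, 0.05, 0.05),
  B = c(0.05, 0.80, 0.05, 0.05, 0.05),
  C = c(0.05, 0.05, 0.80, 0.05, 0.05),
  D = c(0.05, 0.05, 0.05, 0.80, 0.05),
  E = c(0.05, 0.05, 0.05, 0.05, 0.80)
)

# Hypothetical helper: choose ONE column using the weights w,
# then draw all n values from that single column
sample_whole_group <- function(n, w) {
  col <- sample(ncol(dists), size = 1, prob = w)
  sample(dists[[col]], size = n, replace = TRUE)
}

former <- Map(sample_whole_group, n.times, probs)
lengths(former)  # one length per group, matching n.times
```

Under this reading a group's values all share one source distribution, whereas the answer above lets each individual value come from a different distribution.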
Base R equivalent:
set.seed(123)
mapply(
sample,
n.times,
lapply(probs, rep, each = nrow(dists)),
MoreArgs = list(x = unlist(dists, use.names = FALSE), replace = TRUE),
# always return a list, even if every entry of n.times happens to be equal
SIMPLIFY = FALSE
)
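Both versions rely on the flattened values and the expanded weights lining up element by element. The toy sketch below (hypothetical data, two "distributions" of three values each) shows why rep(..., each = nrow(...)) gives every value in a column that column's weight:

```r
# Toy stand-in: 2 distributions with 3 values each (hypothetical data)
toy <- data.frame(d1 = c(1, 2, 3), d2 = c(10, 20, 30))
w <- c(0.9, 0.1)  # weight of d1, weight of d2

# Flatten column by column, and repeat each weight once per row
values  <- unlist(toy, use.names = FALSE)  # d1's values first, then d2's
weights <- rep(w, each = nrow(toy))        # d1's weight 3 times, then d2's

# Each value is now paired with the weight of the column it came from
data.frame(values, weights)

# sample() rescales prob internally, so d1 as a whole keeps 90% of the mass
set.seed(1)
draws <- sample(values, size = 10000, replace = TRUE, prob = weights)
mean(draws < 10)  # share of draws from d1; should be close to 0.9
```

This only works as intended because every column has the same number of rows; with unequal lengths the per-column weights would need to be divided by the column length first.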
Edit
Re your follow-up question in a comment ("run the function multiple times for each person, eg so that as output, person A had 10 lists of randomly-sampled values, each of length 20 (and similar for persons B, C, D, and E, perhaps with each person having a predefined different number of lists)"):
n.reps <- c(A = 10, B = 1, C = 3, D = 2, E = 1)
set.seed(123)
pmap(
list(n.reps, n.times, map(probs, rep, each = nrow(dists))),
function(.x, .y, .z) replicate(
.x,
sample(flatten_dbl(dists), size = .y, replace = TRUE, prob = .z),
simplify = FALSE
)
)
# $A
# $A[[1]]
# [1] 15 20 16 20 16 14 17 20 21 22 18 19 15 14 18 19 16 20 9 16
#
# $A[[2]]
# [1] 13 9 11 19 25 19 11 18 16 19 16 21 15 12 11 11 9 13 20 1
#
# $A[[3]]
# [1] 15 20 13 20 13 11 16 16 14 19 18 10 21 11 12 16 18 10 20 14
#
# $A[[4]]
# [1] 16 19 14 11 17 9 20 11 19 13 11 16 8 11 10 18 27 22 20 4
#
# $A[[5]]
# [1] 12 18 16 19 13 13 23 19 21 14 22 8 9 19 16 19 9 14 13 20
#
# $A[[6]]
# [1] 18 26 16 15 21 17 15 19 14 18 19 25 5 16 7 19 21 15 23 16
#
# $A[[7]]
# [1] 12 26 20 12 7 5 13 14 19 7 16 12 11 27 22 18 11 17 11 16
#
# $A[[8]]
# [1] 21 18 24 22 18 0 15 3 9 16 16 11 16 20 22 18 18 20 16 21
#
# $A[[9]]
# [1] 15 20 11 16 16 21 12 20 17 9 18 10 22 17 12 0 18 16 23 20
#
# $A[[10]]
# [1] 16 22 15 4 7 19 18 13 15 1 7 18 21 1 20 21 15 12 20 15
#
#
# $B
# $B[[1]]
# [1] 9 5 8 17 9 10 7 13 12 11 9 21 10 15 12
#
#
# $C
# $C[[1]]
# [1] 15 15 16 13 19 14 16 15 11 15 19 16 19 12 6 12 10 12 1 18 9 10 18 11 19
# [26] 9 6 19 18 12 9 18 14 12 7
#
# $C[[2]]
# [1] 5 14 16 10 8 13 8 18 22 18 14 12 13 10 19 12 15 10 16 13 16 9 15 6 15
# [26] 14 4 9 11 11 3 15 18 10 14
#
# $C[[3]]
# [1] 13 8 12 9 6 9 2 7 8 12 2 11 20 10 1 14 14 11 11 1 13 13 18 14 12
# [26] 21 11 3 7 7 13 13 11 7 14
#
#
# $D
# $D[[1]]
# [1] 11 1 1 7 12 6 0 8
#
# $D[[2]]
# [1] 4 1 7 15 2 2 8 9
#
#
# $E
# $E[[1]]
# [1] 7 8 6 11 10 6
This seems doable with purrr, but it takes a bit of setup, particularly because there's no sample2 function (that I'm aware of) that picks a distribution based on a vector of probabilities and then grabs a random sample from that subset.
To do that with purrr, we have to loop twice: the outer loop goes through each person using a simple numerical index; inside that loop, we loop over that person's n.times draws to get random samples from the appropriate distribution.
# prep data ---------------------------------------------------------------
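# pull all the controls into a single data frame
# (transpose probs so each row holds one person's weights, and name the
# columns on the matrix before converting, since as.tibble is deprecated)
probmat <- t(probs)
colnames(probmat) <- c("distA", "distB", "distC", "distD", "distE")
controldf <- tibble(
cols = c(1:5), n.times
) %>%
bind_cols(as_tibble(probmat))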
# turn the distributions into long form
longdists <- dists %>%
gather(dist, val)
distnames <- c("A", "B", "C", "D", "E")
# function to do the work ---------------------------------------------------------------
getdist <- function(i) {
# get the probabilities as a numeric vector
myprobs <- controldf[i,3:7] %>% as.numeric
# how many samples do we need
myn <- controldf[[i,2]]
# use our probabilities to decide which distribution to grab from
samplestoget <- sample(distnames, myn, prob = myprobs, replace = TRUE) %>%
paste0("samp_dist", .)
# loop through our list of distributions to grab from
map_dbl(samplestoget, ~filter(
# filter on distribution key
longdists, dist == .x
) %>%
# from that distribution, select a single value at random
sample_n(1) %>%
# extract the numeric value
pluck('val') )
}
# get the values by running the function over our indexes -------------------------
results <- map(controldf$cols, ~ getdist(.x))
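The same two-stage idea (pick a distribution for each draw, then pick a value from it) can also be written without filtering the long table once per draw. The sketch below is one possible vectorised variant in base R, not a drop-in replacement for the code above: draw_two_stage is a hypothetical name, the inputs are rebuilt so it runs alone, and D's last weight is assumed to be 0.05.

```r
set.seed(123)

# Stand-in inputs with the same shape as in the question (rebuilt so this runs alone)
m <- cbind(
  samp_distA = round(rnorm(1000, mean = 17, sd = 4)),
  samp_distB = round(rnorm(1000, mean = 13, sd = 4)),
  samp_distC = round(rnorm(1000, mean = 13, sd = 4)),
  samp_distD = round(rbeta(1000, 2, 8) * 10),
  samp_distE = round(rnorm(1000, mean = 8, sd = 3))
)
n.times <- c(20, 15, 35, 8, 6)
# One weight vector per person; D's last weight is an assumption (0.05)
probs <- list(
  A = c(0.80, 0.05, 0.05, 0.05, 0.05),
  B = c(0.05, 0.80, 0.05, 0.05, 0.05),
  C = c(0.05, 0.05, 0.80, 0.05, 0.05),
  D = c(0.05, 0.05, 0.05, 0.80, 0.05),
  E = c(0.05, 0.05, 0.05, 0.05, 0.80)
)

# Hypothetical helper: for each of the n draws, pick a column using the
# weights w, pick a row uniformly, then read the value at (row, column)
draw_two_stage <- function(n, w, m) {
  cols <- sample(ncol(m), size = n, replace = TRUE, prob = w)
  rows <- sample(nrow(m), size = n, replace = TRUE)
  m[cbind(rows, cols)]  # matrix indexing by (row, column) pairs
}

results2 <- mapply(draw_two_stage, n.times, probs,
                   MoreArgs = list(m = m), SIMPLIFY = FALSE)
lengths(results2)  # one length per person, matching n.times
```

Because all columns have the same number of rows, this should follow the same sampling distribution as getdist above, while making two calls to sample per person instead of one filter per draw.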