purrr; 来自具有概率列表的多列的样本

Question

Say I want to take a sample of values of variable length from an arbitrary number of different probability distributions, and with a weighted probability of sampling from each distribution. 假设我想从任意数量的不同概率分布中获取可变长度值的样本，并使用每个分布的加权抽样概率。

Seems like I should be able to do this using purrr 's map functions, but am struggling... 似乎我应该能够使用purrr的map功能来做到这一点，但我正在努力......

library(tidyverse)

set.seed(20171127)

# sample from 5 different probability distributions
dists <- tibble(
  samp_distA = round(rnorm(n=1000, mean=17, sd=4)),
  samp_distB = round(rnorm(n=1000, mean=13, sd=4)),
  samp_distC = round(rnorm(n=1000, mean=13, sd=4)),
  samp_distD = round(rbeta(n=1000, 2,8)*10),
  samp_distE = round(rnorm(n=1000, mean=8, sd=3))
  )

# define number of samples to be drawn for each group
n.times <- c(20,15,35,8,6) 

# define weights to be used for sampling from dists 
probs <- tibble(A = c(0.80, 0.05, 0.05, 0.05, 0.05),
                B = c(0.05, 0.80, 0.05, 0.05, 0.05),
                C = c(0.05, 0.05, 0.80, 0.05, 0.05),
                D = c(0.05, 0.05, 0.05, 0.80, 0.80),
                E = c(0.05, 0.05, 0.05, 0.05, 0.80)
                )

# sample from dists, n.times, and using probs as weights...
output <- map2(sample, size=n.times, weight=probs, tbl=dists)

#...doesn't work

Any suggestions gratefully received. 任何建议都感激不尽。

Answer 1

set.seed(123)
map2(
  n.times, 
  map(probs, rep, each = nrow(dists)),
  sample, x = flatten_dbl(dists), replace = TRUE
)

# [[1]]
#  [1] 15 13 18  6 15 15 12  8  9 12  7 17 14 12 15 10 18 19 24 24
# 
# [[2]]
#  [1] 12  2 15 16 14 17 11 11 10 12  6 19 13 12 13
# 
# [[3]]
#  [1] 10  9 16 12 13 11 10 18 14 19 16 16 12 19  4 15 19 19 13 14 15 10 14 12 10
# [26]  8 18 19  7  8 21  8 19 10  9
# 
# [[4]]
# [1]  3  3  2 15  1  4 14  2
# 
# [[5]]
# [1]  9 14 10  6 12  8

NB: I'm dubious about your answer to MrFlick's comment: "an 80% chance of selecting all values from samp_distA". 注意：我对你对MrFlick评论的回答表示怀疑：“有80％的机会从samp_distA中选择所有值”。 To me it is much more intuitive to go the other route: "an 80% chance for each of the 10 values to come from samp_distA"... so that's what I did. 对我来说，走另一条路线更为直观：“10％的价值中有80％的几率来自samp_distA”......这就是我所做的。 Do you confirm you want the former? 你确认你想要前者吗？

Base R equivalent: 基准R等价物：

set.seed(123)
mapply(
  sample,
  n.times, 
  lapply(probs, rep, each = nrow(dists)),
  MoreArgs = list(x = unlist(dists, use.names = FALSE), replace = TRUE)
)

Edit 编辑

Re your follow-up question in a comment ("run the function multiple times for each person, eg so that as output, person A had 10 lists of randomly-sampled values, each of length 20 (and similar for persons B, C, D, and E, perhaps with each person having a predefined different number of lists)"): 在评论中重新提出你的后续问题（“为每个人多次运行该函数，例如，为了输出，人员A有10个随机抽样值列表，每个长度为20（对于人B，C，类似， D和E，也许每个人都有预定义的不同数量的列表）“）：

n.reps <- c(A = 10, B = 1, C = 3, D = 2, E = 1)
set.seed(123)
pmap(
  list(n.reps, n.times, map(probs, rep, each = nrow(dists))),
  function(.x, .y, .z) replicate(
    .x,
    sample(flatten_dbl(dists), .y, replace = TRUE, .z),
    simplify = FALSE
  )
)

# $A
# $A[[1]]
# [1] 15 20 16 20 16 14 17 20 21 22 18 19 15 14 18 19 16 20  9 16
# 
# $A[[2]]
# [1] 13  9 11 19 25 19 11 18 16 19 16 21 15 12 11 11  9 13 20  1
# 
# $A[[3]]
# [1] 15 20 13 20 13 11 16 16 14 19 18 10 21 11 12 16 18 10 20 14
# 
# $A[[4]]
# [1] 16 19 14 11 17  9 20 11 19 13 11 16  8 11 10 18 27 22 20  4
# 
# $A[[5]]
# [1] 12 18 16 19 13 13 23 19 21 14 22  8  9 19 16 19  9 14 13 20
# 
# $A[[6]]
# [1] 18 26 16 15 21 17 15 19 14 18 19 25  5 16  7 19 21 15 23 16
# 
# $A[[7]]
# [1] 12 26 20 12  7  5 13 14 19  7 16 12 11 27 22 18 11 17 11 16
# 
# $A[[8]]
# [1] 21 18 24 22 18  0 15  3  9 16 16 11 16 20 22 18 18 20 16 21
# 
# $A[[9]]
# [1] 15 20 11 16 16 21 12 20 17  9 18 10 22 17 12  0 18 16 23 20
# 
# $A[[10]]
# [1] 16 22 15  4  7 19 18 13 15  1  7 18 21  1 20 21 15 12 20 15
# 
# 
# $B
# $B[[1]]
# [1]  9  5  8 17  9 10  7 13 12 11  9 21 10 15 12
# 
# 
# $C
# $C[[1]]
# [1] 15 15 16 13 19 14 16 15 11 15 19 16 19 12  6 12 10 12  1 18  9 10 18 11 19
# [26]  9  6 19 18 12  9 18 14 12  7
# 
# $C[[2]]
# [1]  5 14 16 10  8 13  8 18 22 18 14 12 13 10 19 12 15 10 16 13 16  9 15  6 15
# [26] 14  4  9 11 11  3 15 18 10 14
# 
# $C[[3]]
# [1] 13  8 12  9  6  9  2  7  8 12  2 11 20 10  1 14 14 11 11  1 13 13 18 14 12
# [26] 21 11  3  7  7 13 13 11  7 14
# 
# 
# $D
# $D[[1]]
# [1] 11  1  1  7 12  6  0  8
# 
# $D[[2]]
# [1]  4  1  7 15  2  2  8  9
# 
# 
# $E
# $E[[1]]
# [1]  7  8  6 11 10  6

Answer 2

This seems doable with purrr , but it takes a bit of set up, particularly because there's not a sample2 function (that I'm aware of) that samples a distribution based on a vector of probabilities, and then grabs a random sample from that subset. 这似乎可以用于purrr ，但它需要一些设置，特别是因为没有sample2函数（我知道）根据概率向量对分布进行采样，然后从该子集中获取随机样本。

To do that with purrr , we have to loop twice: the outside loops through each person using a simple numerical index; 要用purrr做到这purrr ，我们必须循环两次：外部循环使用一个简单的数字索引通过每个人; inside that loop, we loop through the n.times to get random samples from the appropriate distribution. 在该循环内部，我们遍历n.times以从适当的分布中获得随机样本。

# prep data ---------------------------------------------------------------

# pull all the controls into a single data frame 
controldf  <- tibble(
  cols = c(1:5), n.times
  ) %>% 
  bind_cols(probs %>% 
              t %>% 
              as.tibble %>% 
              setNames(c("distA", "distB", "distC", "distD", "distE")) 
  )

# turn the distrubtions into long form
longdists  <- dists %>% 
  gather(dist, val)

distnames  <- c("A", "B", "C", "D", "E")


# function to do the work  ---------------------------------------------------------------

getdist  <- function(i) {

  # get the probabilities as a numeric vector
  myprobs  <- controldf[i,3:7] %>% as.numeric

  # how many samples do we need
  myn      <- controldf[[i,2]]

  # use our probabilties to decide what distribution to grab from
  samplestoget  <- sample(distnames, myn, prob = myprobs, replace = T) %>% 
    paste0("samp_dist", .)

  # loop through our list of distributions to grab from 
  map_dbl(samplestoget, ~filter(
    # filter on distribution key
    longdists, dist == .x
    ) %>% 
    # from that distribution, select a single value at random
    sample_n(1) %>% 
    # extract the numeric value
    pluck('val') )

}

# get the values by running the function over our indexes  -------------------------

results  <- map(controldf$cols, ~ getdist(.x))

purrr; 来自具有概率列表的多列的样本

问题描述

2 个解决方案

解决方案1
2 已采纳 2017-11-28 11:53:22

解决方案2
1 2017-11-27 19:36:34

purrr; 来自具有概率列表的多列的样本

问题描述

2 个解决方案

解决方案1 2 已采纳 2017-11-28 11:53:22

解决方案2 1 2017-11-27 19:36:34

解决方案1
2 已采纳 2017-11-28 11:53:22

解决方案2
1 2017-11-27 19:36:34