Randomly sampling according to weights using dplyr and sample_n

Question

I would like to randomly sample months according to a set of weights given by an index in a separate data frame, but the index changes according to some categorical variables.

Below is an example problem:

require(dplyr)
sim.size <- 1000
# Generating the weights for each month, and category combination
class_probs <- data_frame(categoryA=rep(letters[1:3],24),
                          categoryB=rep(LETTERS[1:2],each=36),
                          Month=rep(month.name,6),
                          MonthIndex=runif(72))


# Generating some randomly simulated categories
sim.data <- data_frame(categoryA=sample(letters[1:3],size=sim.size,replace=TRUE),
                       categoryB=sample(LETTERS[1:2],size=sim.size,replace=TRUE))

# This is where I need help
# I would like to add an extra column called Month on the end of sim.data
# that will be sampled using the class_probs data, taking into account
# both categoryA and categoryB to select the weights in MonthIndex
sim.data %>%
  group_by(categoryA,categoryB) %>%
  do(sample_n(class_probs[class_probs$categoryA==categoryA &
                          class_probs$categoryB==categoryB,  ],
              size=nrow(sim.data[sim.data$categoryA==categoryA &
                                 sim.data$categoryB==categoryB]),
              replace=TRUE,
             weight=MonthIndex)$Month)

So for each group, I would like to draw as many samples as there are occurrences of that particular combination of categoryA and categoryB, and for each occurrence I would like to sample a Month according to the MonthIndex weights in the matching subset of the class_probs data frame.

The chosen Month is then bound onto the original dataset sim.data as an extra column.

Hopefully my code is already quite close. I just need a bit of help working out which bits need to change.

Answer 1

Here's an approach with a helper function to do the sampling, then a simple mutate call in dplyr to create the new column.

Helper function:

sampler <- function(x, y, df) {

  tab <- sample_n(df %>% filter(categoryA==x, 
                  categoryB==y),
           size=1,
           replace=TRUE,
           weight=MonthIndex)

  return(tab$Month)

}
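
As a side note on the weight argument used in the helper: the values only matter relative to each other, so MonthIndex does not have to sum to 1 within a category combination. The toy data frame below is a hypothetical illustration (it is not part of the question's data) in which one row has weight 0 and therefore should never be drawn.

```r
library(dplyr)

# Hypothetical two-row table: January has weight 0, February has weight 5
toy <- tibble(Month = c("January", "February"),
              MonthIndex = c(0, 5))

# Draw 20 rows with replacement, using MonthIndex as relative weights
draws <- sample_n(toy, size = 20, replace = TRUE, weight = MonthIndex)

# Count how often each month was drawn
table(draws$Month)
```

In recent dplyr versions sample_n is marked as superseded in favour of slice_sample, so the same idea may need to be written with slice_sample(n = ..., weight_by = ..., replace = TRUE) there.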

Calling it to create a new variable:

sim.data %>%
  rowwise() %>%
  mutate(month = sampler(categoryA, categoryB, class_probs))

Result:

Source: local data frame [1,000 x 3]
Groups: <by row>

   categoryA categoryB     month
1          b         B  February
2          b         A  February
3          b         B       May
4          c         B  December
5          c         B      June
6          b         A    August
7          c         A     March
8          c         A September
9          b         A    August
10         c         A  December
..       ...       ...       ...
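
The rowwise approach filters class_probs once per row, which may get slow for large simulations. A possible alternative, sketched here under the assumption that a reasonably recent dplyr is installed (one that re-exports tibble), is to look up the weights once per categoryA/categoryB group and draw all of that group's months in a single call to base R's sample. The set.seed call and the name result are additions for illustration only.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
sim.size <- 1000

# Same layout as the question's data: weights per category/month combination
class_probs <- tibble(categoryA  = rep(letters[1:3], 24),
                      categoryB  = rep(LETTERS[1:2], each = 36),
                      Month      = rep(month.name, 6),
                      MonthIndex = runif(72))

sim.data <- tibble(categoryA = sample(letters[1:3], size = sim.size, replace = TRUE),
                   categoryB = sample(LETTERS[1:2], size = sim.size, replace = TRUE))

result <- sim.data %>%
  group_by(categoryA, categoryB) %>%
  mutate(month = {
    # Rows of class_probs that match the current group's categories
    w <- class_probs[class_probs$categoryA == categoryA[1] &
                     class_probs$categoryB == categoryB[1], ]
    # One weighted draw per row in the group
    sample(w$Month, size = n(), replace = TRUE, prob = w$MonthIndex)
  }) %>%
  ungroup()
```

This does one lookup per group (six here) instead of one per row, and every sampled month still comes only from the rows of class_probs that share the row's categoryA and categoryB.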
