dplyr sample_n by group with unique size argument per group

I am trying to draw a stratified sample from a data set that contains a variable indicating how large the sample for each group should be.

library(dplyr)
# example data 
df <- data.frame(id = 1:15,
                 grp = rep(1:3,each = 5), 
                 frq = rep(c(3,2,4), each = 5))

In this example, grp refers to the group I want to sample by and frq is the sample size specified for that group.

Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient:

s <- split(df, df$grp)
lapply(s,function(x) sample_n(x, size = unique(x$frq))) %>% 
      do.call(what = rbind)

Is there a way to do this using just dplyr's group_by and sample_n?

My first thought was:

df %>% group_by(grp) %>% sample_n(size = frq)

but this gives the error:

Error in is_scalar_integerish(size) : object 'frq' not found

This works:

df %>% group_by(grp) %>% sample_n(frq[1])

# A tibble: 9 x 3
# Groups:   grp [3]
     id   grp   frq
  <int> <int> <dbl>
1     3     1     3
2     4     1     3
3     2     1     3
4     6     2     2
5     8     2     2
6    13     3     4
7    14     3     4
8    12     3     4
9    11     3     4

Not sure why it didn't work when you tried it.
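As a quick sanity check, the sketch below counts the sampled rows per group and compares them with the requested frq. The names sampled and check and the seed are only illustrative; it assumes dplyr is installed and that every row in a group carries the same frq.

```r
library(dplyr)

# example data from the question
df <- data.frame(id = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

set.seed(22)  # arbitrary seed, only for reproducibility

sampled <- df %>%
  group_by(grp) %>%
  sample_n(frq[1]) %>%   # frq[1] is evaluated within each group
  ungroup()

# compare the number of sampled rows per group with the requested size
check <- sampled %>%
  group_by(grp) %>%
  summarise(n_sampled = n(), requested = first(frq))

check
```

If the sampling worked, n_sampled and requested agree in every row of check.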

library(tidyverse)

# example data 
df <- data.frame(id = 1:15,
                 grp = rep(1:3,each = 5), 
                 frq = rep(c(3,2,4), each = 5))

set.seed(22)

df %>%
  group_by(grp) %>%   # for each group
  nest() %>%          # nest data
  mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>%  # sample using id values and (unique) frq value
  unnest(v)           # unnest the sampled values

# # A tibble: 9 x 2
#     grp    id
#   <int> <int>
# 1     1     2
# 2     1     5
# 3     1     3
# 4     2     8
# 5     2     9
# 6     3    14
# 7     3    13
# 8     3    15
# 9     3    11

The function sample_n works if you pass it a data frame of ids (not a vector of ids) and a single frequency value for each group.

An alternative version using map2 and generating the inputs for sample_n in advance:

df %>%
  group_by(grp) %>%                                 # for every group
  summarise(d = list(data.frame(id=id)),            # create a data frame of ids
            frq = unique(frq)) %>%                  # get the unique frq value
  mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>%   # sample using data frame of ids and frq value
  unnest(v) %>%                                     # unnest sampled values
  select(-frq)                                      # remove frq column (if needed)

The following answer is not recommended; it just shows a different approach without nests/maps that some people might find more comprehensible. It may be of use to someone working with a smallish data set who wants to do something slightly different from the original question, is a bit scared of (or doesn't have time to play around with) functions they don't really understand, and isn't too worried about efficiency. You just need to recall the behaviour of the sample function in base R: when given a single positive integer argument x, it returns a random permutation of the integers 1:x.

> sample(5)
[1] 5 1 4 2 3

If we had five elements, we could then obtain a random sample of size three by selecting only the positions where 1, 2 and 3 ended up in the permutation. In this case we'd pick the second, fourth and fifth elements. All clear? We can do the same within each group: assign random integers from 1 to the group size, and choose as our sample the rows where the random id is less than or equal to the desired sample size for that group.
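A minimal base R sketch of that idea, using a hypothetical five-element vector x and an arbitrary seed (neither comes from the question):

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- c("a", "b", "c", "d", "e")  # hypothetical five elements
perm <- sample(length(x))        # random permutation of 1:5
picked <- x[perm <= 3]           # keep the positions that received 1, 2 or 3
picked
```

Whatever the permutation turns out to be, exactly three distinct elements of x are kept.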

library(tidyverse)

# The iris data set has three different species
# I want to sample 2, 5 and 3 flowers respectively from each
sample_sizes <- data.frame(
  Species = unique(iris$Species),
  n_to_sample = c(2, 5, 3)
)

iris %>%
  left_join(sample_sizes, by = "Species") %>% # adds column for how many to sample from this species
  group_by(Species) %>% # each species is a group, the size of the group can be found by n()
  mutate(random_id = sample(n())) %>% # give each flower in the group a random id between 1 and n()
  ungroup() %>%
  filter(random_id <= n_to_sample)

Which gave me the output:

# A tibble: 10 x 7
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    n_to_sample random_id
          <dbl>       <dbl>        <dbl>       <dbl> <fct>            <dbl>     <int>
 1          4.9         3.1          1.5         0.1 setosa               2         1
 2          5.7         4.4          1.5         0.4 setosa               2         2
 3          6.2         2.2          4.5         1.5 versicolor           5         3
 4          6.3         2.5          4.9         1.5 versicolor           5         2
 5          6.4         2.9          4.3         1.3 versicolor           5         5
 6          6           2.9          4.5         1.5 versicolor           5         4
 7          5.5         2.4          3.8         1.1 versicolor           5         1
 8          7.3         2.9          6.3         1.8 virginica            3         1
 9          7.2         3            5.8         1.6 virginica            3         3
10          6.2         3.4          5.4         2.3 virginica            3         2

You can of course pipe through to select(-random_id, -n_to_sample) if you no longer have any use for the final two columns, but I left them in so it's clearer from the output how the code worked.

For the example data given in the question:

library(dplyr)
# example data 
df <- data.frame(id = 1:15,
                 grp = rep(1:3,each = 5), 
                 frq = rep(c(3,2,4), each = 5))

df %>%
  group_by(grp) %>%
  mutate(random_id = sample(n())) %>%
  ungroup() %>%
  filter(random_id <= frq) %>%
  select(-random_id)

# A tibble: 9 x 3
     id   grp   frq
  <int> <int> <dbl>
1     1     1     3
2     2     1     3
3     3     1     3
4     8     2     2
5     9     2     2
6    11     3     4
7    12     3     4
8    13     3     4
9    15     3     4

NB if you're a safety fanatic and x might be zero, and you want to guarantee that the length of the output is definitely the same as x, you're better off using sample(seq_len(x)) than sample(x). That way you get the zero-length vector integer(0) rather than the length-one vector 0 in the case where x is zero. In my code, the mutate will never be working on a row for which n() is zero (if n() were zero then that group would be empty, so there couldn't be a row there), so this isn't a problem. It's just something to be aware of if you're taking this approach somewhere else.
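A short base R sketch of that edge case; the variable names are only illustrative:

```r
x <- 0

unsafe <- sample(x)          # sample(0) returns the single value 0
safe <- sample(seq_len(x))   # sampling from an empty vector returns integer(0)

length(unsafe)  # 1, not the 0 elements we asked for
length(safe)    # 0
```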


Benchmarks for comparison:

f1 <- function(df) { # @AntoniosK with nest and map
  df %>%
    group_by(grp) %>%   # for each group
    nest() %>%          # nest data
    mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>%  # sample using id values and (unique) frq value
    unnest(v)           # unnest the sampled values
}

f2 <- function(df) { # @AntoniosK with summarise and map2
  df %>%
    group_by(grp) %>%                                 # for every group
    summarise(d = list(data.frame(id=id)),            # create a data frame of ids
              frq = unique(frq)) %>%                  # get the unique frq value
    mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>%   # sample using data frame of ids and frq value
    unnest(v) %>%                                     # unnest sampled values
    select(-frq)                                      # remove frq column (if needed)
}

f3 <- function(df) { # @thc
  df %>% group_by(grp) %>% sample_n(frq[1])
}

f4 <- function(df) { # @Silverfish
  df %>%
    group_by(grp) %>%
    mutate(random_id = sample(n())) %>%
    ungroup() %>%
    filter(random_id <= frq) %>%
    select(-random_id)
}


# example data of variable size

df_n <- function(n) {
  data.frame(id = seq_len(3*n),
             grp = rep(1:3,each = n), 
             frq = rep(c(3,2,4), each = n))
}

library(microbenchmark)
microbenchmark(f1(df_n(1e3)), f2(df_n(1e3)), f3(df_n(1e3)), f4(df_n(1e3)),
               f1(df_n(1e6)), f2(df_n(1e6)), f3(df_n(1e6)), f4(df_n(1e6)),
               times=20)

Results strongly favour @thc's df %>% group_by(grp) %>% sample_n(frq[1]), both for data frames with a few thousand rows and for those with a few million. My naive approach takes two or three times as long, and the faster of @AntoniosK's solutions is the one with map2 (worse than mine for smaller data frames but better for the larger ones).

Unit: milliseconds
            expr       min         lq        mean     median         uq       max neval
  f1(df_n(1000))   12.0007   12.27295   12.479760   12.34190   12.46475   13.6403    20
  f2(df_n(1000))    9.5841    9.82185    9.905120    9.87820    9.98865   10.2993    20
  f3(df_n(1000))    1.3729    1.53470    1.593015    1.56755    1.68910    1.8456    20
  f4(df_n(1000))    3.1732    3.21600    3.558855    3.27500    3.57350    5.4715    20
 f1(df_n(1e+06)) 1582.3807 1695.15655 1699.288195 1714.13435 1727.53300 1744.2654    20
 f2(df_n(1e+06))  323.3649  336.94280  407.581130  346.95390  463.69935  911.6647    20
 f3(df_n(1e+06))  216.3265  235.85830  268.756465  247.63620  259.02640  395.9372    20
 f4(df_n(1e+06))  641.5119  663.03510  737.089355  682.69730  803.98205 1132.6586    20
