dplyr sample_n by group with unique size argument per group
I am trying to draw a stratified sample from a data set that contains a variable indicating how large the sample should be for each group.
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
In this example, grp refers to the group I want to sample by and frq is the sample size specified for that group.
Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient:
s <- split(df, df$grp)
lapply(s,function(x) sample_n(x, size = unique(x$frq))) %>%
do.call(what = rbind)
Is there a way to do this using just dplyr's group_by and sample_n?
My first thought was:
df %>% group_by(grp) %>% sample_n(size = frq)
but this gives the error:
Error in is_scalar_integerish(size) : object 'frq' not found
This works:
df %>% group_by(grp) %>% sample_n(frq[1])
# A tibble: 9 x 3
# Groups: grp [3]
id grp frq
<int> <int> <dbl>
1 3 1 3
2 4 1 3
3 2 1 3
4 6 2 2
5 8 2 2
6 13 3 4
7 14 3 4
8 12 3 4
9 11 3 4
Not sure why it didn't work when you tried it.
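A closely related variant, offered only as a sketch and not as part of the original answer: the same per-group draw can be written with slice() and n(), drawing frq[1] row positions out of each group's n() rows. The seed and the name res are illustrative assumptions.

```r
library(dplyr)

# example data from the question
df <- data.frame(id = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

set.seed(1)  # arbitrary seed, only so the draw is reproducible
res <- df %>%
  group_by(grp) %>%
  slice(sample(n(), frq[1])) %>%  # pick frq[1] of the n() row positions in each group
  ungroup()
```

Each group then contributes as many rows as its frq value (3, 2 and 4 here), which can be checked with table(res$grp).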
library(tidyverse)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
set.seed(22)
df %>%
group_by(grp) %>% # for each group
nest() %>% # nest data
mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
unnest(v) # unnest the sampled values
# # A tibble: 9 x 2
# grp id
# <int> <int>
# 1 1 2
# 2 1 5
# 3 1 3
# 4 2 8
# 5 2 9
# 6 3 14
# 7 3 13
# 8 3 15
# 9 3 11
The function sample_n works if you pass it a data frame of ids (not a vector of ids) and a single frequency value for each group.
An alternative version using map2 and generating the inputs for sample_n in advance:
df %>%
group_by(grp) %>% # for every group
summarise(d = list(data.frame(id=id)), # create a data frame of ids
frq = unique(frq)) %>% # get the unique frq value
mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
unnest(v) %>% # unnest sampled values
select(-frq) # remove frq column (if needed)
The following answer is not recommended; it just shows a different approach, without nests or maps, that some people might find easier to follow. It may be useful to someone who is working with a smallish data set, wants to do something slightly different from the original question, doesn't have time to experiment with functions they don't fully understand, and isn't too worried about efficiency. You just need to recall the behaviour of the sample function in base R: when given a single positive integer x, it returns a random permutation of the integers 1:x.
> sample(5)
[1] 5 1 4 2 3
If we had five elements, we could obtain a random sample of size three by selecting only the positions where 1, 2 and 3 landed in the permutation; in this case we'd pick the second, fourth and fifth elements. All clear? We can then do the same within each group: assign random integers from 1 to the group size, and take as our sample the rows whose random id is less than or equal to the desired sample size for that group.
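The selection step just described can be seen in isolation in base R. This is a minimal sketch; the vector x, the seed, and the names perm and picked are made up for illustration.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- c("a", "b", "c", "d", "e")   # five hypothetical elements

perm <- sample(length(x))         # random permutation of 1:5, one number per position
picked <- x[perm <= 3]            # keep the positions that received 1, 2 or 3
```

Whatever the permutation turns out to be, picked always holds exactly three distinct elements of x. The dplyr code that follows applies the same idea within each group.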
library(tidyverse)
# The iris data set has three different species
# I want to sample 2, 5 and 3 flowers respectively from each
sample_sizes <- data.frame(
Species = unique(iris$Species),
n_to_sample = c(2, 5, 3)
)
iris %>%
left_join(sample_sizes, by = "Species") %>% # adds column for how many to sample from this species
group_by(Species) %>% # each species is a group, the size of the group can be found by n()
mutate(random_id = sample(n())) %>% # give each flower in the group a random id between 1 and n()
ungroup() %>%
filter(random_id <= n_to_sample)
Which gave me the output:
# A tibble: 10 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n_to_sample random_id
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <int>
1 4.9 3.1 1.5 0.1 setosa 2 1
2 5.7 4.4 1.5 0.4 setosa 2 2
3 6.2 2.2 4.5 1.5 versicolor 5 3
4 6.3 2.5 4.9 1.5 versicolor 5 2
5 6.4 2.9 4.3 1.3 versicolor 5 5
6 6 2.9 4.5 1.5 versicolor 5 4
7 5.5 2.4 3.8 1.1 versicolor 5 1
8 7.3 2.9 6.3 1.8 virginica 3 1
9 7.2 3 5.8 1.6 virginica 3 3
10 6.2 3.4 5.4 2.3 virginica 3 2
You can of course pipe through to select(-random_id, -n_to_sample) if you no longer have any use for the final two columns, but I left them in so it's clearer from the output how the code worked.
For the example data given in the question:
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
df %>%
group_by(grp) %>%
mutate(random_id = sample(n())) %>%
ungroup() %>%
filter(random_id <= frq) %>%
select(-random_id)
# A tibble: 9 x 3
id grp frq
<int> <int> <dbl>
1 1 1 3
2 2 1 3
3 3 1 3
4 8 2 2
5 9 2 2
6 11 3 4
7 12 3 4
8 13 3 4
9 15 3 4
NB: if you're a safety fanatic and x might be zero, and you want to guarantee that the output has exactly length x, you're better off using sample(seq_len(x)) than sample(x). That way you get the zero-length vector integer(0) rather than the length-one vector 0 when x is zero. In my code, the mutate will never be working on a row for which n() is zero (if n() were zero the group would be empty, so there couldn't be a row there), so this isn't a problem. It's just something to be aware of if you take this approach somewhere else.
Benchmarks for comparison:
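To make the zero case concrete, here is a small base R sketch; the names a and b are illustrative only.

```r
x <- 0

a <- sample(x)           # x is not >= 1, so sample() treats 0 as the population itself
b <- sample(seq_len(x))  # seq_len(0) is integer(0), so nothing is drawn

length(a)  # 1
length(b)  # 0
```

With x at least 1 the two calls behave the same way, each returning a permutation of 1:x.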
f1 <- function(df) { # @AntoniosK with nest and map
df %>%
group_by(grp) %>% # for each group
nest() %>% # nest data
mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
unnest(v) # unnest the sampled values
}
f2 <- function(df) { # @AntoniosK with nest and map2
df %>%
group_by(grp) %>% # for every group
summarise(d = list(data.frame(id=id)), # create a data frame of ids
frq = unique(frq)) %>% # get the unique frq value
mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
unnest(v) %>% # unnest sampled values
select(-frq) # remove frq column (if needed)
}
f3 <- function(df) { # @thc
df %>% group_by(grp) %>% sample_n(frq[1])
}
f4 <- function(df) { # @Silverfish
df %>%
group_by(grp) %>%
mutate(random_id = sample(n())) %>%
ungroup() %>%
filter(random_id <= frq) %>%
select(-random_id)
}
# example data of variable size
df_n <- function(n) {
data.frame(id = seq_len(3*n),
grp = rep(1:3,each = n),
frq = rep(c(3,2,4), each = n))
}
library(tidyverse) # f1 and f2 use purrr (map, map2) and tidyr (nest, unnest)
library(microbenchmark)
microbenchmark(f1(df_n(1e3)), f2(df_n(1e3)), f3(df_n(1e3)), f4(df_n(1e3)),
f1(df_n(1e6)), f2(df_n(1e6)), f3(df_n(1e6)), f4(df_n(1e6)),
times=20)
Results strongly favour @thc's df %>% group_by(grp) %>% sample_n(frq[1]), both for data frames with a few thousand rows and for data frames with a few million rows. My naive approach (f4) takes two to three times as long. The faster of @AntoniosK's two solutions is the map2 version (f2), which is slower than mine on the smaller data frames but faster on the larger ones.
Unit: milliseconds
expr min lq mean median uq max neval
f1(df_n(1000)) 12.0007 12.27295 12.479760 12.34190 12.46475 13.6403 20
f2(df_n(1000)) 9.5841 9.82185 9.905120 9.87820 9.98865 10.2993 20
f3(df_n(1000)) 1.3729 1.53470 1.593015 1.56755 1.68910 1.8456 20
f4(df_n(1000)) 3.1732 3.21600 3.558855 3.27500 3.57350 5.4715 20
f1(df_n(1e+06)) 1582.3807 1695.15655 1699.288195 1714.13435 1727.53300 1744.2654 20
f2(df_n(1e+06)) 323.3649 336.94280 407.581130 346.95390 463.69935 911.6647 20
f3(df_n(1e+06)) 216.3265 235.85830 268.756465 247.63620 259.02640 395.9372 20
f4(df_n(1e+06)) 641.5119 663.03510 737.089355 682.69730 803.98205 1132.6586 20