dplyr sample_n from a single group
I have some data whose observation counts are summarised as follows:
# A tibble: 14 x 3
# Groups: status [2]
status year n
<dbl> <dbl> <int>
1 0 2010 4593
2 0 2011 10990
3 0 2012 27711
4 0 2013 99989
5 0 2014 95407
6 0 2015 89010
7 0 2016 72289
8 1 2010 584
9 1 2011 785
10 1 2012 640
11 1 2013 667
12 1 2014 377
13 1 2015 460
14 1 2016 104
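One class has far more observations than the other. How can I randomly sample the 0 class without doing anything to the 1 class? That is, I want to keep all observations in the 1 class and randomly downsample the 0 class to 4593 observations per year (4593 is the smallest yearly count in the 0 class).
Using group_by(status, year) followed by sample_n() does not work, because 4593 is larger than the group sizes in the 1 class.
A random sample of my data: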
structure(list(status = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
year = c(2013, 2014, 2012, 2013, 2016, 2013, 2015, 2014,
2013, 2016, 2015, 2016, 2011, 2014, 2016, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2012, 2016, 2016, 2012, 2016, 2015,
2013, 2014, 2015, 2013, 2015, 2015, 2014, 2015, 2011, 2014,
2013, 2012, 2011, 2016, 2015, 2015, 2015, 2014, 2012, 2013,
2015, 2012, 2015, 2016, 2015, 2013, 2014, 2014, 2014, 2013,
2013, 2016, 2016, 2013, 2015, 2012, 2014, 2014, 2013, 2015,
2014, 2016, 2016, 2014, 2012, 2016, 2013, 2010, 2011, 2014,
2016, 2013, 2016, 2014, 2014, 2013, 2013, 2013, 2016, 2016,
2012, 2014, 2013, 2015, 2016, 2013, 2013, 2015, 2013, 2014,
2013, 2015, 2013, 2013, 2011, 2014, 2016, 2013, 2010, 2012,
2014, 2012, 2011, 2011, 2013, 2015, 2014, 2010, 2010, 2013,
2010, 2014, 2011, 2011, 2014, 2013, 2014, 2015, 2015, 2013,
2014, 2013, 2011, 2013, 2014, 2013, 2011, 2013, 2012, 2015,
2012, 2012, 2012, 2010, 2013, 2013, 2011, 2011, 2011, 2012,
2016, 2013, 2011, 2011, 2012, 2012, 2014, 2010, 2013, 2014,
2011, 2012, 2010, 2012, 2012, 2011, 2015, 2011, 2011, 2013,
2015, 2010, 2015, 2011, 2015, 2015, 2012, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2014, 2010, 2011, 2013, 2014, 2012,
2013, 2016, 2014, 2012, 2012, 2013, 2010, 2012, 2013, 2014,
2014, 2011)), groups = structure(list(status = c(0, 1), .rows = structure(list(
1:100, 101:200), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), row.names = c(NA, -200L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
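I think this will work. dat is your example data frame. The code below splits the data frame by status, then uses imap to decide whether each piece needs sampling. If the name of the list element is "0", that piece is sampled. You can change size = 1 to the minimum count for your actual data frame.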
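library(dplyr)
library(purrr)

# Split by status, sample only the "0" piece, then recombine
dat2 <- dat %>%
  split(f = .$status) %>%
  imap(function(x, y) {
    # y is the name of the list element, i.e. the status value as a string
    if (y %in% "0") {
      x <- x %>%
        group_by(status, year) %>%
        sample_n(size = 1)
    }
    return(x)
  }) %>%
  bind_rows()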
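As a possible alternative, the same idea can be sketched without purrr by filtering the two classes separately and binding them back together. This is only a sketch under assumptions: the toy data frame and the names n_min and sampled_0 are hypothetical, and it uses slice_sample(), which supersedes sample_n() in dplyr 1.0.0 and later.

```r
library(dplyr)

# Toy data standing in for the real data frame (hypothetical values)
set.seed(42)
dat <- tibble(
  status = rep(c(0, 1), times = c(100, 20)),
  year   = sample(2010:2012, 120, replace = TRUE)
)

# Hypothetical target size: the smallest per-year count within class 0
n_min <- dat %>%
  filter(status == 0) %>%
  count(year) %>%
  pull(n) %>%
  min()

# Downsample class 0 within each year; class 1 is left untouched
sampled_0 <- dat %>%
  filter(status == 0) %>%
  group_by(year) %>%
  slice_sample(n = n_min) %>%
  ungroup()

dat2 <- bind_rows(sampled_0, filter(dat, status == 1))
```

With the real data, n_min would come out as 4593, so every year of class 0 is cut down to that size while all class 1 rows are kept.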