dplyr sample_n from a single group
I have some data, and a summary of the number of observations looks like this:
# A tibble: 14 x 3
# Groups: status [2]
status year n
<dbl> <dbl> <int>
1 0 2010 4593
2 0 2011 10990
3 0 2012 27711
4 0 2013 99989
5 0 2014 95407
6 0 2015 89010
7 0 2016 72289
8 1 2010 584
9 1 2011 785
10 1 2012 640
11 1 2013 667
12 1 2014 377
13 1 2015 460
14 1 2016 104
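One class clearly has far more observations than the other. How can I randomly sample class 0 without touching class 1? That is, I want to keep all class 1 observations and randomly sample the class 0 observations down to 4593 per year (the smallest yearly count in class 0).
Using group_by(status, year) followed by sample_n() does not work, because 4593 is larger than the group sizes in class 1.
A random sample of my data: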
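The failure described above can be reproduced on a small scale. This is a minimal sketch on a hypothetical toy data frame (toy is not the real data, and the size of 5 stands in for 4593): sample_n() without replacement is expected to raise an error as soon as one group has fewer rows than the requested size.

```r
library(dplyr)

# Hypothetical toy data: 10 rows of class 0 and 3 rows of class 1, one year.
toy <- tibble(status = rep(c(0, 1), times = c(10, 3)), year = 2010)

# Asking for 5 rows per group fails because the class 1 group has only 3 rows.
# tryCatch() captures the error object instead of stopping the script.
res <- tryCatch(
  toy %>% group_by(status, year) %>% sample_n(size = 5),
  error = function(e) e
)

inherits(res, "error")
```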
structure(list(status = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
year = c(2013, 2014, 2012, 2013, 2016, 2013, 2015, 2014,
2013, 2016, 2015, 2016, 2011, 2014, 2016, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2012, 2016, 2016, 2012, 2016, 2015,
2013, 2014, 2015, 2013, 2015, 2015, 2014, 2015, 2011, 2014,
2013, 2012, 2011, 2016, 2015, 2015, 2015, 2014, 2012, 2013,
2015, 2012, 2015, 2016, 2015, 2013, 2014, 2014, 2014, 2013,
2013, 2016, 2016, 2013, 2015, 2012, 2014, 2014, 2013, 2015,
2014, 2016, 2016, 2014, 2012, 2016, 2013, 2010, 2011, 2014,
2016, 2013, 2016, 2014, 2014, 2013, 2013, 2013, 2016, 2016,
2012, 2014, 2013, 2015, 2016, 2013, 2013, 2015, 2013, 2014,
2013, 2015, 2013, 2013, 2011, 2014, 2016, 2013, 2010, 2012,
2014, 2012, 2011, 2011, 2013, 2015, 2014, 2010, 2010, 2013,
2010, 2014, 2011, 2011, 2014, 2013, 2014, 2015, 2015, 2013,
2014, 2013, 2011, 2013, 2014, 2013, 2011, 2013, 2012, 2015,
2012, 2012, 2012, 2010, 2013, 2013, 2011, 2011, 2011, 2012,
2016, 2013, 2011, 2011, 2012, 2012, 2014, 2010, 2013, 2014,
2011, 2012, 2010, 2012, 2012, 2011, 2015, 2011, 2011, 2013,
2015, 2010, 2015, 2011, 2015, 2015, 2012, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2014, 2010, 2011, 2013, 2014, 2012,
2013, 2016, 2014, 2012, 2012, 2013, 2010, 2012, 2013, 2014,
2014, 2011)), groups = structure(list(status = c(0, 1), .rows = structure(list(
1:100, 101:200), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), row.names = c(NA, -200L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
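I think this will work. dat is your example data frame. The code below splits the data frame by status and then uses imap to decide whether each piece needs to be sampled. If the name of the list element is "0", that piece is sampled; otherwise it is returned unchanged. You can change size = 1 to the minimum count for your actual data frame.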
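library(dplyr)
library(purrr)
dat2 <- dat %>%
  # one list element per status value, named "0" and "1"
  split(f = .$status) %>%
  # imap passes each element as x and its name as y
  imap(function(x, y){
    if (y %in% "0"){
      # downsample class 0 only, within each year
      x <- x %>%
        group_by(status, year) %>%
        sample_n(size = 1)
    }
    return(x)
  }) %>%
  bind_rows()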
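An alternative that avoids the split is to filter the two classes separately and stack them again. The following is only a sketch under stated assumptions: it runs on a hypothetical toy data frame rather than on dat, it uses slice_sample() (available in dplyr 1.0.0 and later) in place of sample_n(), and the names toy, n_target and sampled are illustrative.

```r
library(dplyr)

set.seed(42)

# Hypothetical toy data: class 0 has 10, 20 and 30 rows per year,
# class 1 has 4 rows per year.
toy <- tibble(
  status = rep(c(0, 1), times = c(60, 12)),
  year   = c(rep(2010:2012, times = c(10, 20, 30)), rep(2010:2012, each = 4))
)

# Smallest yearly count within class 0 (plays the role of 4593).
n_target <- toy %>%
  filter(status == 0) %>%
  count(year) %>%
  pull(n) %>%
  min()

sampled <- bind_rows(
  # class 0: draw n_target rows per year without replacement
  toy %>%
    filter(status == 0) %>%
    group_by(year) %>%
    slice_sample(n = n_target) %>%
    ungroup(),
  # class 1: keep every row
  toy %>% filter(status == 1)
)

count(sampled, status, year)
```

Because n_target is computed from class 0 only, the requested size never exceeds any group that is actually sampled, so the class 1 group sizes no longer matter.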