简体   繁体   中英

dplyr sample_n from a single group

I have some data where the summary of the number of observations looks like:

# A tibble: 14 x 3
# Groups:   status [2]
   status  year     n
    <dbl> <dbl> <int>
 1      0  2010  4593
 2      0  2011 10990
 3      0  2012 27711
 4      0  2013 99989
 5      0  2014 95407
 6      0  2015 89010
 7      0  2016 72289
 8      1  2010   584
 9      1  2011   785
10      1  2012   640
11      1  2013   667
12      1  2014   377
13      1  2015   460
14      1  2016   104

Where the class of one group is signficantly higher than the class of another group. How can I randomly sample the class of 0 without doing anything to the class of 1. That is, I would like to keep all class 1 observations and randomly sample the class 0 observations by 4593 (which is the minimum number of observations for that year)

Using group_by(status, year) and then sample_n() doesn't work since the 4593 value is greater than the values in the class 1 group.

Some random sample of my data:

    structure(list(status = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    year = c(2013, 2014, 2012, 2013, 2016, 2013, 2015, 2014, 
    2013, 2016, 2015, 2016, 2011, 2014, 2016, 2012, 2013, 2012, 
    2014, 2014, 2012, 2012, 2012, 2016, 2016, 2012, 2016, 2015, 
    2013, 2014, 2015, 2013, 2015, 2015, 2014, 2015, 2011, 2014, 
    2013, 2012, 2011, 2016, 2015, 2015, 2015, 2014, 2012, 2013, 
    2015, 2012, 2015, 2016, 2015, 2013, 2014, 2014, 2014, 2013, 
    2013, 2016, 2016, 2013, 2015, 2012, 2014, 2014, 2013, 2015, 
    2014, 2016, 2016, 2014, 2012, 2016, 2013, 2010, 2011, 2014, 
    2016, 2013, 2016, 2014, 2014, 2013, 2013, 2013, 2016, 2016, 
    2012, 2014, 2013, 2015, 2016, 2013, 2013, 2015, 2013, 2014, 
    2013, 2015, 2013, 2013, 2011, 2014, 2016, 2013, 2010, 2012, 
    2014, 2012, 2011, 2011, 2013, 2015, 2014, 2010, 2010, 2013, 
    2010, 2014, 2011, 2011, 2014, 2013, 2014, 2015, 2015, 2013, 
    2014, 2013, 2011, 2013, 2014, 2013, 2011, 2013, 2012, 2015, 
    2012, 2012, 2012, 2010, 2013, 2013, 2011, 2011, 2011, 2012, 
    2016, 2013, 2011, 2011, 2012, 2012, 2014, 2010, 2013, 2014, 
    2011, 2012, 2010, 2012, 2012, 2011, 2015, 2011, 2011, 2013, 
    2015, 2010, 2015, 2011, 2015, 2015, 2012, 2012, 2013, 2012, 
    2014, 2014, 2012, 2012, 2014, 2010, 2011, 2013, 2014, 2012, 
    2013, 2016, 2014, 2012, 2012, 2013, 2010, 2012, 2013, 2014, 
    2014, 2011)), groups = structure(list(status = c(0, 1), .rows = structure(list(
    1:100, 101:200), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr"))), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), row.names = c(NA, -200L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

I think this will work. dat is your example data frame. The code below split the data frame by status , and then use imap to evaluate if sampling is needed. If the name of the list element is "0" , it will conduct sampling. You can change the size = 1 to the minimum number of your real-world data frame.

library(dplyr)
library(purrr)

dat2 <- dat %>%
  split(f = .$status) %>%
  imap(function(x, y){
    if (y %in% "0"){
      x <- x %>% 
        group_by(status, year) %>%
        sample_n(size = 1) 
    }
    return(x)
  }) %>%
  bind_rows()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM