Sample_n to Get a Maximum Number From Each Group

Question

Using the very simple example data below, my goal is to sample all 3 rows of group A and only 5 of the 7 rows of group B.

 id   group
  1       A
  2       A
  3       A
  4       B
  5       B
  6       B
  7       B
  8       B
  9       B
 10       B

ex_df <- data.frame(id = 1:10, group = c(rep("A", 3), rep("B", 7)))

Normally this would just be a case of using sample_n from dplyr, with code along the lines of

sel_5 <- ex_df %>%
   group_by(group) %>%
   sample_n(5)

Except this gives the error (for obvious reasons)

Error: size must be less or equal than 2 (size of data), set replace = TRUE to use sampling with replacement

but sampling with replacement isn't an option. Is there any way that I might be able to set the sample_n size to be the minimum of 5 or the size of the group?

Or maybe another function that I'm unaware of that would be capable of this?

Answer 1

I've had the same problem, and here's what I did.

library(dplyr)

# Split the original data frame into a list of data frames, one per group
split_up <- split(ex_df, f = ex_df$group)

# In each data frame, sample 5 rows, or all rows if the group has fewer than 5
sel_5 <- lapply(split_up, function(x) sample_n(x, min(nrow(x), 5)))

# Bind the sampled pieces back into a single data frame
sel_5 <- do.call("rbind", sel_5)
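
For comparison, here is a minimal sketch of the same "5 rows or the whole group" idea without the split/rbind step. It is not part of the approach above and assumes dplyr 1.0.0 or later, where slice_sample() is available; as I understand its documentation, when n is larger than a group and sampling is without replacement, the result is truncated to the group size rather than raising an error.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, which provides slice_sample()

# Same example data as in the question
ex_df <- data.frame(id = 1:10, group = c(rep("A", 3), rep("B", 7)))

# Sample up to 5 rows per group without replacement;
# a group with fewer than 5 rows is returned whole
sel_5 <- ex_df %>%
  group_by(group) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Number of sampled rows per group
count(sel_5, group)
```

The rows picked from group B vary from run to run; calling set.seed() beforehand makes the sample reproducible.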

solution1
1 2020-01-24 17:41:39
