R: Representative random sampling of 150 values from categories with different group sizes

I want to draw 150 random samples from a dataset based on two categories, "site" and "species". Ideally, the result is 30 samples per site, with each species more or less equally represented.

Reproducible example:

df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", each = 3), rep("s3", each = 16),
              rep("s4", each = 13), rep("s5", each = 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)

I think dplyr's group_by(site, species) followed by slice_sample() is a good approach, but it samples a fixed number per group rather than 150 in total. Another problem is that slice_sample() needs at least n rows in each group to work, which is not always the case here. So, is it possible to sample 150 in total and, whenever a group has fewer rows than the desired number, sample from other groups to compensate?


One option is to use nest_by(site) and then slice_sample() to draw 30 rows from each site. Because each site has only 10 rows, this requires replace = TRUE. If needed, tidyr::unnest() turns the result back into one "normal" data.frame containing all drawn samples.

The problem is probably the condition that:

where each species is more or less equally distributed

When we look at your sites, we can see that most of them contain only one species. So drawing samples from your original data will lead to some sites containing only a single species. Alternatively, we could sample species and assign a site at random, ignoring the fact that the species may never have been observed there.
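To check the claim about species per site, a quick tabulation helps. This is only a sketch: `df` is the example data from the question, and `species_per_site` and `n_species` are names made up for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", each = 3), rep("s3", each = 16),
              rep("s4", each = 13), rep("s5", each = 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)

# Rows available for every site/species combination that actually occurs
df %>% count(site, species)

# Number of distinct species per site (hypothetical helper table)
species_per_site <- df %>%
  group_by(site) %>%
  summarise(n_species = n_distinct(species))
species_per_site
```

With this data, sites B, C and E each contain a single species, while A has three and D has two, so an even species split within every site is not possible without inventing observations.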


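library(dplyr)

# Draw 30 rows per site; replace = TRUE because each site has only 10 rows
site_sample <- df %>% 
  nest_by(site) %>% 
  summarise(data = list(slice_sample(data, n = 30, replace = TRUE)))
#> `summarise()` has grouped output by 'site'. You can override using the `.groups`
#> argument.

site_sample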
#> # A tibble: 5 x 2
#> # Groups:   site [5]
#>   site  data             
#>   <chr> <list>           
#> 1 A     <tibble [30 x 2]>
#> 2 B     <tibble [30 x 2]>
#> 3 C     <tibble [30 x 2]>
#> 4 D     <tibble [30 x 2]>
#> 5 E     <tibble [30 x 2]>

site_sample %>% 
  tidyr::unnest(data)
#> # A tibble: 150 x 3
#> # Groups:   site [5]
#>    site  species individual
#>    <chr> <chr>        <dbl>
#>  1 A     s1               1
#>  2 A     s3               1
#>  3 A     s1               1
#>  4 A     s3               5
#>  5 A     s3               3
#>  6 A     s3               4
#>  7 A     s2               2
#>  8 A     s3               3
#>  9 A     s3               5
#> 10 A     s3               2
#> # ... with 140 more rows
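
If the species within a site should come out roughly balanced wherever a site has more than one species, one possible variation (not part of the approach above) is to weight each row by the inverse of its site/species group size via the weight_by argument of slice_sample(). This is a sketch under assumptions: the column names `n_in_group` and `w`, the object `balanced_sample` and the seed are all chosen for illustration, and sampling is still done with replacement.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", each = 3), rep("s3", each = 16),
              rep("s4", each = 13), rep("s5", each = 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)

set.seed(1)  # arbitrary seed, only for reproducibility

balanced_sample <- df %>%
  # size of each site/species group
  add_count(site, species, name = "n_in_group") %>%
  # inverse-size weight: every species in a site gets the same total weight
  mutate(w = 1 / n_in_group) %>%
  group_by(site) %>%
  # 30 draws per site, with replacement, using the weights
  slice_sample(n = 30, weight_by = w, replace = TRUE) %>%
  ungroup()

nrow(balanced_sample)                   # 5 sites x 30 draws
count(balanced_sample, site, species)   # species split per site
```

Sites that contain only one species still yield only that species. The weights merely even out the expected split in sites such as A and D.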


