How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like this:

library(dplyr)

df <- tibble(
  id = 1:18,
  class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
  var_a = rep(c("a", "b"), 9)
)

# A tibble: 18 x 3
      id class var_a
   <int> <dbl> <chr>
 1     1     1 a
 2     2     1 b
 3     3     1 a
 4     4     2 b
 5     5     2 a
 6     6     3 b
 7     7     1 a
 8     8     1 b
 9     9     1 a
10    10     2 b
11    11     2 a
12    12     3 b
13    13     1 a
14    14     1 b
15    15     1 a
16    16     2 b
17    17     2 a
18    18     3 b

The dataset contains observations from several classes, and the classes are not balanced. In the sample above, only 3 observations belong to class 3, while class 2 has 6 observations and class 1 has 9.
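As a quick check of the imbalance described above, the class sizes can be tabulated with `count()` from `dplyr`. This is only a sketch; the name `class_sizes` is introduced here for illustration and is not part of the original question.

```r
library(dplyr)

# Same data as in the question
df <- tibble(
  id = 1:18,
  class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
  var_a = rep(c("a", "b"), 9)
)

# One row per class, with the number of observations in column `n`
class_sizes <- df %>% count(class)
class_sizes

# Size of the smallest class (3 for this data)
min(class_sizes$n)
```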

Now I want to automatically balance the dataset so that all classes are the same size. That is, I want a dataset of 9 rows, with 3 rows in each class. I can use the `sample_n` function from `dplyr` to do this kind of sampling.

I managed to do this by first calculating the smallest class size:

min_length <- as.numeric(df %>% 
  group_by(class) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  summarise(min = min(n)))

and then applying the `sample_n` function:

set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)

# A tibble: 9 x 3
# Groups:   class [3]
     id class var_a
  <int> <dbl> <chr>
1    15     1 a
2     7     1 a
3    13     1 a
4     4     2 b
5     5     2 a
6    17     2 a
7    18     3 b
8     6     3 b
9    12     3 b

Is it possible to do both steps (calculating the smallest class size and then sampling) in one go?

You can do it in one step, but it is cheating a little:

set.seed(42)
df %>%
  group_by(class) %>%
  sample_n(min(table(df$class))) %>%
  ungroup()
# # A tibble: 9 x 3
#      id class var_a
#   <int> <dbl> <chr>
# 1     1     1 a    
# 2     8     1 b    
# 3    15     1 a    
# 4     4     2 b    
# 5     5     2 a    
# 6    11     2 a    
# 7    12     3 b    
# 8    18     3 b    
# 9     6     3 b    

I say "cheating" because normally you would not want to reference `df$` from within the pipe. However, the property we're looking for belongs to the whole frame, while inside a grouped pipeline the `table` function only sees one group at a time, so we need to side-step that a little.
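To illustrate the point about grouping, here is a small sketch of what happens if `min(table(class))` is evaluated after `group_by()`: each group only sees its own rows, so the "minimum" is just that group's own size. The name `per_group` is introduced here for illustration only.

```r
library(dplyr)

df <- tibble(
  id = 1:18,
  class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
  var_a = rep(c("a", "b"), 9)
)

# Evaluated per group: table(class) contains a single entry,
# so min() returns the size of the current group, not the global minimum
per_group <- df %>%
  group_by(class) %>%
  mutate(mn = min(table(class))) %>%
  distinct(class, mn) %>%
  ungroup()

per_group  # mn is 9, 6 and 3 for classes 1, 2 and 3
```

This is why the minimum has to be computed either on the ungrouped data or outside the pipe.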

Alternatively, one could compute the minimum before grouping:

df %>%
  mutate(mn = min(table(class))) %>%
  group_by(class) %>%
  sample_n(mn[1]) %>%
  ungroup()
# # A tibble: 9 x 4
#      id class var_a    mn
#   <int> <dbl> <chr> <int>
# 1    14     1 b         3
# 2    13     1 a         3
# 3     7     1 a         3
# 4     4     2 b         3
# 5    16     2 b         3
# 6     5     2 a         3
# 7    12     3 b         3
# 8    18     3 b         3
# 9     6     3 b         3

Though I don't think that is any more elegant or readable.
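If this is needed more than once, the two steps could also be wrapped in a small helper. The sketch below is one possible way to do that, not something from the original answer: `balance_classes` and its arguments are hypothetical names, and it uses `slice_sample()`, which supersedes `sample_n()` as of dplyr 1.0.0.

```r
library(dplyr)

# Hypothetical helper: downsample every group to the size of the smallest one.
# `group_col` is the name of the grouping column, given as a string.
balance_classes <- function(data, group_col) {
  # Smallest class size, computed on the whole (ungrouped) data
  n_min <- min(table(data[[group_col]]))

  data %>%
    group_by(across(all_of(group_col))) %>%
    slice_sample(n = n_min) %>%
    ungroup()
}

df <- tibble(
  id = 1:18,
  class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
  var_a = rep(c("a", "b"), 9)
)

set.seed(42)
balanced <- balance_classes(df, "class")
table(balanced$class)
```

Computing the minimum inside the function keeps the `df$` reference out of the pipe while still using the whole frame.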
