
Sampling data frames maintaining all levels of factor variables

I need to sample a data frame while keeping all levels of its factor variables in the result. I then want to get the complement of this sample, i.e., the rows that aren't part of the sample. My end goal is to create both a training and a test sample for regression analyses. To do that successfully, I need to ensure that all levels of the factor variables are represented in the training sample.

The approach I've tried (sample code below) uses dplyr::group_by combined with dplyr::slice_sample, and then dplyr::anti_join to obtain the test sample. It's not working, for some reason. Either I'm missing something about how these functions are supposed to work, or they're not behaving as expected.

I've also tried approaches based on this question. They didn't work because (1) I need to guarantee that all levels of multiple factors are represented and (2) I want to select a proportion of the observations, not a specific number.

Sample Code

> library(tidyverse) 
> 
> set.seed(72)
> 
> data <- tibble(y = rnorm(100), x1 = rnorm(100), 
+   x2 = sample(letters, 100, T), x3 = sample(LETTERS, 100, T))
> data
# A tibble: 100 x 4
         y     x1 x2    x3   
     <dbl>  <dbl> <chr> <chr>
 1  1.37   -0.737 c     C    
 2  1.16    1.66  c     T    
 3  0.0344 -0.319 q     P    
 4  1.03   -0.963 k     C    
 5  0.636   0.961 i     H    
 6  0.319   0.761 g     L    
 7  0.216   0.860 u     M    
 8  1.31    0.887 g     M    
 9 -0.594   2.70  m     I    
10 -0.542   0.517 u     C    
# … with 90 more rows
> 
> train_data <- data %>%
+   group_by(x2, x3) %>%
+   slice_sample(prop = .7)
> train_data # clearly this is not what I want 
# A tibble: 8 x 4
# Groups:   x2, x3 [8]
       y     x1 x2    x3   
   <dbl>  <dbl> <chr> <chr>
1  1.23  -0.297 c     A    
2  1.11   0.689 e     O    
3  0.559  0.353 e     Z    
4 -1.65  -1.71  l     M    
5 -0.777  1.31  l     X    
6  0.784  0.309 s     E    
7  0.755 -0.362 u     X    
8 -0.768  0.292 v     H    
> 
> test_data <- data %>%
+   anti_join(train_data)
Joining, by = c("y", "x1", "x2", "x3")
> test_data # my goal was that the training data would have 70%  and the test data would have around 30% of the full sample.
# A tibble: 92 x 4
         y     x1 x2    x3   
     <dbl>  <dbl> <chr> <chr>
 1  1.37   -0.737 c     C    
 2  1.16    1.66  c     T    
 3  0.0344 -0.319 q     P    
 4  1.03   -0.963 k     C    
 5  0.636   0.961 i     H    
 6  0.319   0.761 g     L    
 7  0.216   0.860 u     M    
 8  1.31    0.887 g     M    
 9 -0.594   2.70  m     I    
10 -0.542   0.517 u     C    
# … with 82 more rows
> 
> reg <- lm(y ~ x1 + x2 + x3, train_data)
> predict(reg, newdata = test_data) # I obviously still have the same problem
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor x2 has new levels a, b, d, f, g, h, i, j, k, m, n, o, p, q, r, t, w, x, y, z
> 
> 

I had to extend your data to 10,000 rows to get a reasonable number of observations per combination of the categorical variables. Then I used nest_by() from dplyr (version 1.0.1) and sampled each subset.

library(dplyr)    
set.seed(72)
data <- tibble(y = rnorm(10000), x1 = rnorm(10000), 
               x2 = sample(letters, 10000, T), x3 = sample(LETTERS, 10000, T)) 
train <- data %>% 
    nest_by(x2, x3, .key = "xy") %>% 
    mutate(sample = list(xy[sample(1:nrow(xy), 
                                   size = round(0.7*nrow(xy))),])) %>%
    select(-xy) %>%
    summarize(sample)
train
# A tibble: 6,975 x 4
# Groups:   x2, x3 [676]
   x2    x3         y      x1
   <chr> <chr>  <dbl>   <dbl>
 1 a     A     -0.539 -1.22  
 2 a     A     -0.664  0.453 
 3 a     A     -1.32  -0.831 
 4 a     A      0.765  0.258 
 5 a     A     -0.462  0.764 
 6 a     A      1.86  -0.0400
 7 a     A     -1.15   1.02  
 8 a     A      0.244 -0.823 
 9 a     A     -0.277 -0.744 
10 a     A      0.221 -0.292 
# ... with 6,965 more rows
test <- data%>%
    anti_join(train)
test
# A tibble: 3,025 x 4
       y     x1 x2    x3   
    <dbl>  <dbl> <chr> <chr>
 1  0.636  1.71  b     P    
 2  0.319 -0.851 b     K    
 3  1.31  -1.61  r     A    
 4 -1.03   0.436 a     B    
 5 -0.672 -1.43  g     O    
 6 -1.42  -0.637 l     L    
 7  0.879 -1.78  t     G    
 8  0.935 -1.44  g     C    
 9 -2.21  -0.842 v     F    
10 -1.00  -2.40  i     D    
# ... with 3,015 more rows

With these samples, I can run your lm() and predict() without error.

Here is a slightly different way to make train if you have an older version of dplyr.

library(dplyr)
library(tidyr)
library(purrr)
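train <- data %>%
  # group by the two categorical variables, then nest the remaining columns
  group_by(x2, x3) %>%
  nest() %>%
  # draw 70% of the rows (rounded) from each nested subset
  mutate(sample = map(data, function(df) {
    df[sample(seq_len(nrow(df)), round(0.7 * nrow(df))), ]
  })) %>%
  select(-data) %>%
  unnest(sample)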

There is nothing wrong with your code or approach; you simply do not have enough observations. Many groups contain only 1 row, and sampling such a group with a proportion of 0.7 rounds the sample size down to 0. If you change the data to 1000 rows, the same code runs without error.

library(dplyr)
data <- tibble(y = rnorm(1000), x1 = rnorm(1000), 
                  x2 = sample(letters, 1000, T), x3 = sample(LETTERS, 1000, T))
train_data <- data %>%
  group_by(x2, x3) %>%
  slice_sample(prop = 0.7) 

test_data <- data %>%  anti_join(train_data)

reg <- lm(y ~ x1 + x2 + x3, train_data)
predict(reg, newdata = test_data)
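
To see the group-size problem described above on the original 100-row data, one option is to count the rows in each x2/x3 combination and work out how many rows a 0.7 proportion would keep. This is only a rough sketch: the column names group_n and kept are made up for illustration, and floor() is an assumption that mirrors the "rounds it down" behaviour described in this answer.

```r
library(dplyr)

set.seed(72)
data <- tibble(
  y  = rnorm(100),
  x1 = rnorm(100),
  x2 = sample(letters, 100, replace = TRUE),
  x3 = sample(LETTERS, 100, replace = TRUE)
)

# One row per x2/x3 combination, with its size and the number of rows
# a 0.7 proportion would keep (assuming the size is rounded down).
group_sizes <- data %>%
  count(x2, x3, name = "group_n") %>%
  mutate(kept = floor(0.7 * group_n))

# How many groups there are of each size, and how many rows each keeps
group_sizes %>% count(group_n, kept)

# Total number of rows that would end up in the training sample
sum(group_sizes$kept)
```

Groups with group_n equal to 1 have kept equal to 0, so they contribute nothing to the training sample.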

If your real data contains groups with as few as 1 row, you can sample each group so that it selects the larger of 1 and 0.7 * (number of rows in the group).

train_data <- data %>% group_by(x2, x3) %>% sample_n(max(0.7*n(), 1))

(I used sample_n here because I couldn't use n() inside slice_sample.)
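
A possible variation on the same idea, offered as a sketch and not as part of the original answer: sample inside each group with filter() so that every group keeps at least one row, and use a helper row_id column (a name invented here) so that anti_join matches on row identity instead of on all column values. Using round() for the per-group size is also an assumption; floor() would work the same way.

```r
library(dplyr)

set.seed(72)
data <- tibble(
  y  = rnorm(1000),
  x1 = rnorm(1000),
  x2 = sample(letters, 1000, replace = TRUE),
  x3 = sample(LETTERS, 1000, replace = TRUE)
) %>%
  mutate(row_id = row_number())  # hypothetical helper column

# Keep roughly 70% of each x2/x3 group, but never fewer than 1 row.
train_data <- data %>%
  group_by(x2, x3) %>%
  filter(row_number() %in% sample(n(), max(1, round(0.7 * n())))) %>%
  ungroup()

# The complement: every row whose row_id is not in the training sample.
test_data <- anti_join(data, train_data, by = "row_id")

# Every level seen in the test sample should also be in the training sample.
all(test_data$x2 %in% train_data$x2)
all(test_data$x3 %in% train_data$x3)

reg  <- lm(y ~ x1 + x2 + x3, data = train_data)
pred <- predict(reg, newdata = test_data)
```

Because every group contributes at least one row to the training sample, groups with a single row go entirely to training, so the training share ends up somewhat above 70% when many groups are small.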
