简体   繁体   English

R中的有效随机抽样

[英]Efficient random sampling in R

From a data frame, I am trying randomly sample 1:20 observations where for each number of observation I would like to replicate the process 4 times. 从数据框架,我正在尝试随机抽样1:20观察,对于每个观察次数我想要复制过程4次。 I came up with this working solution, but it is very slow since it is involving coping many times a large data frame because of the crossing() function. 我提出了这个有效的解决方案,但由于crossing()函数涉及多次处理大数据帧,所以它非常慢。 Anyone can point me toward a more efficient solution? 任何人都可以指出我更有效的解决方案?

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 
  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {

    data %>%
      sample_n(n, replace = TRUE) %>%
      summarise(mean_mpg = mean(mpg)) %>%
      pull(mean_mpg)

  }))
#> # A tibble: 240 x 5
#>      cyl data              n_random_sample n_replicate   res
#>    <dbl> <list>                      <int>       <int> <dbl>
#>  1     6 <tibble [7 × 10]>               1           1  17.8
#>  2     6 <tibble [7 × 10]>               1           2  21  
#>  3     6 <tibble [7 × 10]>               1           3  19.2
#>  4     6 <tibble [7 × 10]>               1           4  18.1
#>  5     6 <tibble [7 × 10]>               2           1  19.6
#>  6     6 <tibble [7 × 10]>               2           2  19.4
#>  7     6 <tibble [7 × 10]>               2           3  19.6
#>  8     6 <tibble [7 × 10]>               2           4  20.4
#>  9     6 <tibble [7 × 10]>               3           1  20.1
#> 10     6 <tibble [7 × 10]>               3           2  18.9
#> # ... with 230 more rows

Created on 2018-11-19 by the reprex package (v0.2.1) reprex包创建于2018-11-19(v0.2.1)

EDIT: I am now working with a much larger dataset. 编辑:我现在正在使用更大的数据集。 Would it be possible to do it more efficiently with data.table? 是否可以使用data.table更有效地完成它?

This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map : 这是一种替代解决方案,它使您的原始数据集子集并使用函数选择行样本,而不是使用nest来创建子数据集并将它们存储为列表变量,然后使用map选择样本:

library(tidyverse)

# create function to sample rows
f = function(c, n) {
  mtcars %>%
    filter(cyl == c) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
}

# vectorise function
f = Vectorize(f)

# set seed for reproducibility
set.seed(11)

tbl_df(mtcars) %>%
  distinct(cyl) %>%
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
  mutate(res = f(cyl, n_random_sample))

# # A tibble: 240 x 4
#     cyl n_random_sample n_replicate   res
#   <dbl>           <int>       <int> <dbl>
# 1     6               1           1  21  
# 2     6               1           2  21  
# 3     6               1           3  18.1
# 4     6               1           4  21  
# 5     6               2           1  20.4
# 6     6               2           2  21.2
# 7     6               2           3  20.4
# 8     6               2           4  19.6
# 9     6               3           1  18.4
#10     6               3           2  19.6
# # ... with 230 more rows
mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times. 这将为您提供nrows = 1:20的表格列表,每次4次。

You can follow up with this to name the elements of the list: 您可以按照此操作来命名列表的元素:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result: 结果:

head(mm,5)
$`sample.1-1`
              mpg cyl disp  hp drat    wt qsec vs am gear carb
Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2

$`sample.2-1`
              mpg cyl disp  hp drat   wt qsec vs am gear carb
Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

$`sample.3-1`
             mpg cyl disp hp drat    wt  qsec vs am gear carb
Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2

$`sample.4-1`
               mpg cyl  disp hp drat    wt  qsec vs am gear carb
Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1

$`sample.1-2`
              mpg cyl disp  hp drat   wt qsec vs am gear carb
Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM