Randomly sample groups

Question

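Given a data frame df with a column named group, how do you randomly sample k groups from it in dplyr? It should return all rows from the k sampled groups (assuming there are at least k unique values in df$group), and every group in df should be equally likely to be returned.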

Answer 1

Just use sample() to pick some groups:

iris %>% filter(Species %in% sample(levels(Species),2))
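
The one-liner above relies on Species being a factor: levels() returns NULL for a character column, and it also lists factor levels that no longer occur in the data, so fewer than k groups could come back. For the df / group / k setup in the question, a minimal sketch might look like the following. The helper name sample_k_groups and the toy data are made up for illustration; only dplyr's filter() is assumed.

```r
library(dplyr)

# Hypothetical helper: return all rows of k randomly chosen groups.
sample_k_groups <- function(df, k) {
  # unique() works for character, numeric and factor columns,
  # and only sees groups that actually occur in the data.
  groups <- unique(df$group)
  # Sample positions rather than values, so a single numeric
  # group id is not misread by sample() as "1:n".
  chosen <- groups[sample.int(length(groups), k)]
  df %>% filter(group %in% chosen)
}

# Toy data (illustrative only): four groups of different sizes.
df <- data.frame(
  group = rep(c("a", "b", "c", "d"), times = c(2, 3, 1, 4)),
  value = seq_len(10)
)

set.seed(42)
res <- sample_k_groups(df, k = 2)
res
```

Because the k groups are drawn uniformly without replacement from the distinct group values, every group has the same chance of being selected regardless of how many rows it has.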

Answer 2

If you are using dplyr, I think this approach makes the most sense:

iris_grouped <- iris %>% 
  group_by(Species) %>% 
  nest()

which yields:

# A tibble: 3 x 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>

Then you can use sample_n:

iris_grouped %>%
  sample_n(2)

# A tibble: 2 x 2
  Species    data             
  <fct>      <list>           
1 virginica  <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
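
With more recent tidyr versions, group_by() followed by nest() may leave the result grouped by Species, in which case sample_n() samples within each one-row group instead of across groups. A hedged sketch of the same idea in the newer syntax, assuming tidyr 1.0 or later and dplyr 1.0 or later, nests without grouping, samples rows of the nested table, and then unnests to recover all rows of the chosen groups:

```r
library(dplyr)
library(tidyr)

set.seed(1)
sampled <- iris %>%
  nest(data = -Species) %>%   # one row per Species, not grouped
  slice_sample(n = 2) %>%     # pick 2 of those rows at random
  unnest(cols = data)         # expand back to the original rows

sampled
```

The result contains every row of the two sampled species, with Species as the first column.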

Answer 3

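Note that dplyr is considerably slower than plain data frame operations here (roughly 7 times slower at the median in the benchmark below):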

library(microbenchmark)
microbenchmark(dplyr= iris %>% filter(Species %in% sample(levels(Species),2)),
               base= iris[iris[["Species"]] %in% sample(levels(iris[["Species"]]), 2),])

Unit: microseconds
  expr     min      lq     mean  median       uq      max neval cld
 dplyr 660.287 710.655 753.6704 722.629 771.2860 1122.527   100   b
  base  83.629  95.032 110.0936 106.057 119.1715  199.949   100  a

Note that [[ is known to be faster than $, although both work.

Answer 4

I really like the approach Tristan Mahr describes here. I copied his function from the blog for the example below:

library(tidyverse)

sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  
  group_ids <- data %>% 
    group_by(!!! dots) %>% 
    group_indices()
  
  sampled_groups <- sample(unique(group_ids), size)
  
  data %>% 
    filter(group_ids %in% sampled_groups)
}

set.seed(1234)
mpg %>% 
  sample_n_of(size = 2, model)
#> # A tibble: 12 x 11
#>    manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class 
#>    <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr> 
#>  1 audi         a6 qua~   2.8  1999     6 auto(l~ 4        15    24 p     midsi~
#>  2 audi         a6 qua~   3.1  2008     6 auto(s~ 4        17    25 p     midsi~
#>  3 audi         a6 qua~   4.2  2008     8 auto(s~ 4        16    23 p     midsi~
#>  4 ford         mustang   3.8  1999     6 manual~ r        18    26 r     subco~
#>  5 ford         mustang   3.8  1999     6 auto(l~ r        18    25 r     subco~
#>  6 ford         mustang   4    2008     6 manual~ r        17    26 r     subco~
#>  7 ford         mustang   4    2008     6 auto(l~ r        16    24 r     subco~
#>  8 ford         mustang   4.6  1999     8 auto(l~ r        15    21 r     subco~
#>  9 ford         mustang   4.6  1999     8 manual~ r        15    22 r     subco~
#> 10 ford         mustang   4.6  2008     8 manual~ r        15    23 r     subco~
#> 11 ford         mustang   4.6  2008     8 auto(l~ r        15    22 r     subco~
#> 12 ford         mustang   5.4  2008     8 manual~ r        14    20 p     subco~

Created on 2021-03-24 by the reprex package (v0.3.0)

Answer 5

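I also had problems with Oscar's code that uses nesting. It worked once I updated it to the latest syntax for nest(), unnest(), and slice_sample().

Below is an alternative version that gives the same answer if the input data frame is arranged by the grouping variable. Otherwise, the answer will be just as good on average. This version has a couple of advantages over the nested version: (1) the final data frame keeps its columns in the original order, whereas the nested version puts the grouping variable first; (2) the intermediate results are easier to read when debugging, because they are plain old lists.

I was interested in sampling the original number of groups with replacement, as in a cluster bootstrap. More arguments could easily be added to make the function more general.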

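library(dplyr)   # group_by(), group_split(), bind_rows()
library(tibble)  # rownames_to_column(), used in the example run

# function to compute a clustered bootstrap sample
samplebygroups <- function(df, groupvar) {
  # split the data frame into a list with one data frame per group
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  # draw n group indices with replacement
  samplegroups <- sample(n, replace = TRUE)
  # stack the sampled groups back into a single data frame
  datalist[samplegroups] %>%
    bind_rows()
}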

Here is an example run:

smallcars <- mtcars %>%  
  rownames_to_column(var = "Model") %>% 
  tail(5) %>%
  arrange(cyl) %>%
  select(Model, cyl, mpg)

set.seed(1000)
samplebygroups(smallcars, cyl)

with output:

# A tibble: 5 x 3
  Model            cyl   mpg
  <chr>          <dbl> <dbl>
1 Ford Pantera L     8  15.8
2 Maserati Bora      8  15  
3 Ferrari Dino       6  19.7
4 Ford Pantera L     8  15.8
5 Maserati Bora      8  15

With Oscar's code you would get exactly the same rows, but cyl would be the first column.
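
As one possible way to generalize this function, the sketch below adds two arguments so that it can also draw k groups without replacement, as the question asks. The function name sample_groups and its size and replace arguments are hypothetical additions for illustration, not part of the original answer.

```r
library(dplyr)

# Hypothetical generalization: sample `size` groups, with or without
# replacement. size = NULL means "as many groups as the data contain".
sample_groups <- function(df, groupvar, size = NULL, replace = FALSE) {
  # one data frame per group, in a list
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  if (is.null(size)) size <- n
  # choose group positions uniformly at random
  picked <- sample.int(n, size = size, replace = replace)
  # stack the chosen groups back into one data frame
  bind_rows(datalist[picked])
}

set.seed(1000)
# k = 2 groups without replacement
res <- sample_groups(mtcars, cyl, size = 2)
# cluster bootstrap: all groups, with replacement
boot <- sample_groups(mtcars, cyl, replace = TRUE)
```

With replace = TRUE and the default size, this reduces to the cluster bootstrap above; with replace = FALSE it returns all rows of size distinct groups, each group being equally likely to be chosen.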

5 answers

Answer 1
20, accepted, 2016-05-10 22:04:43

Answer 2
8, 2018-10-25 03:30:50

Answer 3
2, 2016-05-10 23:01:20

Answer 4
1, 2021-03-25 01:56:29

Answer 5
0, 2021-11-20 04:13:58
