
R dplyr: Bootstrap or random sampling

I have a dataset like this:

   values  Pop1
1  611648  Nafr
2  322513  Nafr
3  381089  Jud
4   16941  Jud
5   21454  Jud
6  658802  Jud

I summarize the values with this command:

df %>% group_by(Pop1) %>% summarize(Mean = mean(x = values))

so that I have the mean for Pop1=Nafr and for Pop1=Jud.

Before summarizing, I would like to randomly sample the same number of rows (50) in each of the two populations (Pop1).

I found the sample_n() function, which is great.

df %>% group_by(Pop1) %>% sample_n(size=50) %>% summarize(Mean = mean(x = values))

But I would like to run it 100 times, creating one big df, and then summarize.

Is there a way to add something to my command above to create a table that contains 100 samplings of 50 rows from df, with a column bs identifying each of the 100 random samplings? Something that looks like this:

       bs   values  Pop1
    1  1   611648  Nafr
    2  1   322513  Nafr
    3  1   381089  Jud
    4  1    16941  Jud
    5  1    21454  Jud
    6  1   658802  Jud
...
    1  100   611648  Nafr
    2  100   322513  Nafr
    3  100   381089  Jud
    4  100    16941  Jud
    5  100    21454  Jud
    6  100   658802  Jud

Then I could run new_df %>% group_by(bs, Pop1) %>% summarize(Mean = mean(x = values)) to get my summary, and also use the table for making plots.

Thanks!

You can use purrr::map_dfr to build a data.frame of the selected samples, bound together by rows, and then use the command you provided to get the summary:

purrr::map_dfr(integer(100), ~ df %>% sample_n(size=50), .id="obs") -> new_df

new_df
#> # A tibble: 5,000 x 3
#>    obs   values Pop1 
#>    <chr>  <int> <fct>
#>  1 1     381089 Jud  
#>  2 1     658802 Jud  
#>  3 1     381089 Jud  
#>  4 1     611648 Nafr 
#>  5 1     381089 Jud  
#>  6 1      21454 Jud  
#>  7 1     611648 Nafr 
#>  8 1     381089 Jud  
#>  9 1      21454 Jud  
#> 10 1     322513 Nafr 
#> # … with 4,990 more rows
new_df %>% group_by(obs, Pop1) %>% summarize(Mean = mean(x = values))
#> `summarise()` regrouping output by 'obs' (override with `.groups` argument)
#> # A tibble: 200 x 3
#> # Groups:   obs [100]
#>    obs   Pop1     Mean
#>    <chr> <fct>   <dbl>
#>  1 1     Jud   261302.
#>  2 1     Nafr  451017.
#>  3 10    Jud   303711.
#>  4 10    Nafr  474689.
#>  5 100   Jud   236533.
#>  6 100   Nafr  492592.
#>  7 11    Jud   279812.
#>  8 11    Nafr  425776.
#>  9 12    Jud   279725.
#> 10 12    Nafr  455960.
#> # … with 190 more rows

Data

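# The six example rows from the question
df <- read.table(text = "values  Pop1
611648  Nafr
322513  Nafr
381089  Jud
16941  Jud
21454  Jud
658802  Jud", header = TRUE)
# Repeat the rows so that there are enough rows to draw 50 from
# (300 rows in total: 100 Nafr and 200 Jud)
df <- tibble(df[rep(1:6, times = 5, each = 10), ])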
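
One thing to note: the map_dfr call above samples 50 rows from the whole data frame, so the two populations do not necessarily contribute the same number of rows to a replicate. If you want 50 rows per population in every replicate, as the question asks, a minimal sketch is to group before sampling. It uses slice_sample(), which supersedes sample_n() from dplyr 1.0.0 onwards; the toy df, the seed and the bs column name are my own assumptions for illustration.

```r
library(dplyr)
library(purrr)

# Toy data (an assumption for this sketch): 100 Nafr rows and 200 Jud rows
df <- tibble(
  values = rep(c(611648, 322513, 381089, 16941, 21454, 658802), each = 50),
  Pop1   = rep(c("Nafr", "Nafr", "Jud", "Jud", "Jud", "Jud"), each = 50)
)

set.seed(1)  # only so that the sketch is reproducible

new_df <- map_dfr(
  1:100,
  ~ df %>%
    group_by(Pop1) %>%
    slice_sample(n = 50) %>%  # 50 rows per population, without replacement
    ungroup(),
  .id = "bs"                  # replicate index, stored as character
) %>%
  mutate(bs = as.integer(bs)) # integer ids sort as 1, 2, ..., 100
```

With this toy data new_df has 10,000 rows (100 replicates x 2 populations x 50 rows), and converting bs to integer avoids the "1, 10, 100, 11" ordering seen in the summary above.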

One way to do this is to work with nested tibbles and map from the purrr package:

library(tidyverse)

df %>% nest(df = everything()) %>%
  slice(rep(1, 100)) %>%
  mutate(bs = 1:100) %>%
  mutate(df_sum = map(df, ~ .x %>%
                    group_by(Pop1) %>%
                    sample_n(size = 50) %>%
                    summarize(Mean = mean(x = values)))) %>%
  unnest(df_sum)

Or, if you just want a way to stack your data 100 times, you can use slice:

df %>% slice(rep(1:n(), 100)) 
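
The stacked copies are identical and carry no replicate id, so on their own they are not resamples yet. As a rough sketch (the toy df, n_rep and the bs column name are assumptions for illustration), you could label each copy and then sample within each replicate and population:

```r
library(dplyr)

# Toy data (an assumption for this sketch): the six rows from the question
df <- tibble(
  values = c(611648, 322513, 381089, 16941, 21454, 658802),
  Pop1   = c("Nafr", "Nafr", "Jud", "Jud", "Jud", "Jud")
)

n_rep <- 100  # number of replicates

# rep(1:n(), n_rep) repeats the whole row sequence, so copy k occupies
# rows (k - 1) * nrow(df) + 1 to k * nrow(df)
stacked <- df %>%
  slice(rep(1:n(), n_rep)) %>%
  mutate(bs = rep(seq_len(n_rep), each = nrow(df)))

set.seed(1)  # only so that the sketch is reproducible

# Each (bs, Pop1) group is smaller than 50 here, so sample with replacement
resampled <- stacked %>%
  group_by(bs, Pop1) %>%
  slice_sample(n = 50, replace = TRUE) %>%
  ungroup()
```

Sampling with replacement is needed here only because the toy groups have 2 and 4 rows; with at least 50 rows per population you could drop replace = TRUE.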

Try this:

library(tidyr)
df %>% expand(bs = 1:100, nesting(values, Pop1)) 

Output

# A tibble: 600 x 3
      bs values Pop1 
   <int>  <dbl> <chr>
 1     1  16941 Jud  
 2     1  21454 Jud  
 3     1 322513 Nafr 
 4     1 381089 Jud  
 5     1 611648 Nafr 
 6     1 658802 Jud  
 7     2  16941 Jud  
 8     2  21454 Jud  
 9     2 322513 Nafr 
10     2 381089 Jud  
# ... with 590 more rows

You can then continue your pipeline like this:

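df %>% 
  expand(bs = 1:100, nesting(values, Pop1)) %>% 
  group_by(bs, Pop1) %>% 
  # expand() keeps only the distinct (values, Pop1) rows, so in the output
  # above each group has fewer than 50 rows and must be sampled with replacement
  sample_n(size = 50, replace = TRUE) %>%
  summarize(Mean = mean(x = values))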

Here is a version that uses a for loop to do the sampling 100 times.

df2 <- data.frame(values = numeric(), Pop1 = character(), bs = integer())
for(i in 1:100){
  df2 <- df2 %>%
    bind_rows(df %>% 
                group_by(Pop1) %>%
                sample_n(size = 50, replace = TRUE) %>%
                mutate(bs = i) %>% 
                ungroup())
}
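
The loop above grows df2 on every iteration, which copies the accumulated rows each time. If that becomes slow, one possible variant (a sketch, not a benchmark; the toy df, the seed and the names samples and bs_means are my own assumptions) is to collect the samples in a list and bind them once at the end:

```r
library(dplyr)

# Toy data (an assumption for this sketch): the six rows from the question
df <- tibble(
  values = c(611648, 322513, 381089, 16941, 21454, 658802),
  Pop1   = c("Nafr", "Nafr", "Jud", "Jud", "Jud", "Jud")
)

set.seed(42)  # only so that the sketch is reproducible

# One data frame per replicate, each tagged with its replicate number
samples <- lapply(1:100, function(i) {
  df %>%
    group_by(Pop1) %>%
    slice_sample(n = 50, replace = TRUE) %>%  # bootstrap: with replacement
    ungroup() %>%
    mutate(bs = i)
})

# Bind all replicates in a single step
df2 <- bind_rows(samples)

# Per-replicate, per-population means
bs_means <- df2 %>%
  group_by(bs, Pop1) %>%
  summarize(Mean = mean(values), .groups = "drop")
```

Here df2 holds the raw samples for plotting and bs_means has one row per replicate and population.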

Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you reproduce them, please cite this site's URL or the original address.

 