How can I randomly sample a data frame (sample_n) after group_by, calculate summary statistics, and iterate this 999 times?

Question

I want to resample my data frame (test_df) and calculate summary statistics (mean and standard deviation) of a numeric response variable (sp_rich), after grouping the data by two categorical factors (plant_sp = plant species, and site). I would then like this process to be iterated, say 999 times. Additionally, I would like to resample the data frame using multiple sample sizes, calculating the above statistics and performing the iteration for each size.

Ideally this would be done in a dplyr/tidy framework, as I am more familiar with that style, but I am open to base R or other options.

Here is an example data frame:

test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
                                       "plant_1", "plant_1", "plant_1", "plant_1", "plant_1", 
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2",
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2"), 
                          site = c("a", "a", "a", "a", "a",  
                                   "b", "b", "b", "b", "b",  
                                   "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b"),
                          sp_rich = c(5, 3, 5, 3, 5, 
                                      7, 8, 8, 8, 10,
                                      1, 4, 5, 6, 3, 
                                      7, 3, 12, 12,11)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "site", "sp_rich"))

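# I can calculate the summary statistics for one iteration,
# and for one sample size at a time:

library(dplyr)

mean_calc <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 3)) %>%
  # n() is evaluated inside summarise() so it counts the sampled rows per
  # group; in a later mutate() it would count the summary rows instead
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich),
            sample_size = n())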

> mean_calc
# A tibble: 4 x 5
# Groups:   plant_sp [2]
  plant_sp site   mean    sd sample_size
  <fct>    <fct> <dbl> <dbl>       <dbl>
1 A        GHT    7    2               3
2 A        PE     3.33 0.577           3
3 B        GHT    3.33 1.53            3
4 B        PE     1.67 0.577           3

# I can also perform the calculations manually for
# each sample size, and put the data together (hack):

# Do this manually for two different sample sizes
mean_calc_3 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 3)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd((sp_rich))) %>%
  mutate(sample_size = 3)
mean_calc_3

mean_calc_4 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 4)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd((sp_rich))) %>%
  mutate(sample_size = 4)
mean_calc_4

mean_calc <- bind_rows(mean_calc_3, mean_calc_4) 
mean_calc <- mean_calc %>%
    group_by(plant_sp, site, sample_size) %>%
    arrange(sample_size, plant_sp, site)

# A tibble: 8 x 5
# Groups:   plant_sp, site, sample_size [8]
  plant_sp site   mean    sd sample_size
  <fct>    <fct> <dbl> <dbl>       <dbl>
1 A        GHT    5.67  1.53           3
2 A        PE     4.33  1.53           3
3 B        GHT    3.67  1.15           3
4 B        PE     2     1              3
5 A        GHT    6.5   2.08           4
6 A        PE     4.25  1.26           4
7 B        GHT    2.75  0.5            4
8 B        PE     2.25  0.5            4

I would really like to automate these calculations across multiple sample sizes (e.g. n = 3 and n = 4 in this example; the real data would have roughly 5-10 different size classes), and then iterate this entire process 999 times.

The structure of the mean_calc data frame is ultimately the output I am looking for, except that instead of calculating the mean and sd once, the summary statistics are calculated 999 times and averaged.

Answer 1

library(tidyverse) 
...<your test_df>...

test_df %>%
  group_by(plant_sp, site) %>%
  nest() %>%
  ungroup() %>%
  # one row per group, sample size and iteration
  crossing(sample_size = c(3, 4, 5), iter = 1:10) %>%
  # draw sample_size rows from each group's nested data
  mutate(sample_data = map2(data, sample_size, ~ sample_n(.x, .y))) %>%
  mutate(calc = map(sample_data,
                    ~ summarise(.x, mean = mean(sp_rich), sd = sd(sp_rich)))) %>%
  select(plant_sp, site, sample_size, iter, calc) %>%
  unnest(calc) %>%
  group_by(plant_sp, site, sample_size) %>%
  arrange(sample_size, plant_sp, site)

Here the sample sizes are c(3, 4, 5) and the number of iterations is 10, as an example.
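The pipeline above returns one row per iteration, whereas the question asks for the statistics averaged over 999 iterations. One way that last step could look is sketched below; it is a sketch under stated assumptions, not the answerer's code. The names n_iter, sample_sizes, resampled, draw, iter_mean and iter_sd are made up for illustration, the seed is only there for reproducibility, and the sampling is without replacement (the sample_n() default).

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # assumption: fixed seed so the resampling is reproducible

# Same example data as in the question, written more compactly
test_df <- data.frame(
  plant_sp = rep(c("plant_1", "plant_2"), each = 10),
  site     = rep(rep(c("a", "b"), each = 5), times = 2),
  sp_rich  = c(5, 3, 5, 3, 5, 7, 8, 8, 8, 10,
               1, 4, 5, 6, 3, 7, 3, 12, 12, 11)
)

n_iter       <- 999      # number of resampling iterations
sample_sizes <- c(3, 4)  # sample sizes to compare

resampled <- test_df %>%
  group_by(plant_sp, site) %>%
  nest() %>%
  ungroup() %>%
  # one row per group x sample size x iteration
  expand_grid(sample_size = sample_sizes, iter = seq_len(n_iter)) %>%
  mutate(
    # draw sample_size values without replacement from each group
    draw      = map2(data, sample_size, ~ sample(.x$sp_rich, size = .y)),
    iter_mean = map_dbl(draw, mean),
    iter_sd   = map_dbl(draw, sd)
  )

# Average the per-iteration statistics within each group and sample size
mean_calc <- resampled %>%
  group_by(plant_sp, site, sample_size) %>%
  summarise(mean = mean(iter_mean),
            sd   = mean(iter_sd),
            .groups = "drop") %>%
  arrange(sample_size, plant_sp, site)

mean_calc
```

This gives one row per plant_sp, site and sample_size, which matches the layout of mean_calc in the question. If the spread of the resampled means is wanted rather than the average of the per-sample sds, sd(iter_mean) could be used in the summarise() step instead.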

Answer 1
0 2019-07-23 13:32:56
