How to randomly sample a data frame (sample_n) and calculate summary statistics after using group_by, and iterate 999 times?
I want to resample my data frame (test_df) and calculate summary statistics (mean and standard deviation) of a numeric response variable (sp_rich), after grouping the data by two categorical factors (plant_sp = plant species, and site). I would then like this process to be iterated, say 999 times.
Additionally, I would like to resample the data frame using multiple sample sizes, calculating the above statistics and performing the iteration for each size.
Ultimately, I would really like this to be in a dplyr/tidy framework, as I am more familiar with this style, but I am open to base R or other options.
So here is an example data frame:
test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
                                       "plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2",
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2"),
                          site = c("a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b",
                                   "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b"),
                          sp_rich = c(5, 3, 5, 3, 5,
                                      7, 8, 8, 8, 10,
                                      1, 4, 5, 6, 3,
                                      7, 3, 12, 12, 11)),
                     row.names = c(NA, -20L), class = "data.frame",
                     .Names = c("plant_sp", "site", "sp_rich"))
# I can calculate the summary statistics for one iteration,
# and for one sample size at a time:
library(dplyr)
mean_calc <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 3)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich),
            sample_size = n())  # rows actually drawn in each plant_sp/site group
> mean_calc
# A tibble: 4 x 5
# Groups: plant_sp [2]
plant_sp site mean sd sample_size
<fct> <fct> <dbl> <dbl> <dbl>
1 A GHT 7 2 3
2 A PE 3.33 0.577 3
3 B GHT 3.33 1.53 3
4 B PE 1.67 0.577 3
# I can also perform the calculations manually for each
# sample size, and put the data together (hack).
# Here this is done for two different sample sizes:
mean_calc_3 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 3)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich)) %>%
  mutate(sample_size = 3)
mean_calc_3
mean_calc_4 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 4)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich)) %>%
  mutate(sample_size = 4)
mean_calc_4
mean_calc <- bind_rows(mean_calc_3, mean_calc_4)
mean_calc <- mean_calc %>%
  group_by(plant_sp, site, sample_size) %>%
  arrange(sample_size, plant_sp, site)
# A tibble: 8 x 5
# Groups: plant_sp, site, sample_size [8]
plant_sp site mean sd sample_size
<fct> <fct> <dbl> <dbl> <dbl>
1 A GHT 5.67 1.53 3
2 A PE 4.33 1.53 3
3 B GHT 3.67 1.15 3
4 B PE 2 1 3
5 A GHT 6.5 2.08 4
6 A PE 4.25 1.26 4
7 B GHT 2.75 0.5 4
8 B PE 2.25 0.5 4
I would really like to automate these calculations across multiple sample sizes (e.g. n = 3 and n = 4 in this example; the real data would have roughly 5-10 different size classes), and then iterate this entire process 999 times.
The structure of the mean_calc data frame is ultimately the output that I am looking for, except that instead of calculating the mean and sd once, the summary statistics are calculated 999 times and averaged.
library(tidyverse)
# ... define test_df as in the question ...
test_df %>%
  group_by(plant_sp, site) %>%
  nest() %>%                       # one row per group, rows stored in `data`
  ungroup() %>%
  crossing(sample_size = c(3, 4, 5), iter = 1:10) %>%
  mutate(sample_data = map2(data, sample_size, ~ sample_n(.x, .y))) %>%
  mutate(calc = map(sample_data,
                    ~ summarise(.x, mean = mean(sp_rich), sd = sd(sp_rich)))) %>%
  select(plant_sp, site, sample_size, iter, calc) %>%
  unnest(calc) %>%                 # name the list-column to unnest explicitly
  group_by(plant_sp, site, sample_size) %>%
  arrange(sample_size, plant_sp, site)
Here the sample sizes are c(3,4,5) and the number of iterations is 10, as an example.
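The question also asks for the statistics to be averaged over the iterations, rather than returned once per iteration. A minimal sketch of one way that could be done is below. It assumes dplyr >= 1.0.0, since it uses slice_sample() in place of sample_n(). The names resample_once, sample_sizes, n_iter, mean_i and sd_i are hypothetical, and the set.seed() call is an addition for reproducibility, not part of the original code.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # assumption: fixed seed so repeated runs give the same draws

# Same example data as in the question
test_df <- data.frame(
  plant_sp = rep(c("plant_1", "plant_2"), each = 10),
  site     = rep(rep(c("a", "b"), each = 5), times = 2),
  sp_rich  = c(5, 3, 5, 3, 5, 7, 8, 8, 8, 10,
               1, 4, 5, 6, 3, 7, 3, 12, 12, 11)
)

sample_sizes <- c(3, 4)  # hypothetical choice of sample sizes
n_iter       <- 999      # number of resampling iterations

# One resampling pass: draw `size` rows (without replacement) from each
# plant_sp/site group, then summarise that draw.
resample_once <- function(df, size) {
  df %>%
    group_by(plant_sp, site) %>%
    slice_sample(n = size) %>%
    summarise(mean_i = mean(sp_rich), sd_i = sd(sp_rich), .groups = "drop")
}

# One row per (sample size, iteration), each holding its own summary table
all_runs <- crossing(sample_size = sample_sizes, iter = seq_len(n_iter)) %>%
  mutate(stats = map(sample_size, ~ resample_once(test_df, .x))) %>%
  unnest(stats)

# Average the per-iteration statistics within each group and sample size
mean_calc <- all_runs %>%
  group_by(plant_sp, site, sample_size) %>%
  summarise(mean = mean(mean_i), sd = mean(sd_i), .groups = "drop") %>%
  arrange(sample_size, plant_sp, site)

mean_calc
```

all_runs keeps every individual iteration, so other summaries (for example quantiles of mean_i) can be computed from it without resampling again.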