Sample_n with if_else after group_by in dataframe
Here is a test data frame:
test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_2", "plant_2", "plant_3",
"plant_3", "plant_3", "plant_3", "plant_3", "plant_4",
"plant_4", "plant_4", "plant_4", "plant_4", "plant_4",
"plant_5", "plant_5", "plant_5", "plant_5", "plant_5"),
site = c("a", "a", "a", "a", "a",
"b", "b", "b", "b", "b",
"a", "a", "a", "a", "a",
"b", "b", "b", "b", "b"),
sp_rich = c(5, 3, 5, 3, 5,
7, 8, 8, 8, 10,
1, 4, 5, 6, 3,
7, 3, 12, 12,11)),
row.names = c(NA, -20L), class = "data.frame",
.Names = c("plant_sp", "site", "sp_rich"))
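I want to group_by plant_sp and, if a group has more than 3 rows, extract 3 random rows from it.
In other words: take each group, and if its size is larger than 3, keep only 3 random rows from that group.
I am trying to use if_else, but I can't get it to work: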
test_df <- test_df %>% group_by(plant_sp) %>%
if_else(length(plant_sp) > 3, sample_n(size =3))
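I don't think I am using the length() function correctly.
Can you help me?
Thanks, Ido
Does this help? It is perhaps not the most elegant version, but it should do the trick.
Here is the answer, edited in response to the comments: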
test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_2", "plant_2", "plant_3",
"plant_3", "plant_3", "plant_3", "plant_3", "plant_4",
"plant_4", "plant_4", "plant_4", "plant_4", "plant_4",
"plant_5", "plant_5", "plant_5", "plant_5", "plant_5"),
site = c("a", "a", "a", "a", "a",
"b", "b", "b", "b", "b",
"a", "a", "a", "a", "a",
"b", "b", "b", "b", "b"),
sp_rich = c(5, 3, 5, 3, 5,
7, 8, 8, 8, 10,
1, 4, 5, 6, 3,
7, 3, 12, 12,11)),
row.names = c(NA, -20L), class = "data.frame",
.Names = c("plant_sp", "site", "sp_rich"))
library(tidyverse)
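# Number the rows within each group and record each group's size
df_group <- test_df %>%
  group_by(plant_sp) %>%
  mutate(row_number = row_number()) %>%
  mutate(row_max = max(row_number)) %>%
  ungroup()

# Groups with more than 3 rows: keep 3 random rows per group
df_3 <- df_group %>%
  group_by(plant_sp) %>%
  filter(row_max > 3) %>%
  slice_sample(n = 3)

# Groups with 3 rows or fewer: keep every row
df_small <- df_group %>%
  filter(row_max < 4)

# Recombine both parts and sort by group
df_test <- bind_rows(df_3, df_small) %>%
  arrange(plant_sp)

df_test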
#> # A tibble: 13 x 5
#> # Groups: plant_sp [5]
#> plant_sp site sp_rich row_number row_max
#> <chr> <chr> <dbl> <int> <int>
#> 1 plant_1 a 5 1 2
#> 2 plant_1 a 3 2 2
#> 3 plant_2 a 5 1 2
#> 4 plant_2 a 3 2 2
#> 5 plant_3 b 8 4 5
#> 6 plant_3 a 5 1 5
#> 7 plant_3 b 7 2 5
#> 8 plant_4 a 3 6 6
#> 9 plant_4 b 10 1 6
#> 10 plant_4 a 5 4 6
#> 11 plant_5 b 7 1 5
#> 12 plant_5 b 12 4 5
#> 13 plant_5 b 12 3 5
Created on 2020-11-30 by the reprex package (v0.3.0)
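The split-and-recombine logic above can probably be collapsed into a single step. The following is only a minimal sketch, not part of the original answer: it assumes a dplyr version in which n() can be called inside slice() on a grouped data frame, and the name df_one_step is made up for illustration. The data are the same as test_df, rebuilt compactly so that the block runs on its own.

```r
library(dplyr)

# Same data as test_df above, rebuilt compactly so this block is self-contained.
test_df <- data.frame(
  plant_sp = rep(paste0("plant_", 1:5), times = c(2, 2, 5, 6, 5)),
  site     = rep(c("a", "b", "a", "b"), each = 5),
  sp_rich  = c(5, 3, 5, 3, 5, 7, 8, 8, 8, 10, 1, 4, 5, 6, 3, 7, 3, 12, 12, 11)
)

# Within each group, draw min(n(), 3) row positions without replacement.
# Groups with 3 rows or fewer therefore keep all of their rows.
df_one_step <- test_df %>%               # df_one_step is a hypothetical name
  group_by(plant_sp) %>%
  slice(sample.int(n(), min(n(), 3L))) %>%
  ungroup()

df_one_step
```

This avoids the helper columns row_number and row_max, at the cost of being a little less explicit about which groups are sampled and which are kept whole.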
If you are using dplyr 1.0.0 or later, you can use slice_sample. It keeps 3 random rows in each group. If a group has fewer than 3 rows, it keeps all of them.
library(dplyr)
test_df %>% group_by(plant_sp) %>% slice_sample(n = 3)
# plant_sp site sp_rich
# <chr> <chr> <dbl>
# 1 plant_1 a 3
# 2 plant_1 a 5
# 3 plant_2 a 5
# 4 plant_2 a 3
# 5 plant_3 b 8
# 6 plant_3 b 8
# 7 plant_3 b 7
# 8 plant_4 b 10
# 9 plant_4 a 5
#10 plant_4 a 4
#11 plant_5 b 7
#12 plant_5 b 12
#13 plant_5 b 3
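Because the rows are drawn at random, the rows you get will differ from the output shown here and from one run to the next. If you need reproducible results, one option is to call set.seed() first. The sketch below is an illustration under that assumption: the seed value 42 and the names sampled and group_sizes are arbitrary choices, not part of the original answer. It also uses count() to confirm that no group ends up with more than 3 rows.

```r
library(dplyr)

# Same data as test_df above, rebuilt compactly so this block is self-contained.
test_df <- data.frame(
  plant_sp = rep(paste0("plant_", 1:5), times = c(2, 2, 5, 6, 5)),
  site     = rep(c("a", "b", "a", "b"), each = 5),
  sp_rich  = c(5, 3, 5, 3, 5, 7, 8, 8, 8, 10, 1, 4, 5, 6, 3, 7, 3, 12, 12, 11)
)

set.seed(42)  # arbitrary seed, chosen only for illustration

sampled <- test_df %>%
  group_by(plant_sp) %>%
  slice_sample(n = 3) %>%
  ungroup()

# Rows kept per group: plant_1 and plant_2 keep both of their rows,
# while plant_3, plant_4 and plant_5 are cut down to 3 rows each.
group_sizes <- count(sampled, plant_sp)
group_sizes
```

Which rows are selected depends on the seed, but the group sizes do not: they are always the smaller of 3 and the original group size.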