sample_n and if_else after group_by in a data frame

Question

Here is a test data frame:

test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_2", "plant_2", "plant_3",
                                       "plant_3", "plant_3", "plant_3", "plant_3", "plant_4", 
                                       "plant_4", "plant_4", "plant_4", "plant_4", "plant_4",
                                       "plant_5", "plant_5", "plant_5", "plant_5", "plant_5"), 
                          site = c("a", "a", "a", "a", "a",  
                                   "b", "b", "b", "b", "b",  
                                   "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b"),
                          sp_rich = c(5, 3, 5, 3, 5, 
                                      7, 8, 8, 8, 10,
                                      1, 4, 5, 6, 3, 
                                      7, 3, 12, 12,11)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "site", "sp_rich"))

I want to group_by plant_sp and extract 3 random rows from each group that has more than 3 rows.

In other words: take each group and, if the group size is bigger than 3, randomly keep only 3 rows in that group.

I'm trying to use if_else, but I can't get it to work:

test_df <- test_df %>% group_by(plant_sp) %>%
if_else(length(plant_sp) > 3, sample_n(size =3))

I guess that I'm not using the length() function correctly.

Can you help me?

Thanks, Ido

Answer 1

Does this help? Maybe not the most elegant version, but it should do the trick.

Here is the edited answer in response to the comment:

test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_2", "plant_2", "plant_3",
                                       "plant_3", "plant_3", "plant_3", "plant_3", "plant_4", 
                                       "plant_4", "plant_4", "plant_4", "plant_4", "plant_4",
                                       "plant_5", "plant_5", "plant_5", "plant_5", "plant_5"), 
                          site = c("a", "a", "a", "a", "a",  
                                   "b", "b", "b", "b", "b",  
                                   "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b"),
                          sp_rich = c(5, 3, 5, 3, 5, 
                                      7, 8, 8, 8, 10,
                                      1, 4, 5, 6, 3, 
                                      7, 3, 12, 12,11)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "site", "sp_rich"))

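library(tidyverse)

# Number the rows within each group and record each group's size
df_group <- test_df %>% 
  group_by(plant_sp) %>% 
  mutate(row_number=row_number()) %>% 
  mutate(row_max=max(row_number)) %>% 
  ungroup()

# Groups with more than 3 rows: draw 3 rows at random per group
df_3 <- df_group %>% 
  group_by(plant_sp) %>% 
  filter(row_max>3) %>% 
  slice_sample(n = 3)

# Groups with 3 rows or fewer: keep every row
df_small <- df_group %>% 
  filter(row_max<4)

# Recombine both parts and sort by group
df_test <- bind_rows(df_3, df_small) %>% 
  arrange(plant_sp)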
df_test
#> # A tibble: 13 x 5
#> # Groups:   plant_sp [5]
#>    plant_sp site  sp_rich row_number row_max
#>    <chr>    <chr>   <dbl>      <int>   <int>
#>  1 plant_1  a           5          1       2
#>  2 plant_1  a           3          2       2
#>  3 plant_2  a           5          1       2
#>  4 plant_2  a           3          2       2
#>  5 plant_3  b           8          4       5
#>  6 plant_3  a           5          1       5
#>  7 plant_3  b           7          2       5
#>  8 plant_4  a           3          6       6
#>  9 plant_4  b          10          1       6
#> 10 plant_4  a           5          4       6
#> 11 plant_5  b           7          1       5
#> 12 plant_5  b          12          4       5
#> 13 plant_5  b          12          3       5

Created on 2020-11-30 by the reprex package (v0.3.0)
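
The helper columns row_number and row_max are probably not needed: inside a grouped filter(), n() gives the current group's size directly. The following is a minimal sketch of that variation, not the answer author's code. It rebuilds the same test data with rep() for brevity, and the names big, small and result are made up for illustration. set.seed() is there only so the random draw can be repeated.

```r
library(dplyr)

# Same data as test_df in the question, built more compactly
test_df <- data.frame(
  plant_sp = rep(paste0("plant_", 1:5), times = c(2, 2, 5, 6, 5)),
  site     = rep(c("a", "b", "a", "b"), each = 5),
  sp_rich  = c(5, 3, 5, 3, 5, 7, 8, 8, 8, 10,
               1, 4, 5, 6, 3, 7, 3, 12, 12, 11)
)

set.seed(1)  # only to make the random draw repeatable

# Groups with more than 3 rows: sample 3 rows from each
big <- test_df %>%
  group_by(plant_sp) %>%
  filter(n() > 3) %>%
  slice_sample(n = 3) %>%
  ungroup()

# Groups with 3 rows or fewer: keep every row
small <- test_df %>%
  group_by(plant_sp) %>%
  filter(n() <= 3) %>%
  ungroup()

result <- bind_rows(big, small) %>%
  arrange(plant_sp)

# Rows kept per group
count(result, plant_sp)
```

Which rows are drawn for plant_3, plant_4 and plant_5 depends on the seed; only the group sizes are fixed.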

Answer 2

You can use slice_sample if you are on dplyr 1.0.0 or above. It keeps at most 3 randomly chosen rows in each group. If a group has 3 rows or fewer, it keeps all of them.

library(dplyr)
test_df %>% group_by(plant_sp) %>% slice_sample(n = 3)

#  plant_sp site  sp_rich
#   <chr>    <chr>   <dbl>
# 1 plant_1  a           3
# 2 plant_1  a           5
# 3 plant_2  a           5
# 4 plant_2  a           3
# 5 plant_3  b           8
# 6 plant_3  b           8
# 7 plant_3  b           7
# 8 plant_4  b          10
# 9 plant_4  a           5
#10 plant_4  a           4
#11 plant_5  b           7
#12 plant_5  b          12
#13 plant_5  b           3
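
As for why the attempt in the question fails: if_else() is vectorised, so it expects a logical vector plus "true" and "false" vectors and returns a vector. It does not branch per group, and the pipe hands it the whole grouped data frame as its condition. If an explicit per-group condition is still wanted, one option is group_modify() with an ordinary if/else. This is a sketch under that assumption rather than part of the accepted answer. The name result is made up for illustration, and the data is rebuilt with rep() to keep the example self-contained.

```r
library(dplyr)

# Same data as test_df in the question, built more compactly
test_df <- data.frame(
  plant_sp = rep(paste0("plant_", 1:5), times = c(2, 2, 5, 6, 5)),
  site     = rep(c("a", "b", "a", "b"), each = 5),
  sp_rich  = c(5, 3, 5, 3, 5, 7, 8, 8, 8, 10,
               1, 4, 5, 6, 3, 7, 3, 12, 12, 11)
)

set.seed(1)  # only to make the random draw repeatable

result <- test_df %>%
  group_by(plant_sp) %>%
  # .x is the data frame for one group (without the grouping column)
  group_modify(~ if (nrow(.x) > 3) slice_sample(.x, n = 3) else .x) %>%
  ungroup()

# Rows kept per group
count(result, plant_sp)
```

For this particular rule the plain slice_sample(n = 3) call gives the same group sizes, because it never returns more rows than a group has. The explicit if/else is mainly useful when the threshold and the sample size differ.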
