How do you filter data in a second group based on data in the first group in dplyr and/or tidyverse?

Question

I have a data frame (df) with the following columns: horse name, age, and speed figure (value). Initially, I plot the data with ggplot's geom_boxplot to see the average speed figure by age.

Now I would like to make the same plot, but this time include only horses that raced more than three times as two-year-olds. I'm struggling to figure out how to accomplish this.

I tried group_by(horse, age), then summarising the number of times each horse raced at each age, and finally filtering out horses with n < 4 at age 2. Unfortunately, I think my logic/approach may be flawed.

Can anyone think of an elegant way of accomplishing this? It seems straightforward, yet I struggle.

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(brew)
#> Warning: package 'brew' was built under R version 4.0.3

df <- tibble(horse=c("a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c","c","d","d","d","d","d","d"),
             age = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
             value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18))


df
#> # A tibble: 34 x 3
#>    horse   age value
#>    <chr> <dbl> <dbl>
#>  1 a         2    20
#>  2 a         2    21
#>  3 a         2    19
#>  4 a         2    23
#>  5 a         2    20
#>  6 a         3    17
#>  7 a         3    16
#>  8 a         3    23
#>  9 a         4    24
#> 10 a         4    14
#> # ... with 24 more rows



df %>%  
  ggplot(aes(x=as.factor(age), y=value, fill=as.factor(age))) +
  geom_boxplot(alpha=0.7) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, color = "red", fill = "red") +  # mean as a red dot
  stat_summary(fun = mean, geom = "text", col = "black",     # Add text to plot
               vjust = -1.5, aes(label = paste("X:", round(after_stat(y), digits = 1)))) +
  theme(legend.position="none") +
  scale_fill_brewer(palette="Set1")

Created on 2021-06-19 by the reprex package (v0.3.0)

Answer 1

If I understood correctly what you are aiming for, the following should work.

Here I assume you want to keep all observations for the horses that ran at least 3 races when they were 2 years old, that is, keep their races at other ages as well and not only the observations from when they were 2 years old.

library(dplyr)

df <- tibble(horse=c("a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c","c","d","d","d","d","d","d"),
             age = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
             value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18))

df %>% group_by(horse, age) %>% 
  mutate(n_races_by_age = n(),         
         check_if_keep = if_else(age == 2 & n_races_by_age >= 3, 1, 0)) %>% 
  ungroup(age) %>% 
  mutate(
    horse_to_keep = max(check_if_keep)
    # it is still grouped by horse, so keep all observations of those horses for 
    # which the above conditions are met. 
  )
#> # A tibble: 34 x 6
#> # Groups:   horse [4]
#>    horse   age value n_races_by_age check_if_keep horse_to_keep
#>    <chr> <dbl> <dbl>          <int>         <dbl>         <dbl>
#>  1 a         2    20              5             1             1
#>  2 a         2    21              5             1             1
#>  3 a         2    19              5             1             1
#>  4 a         2    23              5             1             1
#>  5 a         2    20              5             1             1
#>  6 a         3    17              3             0             1
#>  7 a         3    16              3             0             1
#>  8 a         3    23              3             0             1
#>  9 a         4    24              2             0             1
#> 10 a         4    14              2             0             1
#> # … with 24 more rows

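If that is what you mean, then you only need to add %>% filter(horse_to_keep == 1) to get the desired result. Note that the question asks for more than three races at age 2; for that, use n_races_by_age > 3 instead of >= 3 (the example data gives the same result either way).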
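Putting the steps of this answer together, a minimal sketch of the full pipeline, including the final filter and the boxplot from the question, might look like the following. The names kept and p are illustrative and not part of the original answer, and the explicit group_by(horse) is used here in place of ungroup(age) only to make the regrouping visible.

```r
library(dplyr)
library(ggplot2)

# Same example data as in the question, written compactly
df <- tibble(
  horse = rep(c("a", "b", "c", "d"), times = c(10, 6, 12, 6)),
  age   = c(2,2,2,2,2,3,3,3,4,4, 2,2,3,3,3,4,
            2,2,2,2,2,3,3,3,3,3,4,4, 2,3,3,3,3,4),
  value = c(20,21,19,23,20,17,16,23,24,14, 23,24,18,19,16,19,
            17,24,19,18,17,15,18,12,12,14,15,11, 23,24,14,23,24,18)
)

kept <- df %>%
  group_by(horse, age) %>%
  # flag rows belonging to a qualifying age-2 campaign
  # (>= 3 follows this answer's assumption; use > 3 for "more than three")
  mutate(n_races_by_age = n(),
         check_if_keep = if_else(age == 2 & n_races_by_age >= 3, 1, 0)) %>%
  group_by(horse) %>%
  # spread the flag to every row of the same horse
  mutate(horse_to_keep = max(check_if_keep)) %>%
  ungroup() %>%
  filter(horse_to_keep == 1) %>%
  select(horse, age, value)

# Boxplot by age for the remaining horses, with the mean as a red dot
p <- ggplot(kept, aes(x = factor(age), y = value, fill = factor(age))) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, color = "red") +
  theme(legend.position = "none")
```

With the example data, only horses a and c remain in kept; printing p draws the plot.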

Answer 2

Here are a couple of ways to keep only the horses that raced more than 3 times as 2-year-olds.

Using filter:

library(dplyr)

df %>%
  group_by(horse) %>%
  filter(sum(age == 2) > 3) %>%
  ungroup()

#   horse   age value
#   <chr> <dbl> <dbl>
# 1 a         2    20
# 2 a         2    21
# 3 a         2    19
# 4 a         2    23
# 5 a         2    20
# 6 a         3    17
# 7 a         3    16
# 8 a         3    23
# 9 a         4    24
#10 a         4    14
# … with 12 more rows

Using join:

df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3) %>%
  select(-n) %>%
  left_join(df, by = 'horse')
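As a possible variation on the join approach, the sketch below uses dplyr's semi_join(), which is not what this answer uses: it keeps the rows of df that have a match in the lookup table without adding columns, so the select(-n) step is not needed. The names qualifying, by_join and by_filter are illustrative only.

```r
library(dplyr)

# Same example data as in the question, written compactly
df <- tibble(
  horse = rep(c("a", "b", "c", "d"), times = c(10, 6, 12, 6)),
  age   = c(2,2,2,2,2,3,3,3,4,4, 2,2,3,3,3,4,
            2,2,2,2,2,3,3,3,3,3,4,4, 2,3,3,3,3,4),
  value = c(20,21,19,23,20,17,16,23,24,14, 23,24,18,19,16,19,
            17,24,19,18,17,15,18,12,12,14,15,11, 23,24,14,23,24,18)
)

# Horses with more than 3 races at age 2
qualifying <- df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3)

# Keep every row of df whose horse appears in `qualifying`
by_join <- semi_join(df, qualifying, by = "horse")

# The grouped-filter version from this answer, for comparison
by_filter <- df %>%
  group_by(horse) %>%
  filter(sum(age == 2) > 3) %>%
  ungroup()

nrow(by_join)          # 22 rows: horses a and c
unique(by_join$horse)
```

Both by_join and by_filter keep the original row order of df, so either can be piped straight into the ggplot code from the question.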

2 answers

Answer 1
1 2021-06-19 18:18:54

Answer 2
1 Accepted 2021-06-20 00:57:47
