How do you filter out data in the first group based on data in the second group in dplyr and/or tidyverse
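I have a data frame (df) with the following columns: horse name, age, and speed figure (value). Initially, I plotted the data with ggplot geom_boxplot to look at the mean speed figure by age.
Now I want to make the same plot, but this time including only horses that raced more than 3 times as two-year-olds, and I am struggling to work out how to do it.
I tried group_by(horse, age), then summarising the number of races each horse ran at each age, and finally filtering out horses with n < 4 at age 2. Unfortunately, I think my logic/approach may be flawed.
Can anyone think of an elegant way to achieve this? It seems simple, but I am struggling.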
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(brew)
#> Warning: package 'brew' was built under R version 4.0.3
df <- tibble(horse=c("a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c","c","d","d","d","d","d","d"),
age = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18))
df
#> # A tibble: 34 x 3
#> horse age value
#> <chr> <dbl> <dbl>
#> 1 a 2 20
#> 2 a 2 21
#> 3 a 2 19
#> 4 a 2 23
#> 5 a 2 20
#> 6 a 3 17
#> 7 a 3 16
#> 8 a 3 23
#> 9 a 4 24
#> 10 a 4 14
#> # ... with 24 more rows
df %>%
ggplot(aes(x=as.factor(age), y=value, fill=as.factor(age))) +
geom_boxplot(alpha=0.7) +
stat_summary(fun.y=mean, geom="point", shape=20, size=8, color="red", fill="red") +
stat_summary(fun = mean, geom = "text", col = "black", # Add text to plot
vjust = -1.5, aes(label = paste("X:", round(..y.., digits = 1)))) +
theme(legend.position="none") +
scale_fill_brewer(palette="Set1")
#> Warning: `fun.y` is deprecated. Use `fun` instead.
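Created on 2021-06-19 by the reprex package (v0.3.0)
If I have understood your goal correctly, the following should work.
Here I am assuming that you want to keep all observations of the horses that ran at least 3 races at age 2, that is, their races before and after as well, not only the observations from age 2.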
library(dplyr)
df <- tibble(horse=c("a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c","c","d","d","d","d","d","d"),
age = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18))
df %>% group_by(horse, age) %>%
mutate(n_races_by_age = n(),
check_if_keep = if_else(age == 2 & n_races_by_age >= 3, 1, 0)) %>%
ungroup(age) %>%
mutate(
horse_to_keep = max(check_if_keep)
# it is still grouped by horse, so keep all observations of those horses for
# which the above conditions are met.
)
#> # A tibble: 34 x 6
#> # Groups: horse [4]
#> horse age value n_races_by_age check_if_keep horse_to_keep
#> <chr> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 a 2 20 5 1 1
#> 2 a 2 21 5 1 1
#> 3 a 2 19 5 1 1
#> 4 a 2 23 5 1 1
#> 5 a 2 20 5 1 1
#> 6 a 3 17 3 0 1
#> 7 a 3 16 3 0 1
#> 8 a 3 23 3 0 1
#> 9 a 4 24 2 0 1
#> 10 a 4 14 2 0 1
#> # … with 24 more rows
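If that is what you meant, you only need to add %>% filter(horse_to_keep == 1) to get the desired result.
Note that the code above uses >= 3 (at least 3 races). For the stricter "more than 3 races" in the question, change the condition to n_races_by_age > 3. With this sample data both thresholds keep the same horses (a and c).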
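Putting the pieces of this answer together, a minimal sketch of the full pipeline might look like the following. The trailing ungroup() and select() steps and the name kept are additions for illustration and are not part of the original answer.

```r
library(dplyr)

# Same data as in the question (rep() is used only to shorten the horse column).
df <- tibble(
  horse = c(rep("a", 10), rep("b", 6), rep("c", 12), rep("d", 6)),
  age   = c(2,2,2,2,2,3,3,3,4,4, 2,2,3,3,3,4,
            2,2,2,2,2,3,3,3,3,3,4,4, 2,3,3,3,3,4),
  value = c(20,21,19,23,20,17,16,23,24,14, 23,24,18,19,16,19,
            17,24,19,18,17,15,18,12,12,14,15,11, 23,24,14,23,24,18)
)

kept <- df %>%
  group_by(horse, age) %>%
  mutate(n_races_by_age = n(),
         check_if_keep  = if_else(age == 2 & n_races_by_age >= 3, 1, 0)) %>%
  ungroup(age) %>%                              # still grouped by horse
  mutate(horse_to_keep = max(check_if_keep)) %>%
  ungroup() %>%
  filter(horse_to_keep == 1) %>%                # the step the answer suggests adding
  select(horse, age, value)                     # assumption: drop the helper columns

distinct(kept, horse)   # horses a and c remain
```

The resulting kept data frame has the same columns as df, so it can be piped into the ggplot call from the question in place of df.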
Here are a couple of ways to keep the horses that have more than 3 races as 2-year-olds.
Using filter:
library(dplyr)
df %>%
group_by(horse) %>%
filter(sum(age == 2) > 3) %>%
ungroup()
# horse age value
# <chr> <dbl> <dbl>
# 1 a 2 20
# 2 a 2 21
# 3 a 2 19
# 4 a 2 23
# 5 a 2 20
# 6 a 3 17
# 7 a 3 16
# 8 a 3 23
# 9 a 4 24
#10 a 4 14
# … with 12 more rows
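The grouped filter above works because sum(age == 2) counts the TRUE values within each horse, and a single TRUE or FALSE per group keeps or drops all of that horse's rows. As a rough check, the per-horse counts can be inspected with summarise(); the column names n_age2 and keep are hypothetical and chosen only for this sketch.

```r
library(dplyr)

# Same data as in the question (rep() is used only to shorten the horse column).
df <- tibble(
  horse = c(rep("a", 10), rep("b", 6), rep("c", 12), rep("d", 6)),
  age   = c(2,2,2,2,2,3,3,3,4,4, 2,2,3,3,3,4,
            2,2,2,2,2,3,3,3,3,3,4,4, 2,3,3,3,3,4),
  value = c(20,21,19,23,20,17,16,23,24,14, 23,24,18,19,16,19,
            17,24,19,18,17,15,18,12,12,14,15,11, 23,24,14,23,24,18)
)

counts <- df %>%
  group_by(horse) %>%
  summarise(n_age2 = sum(age == 2),   # number of races at age 2
            keep   = n_age2 > 3)      # the condition used inside filter()

counts
# horse a: 5 races at age 2, keep = TRUE
# horse b: 2 races at age 2, keep = FALSE
# horse c: 5 races at age 2, keep = TRUE
# horse d: 1 race  at age 2, keep = FALSE
```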
Using a join:
df %>%
filter(age == 2) %>%
count(horse) %>%
filter(n > 3) %>%
select(-n) %>%
left_join(df, by = 'horse')
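As a variation on the join approach, semi_join() can keep the matching rows of df directly, so there is no need to drop the n column or join df back onto the counts. This is a sketch rather than part of the original answer, and the names keepers and kept are hypothetical.

```r
library(dplyr)

# Same data as in the question (rep() is used only to shorten the horse column).
df <- tibble(
  horse = c(rep("a", 10), rep("b", 6), rep("c", 12), rep("d", 6)),
  age   = c(2,2,2,2,2,3,3,3,4,4, 2,2,3,3,3,4,
            2,2,2,2,2,3,3,3,3,3,4,4, 2,3,3,3,3,4),
  value = c(20,21,19,23,20,17,16,23,24,14, 23,24,18,19,16,19,
            17,24,19,18,17,15,18,12,12,14,15,11, 23,24,14,23,24,18)
)

# Horses with more than 3 races at age 2.
keepers <- df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3)

# semi_join() keeps the rows of df whose horse appears in keepers,
# without adding any columns from keepers.
kept <- df %>%
  semi_join(keepers, by = "horse")
```

Because kept has the same columns as df, it can replace df at the start of the ggplot pipeline from the question.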