R - dplyr - filter top_n rows based on multiple conditions
I hope this question is clear and reproducible.
I was wondering if there is a more elegant solution than my approach below.
I have a data frame and would like to apply conditional filters and extract the rows that meet them.
As output I would like the top_n rows that meet each condition (each condition ranks by a different column), while preserving all other columns.
Example data frame:
set.seed(123)
df1 <- data.frame(
A = as.numeric(1:10),
B = sample(seq(as.Date('2000/01/01'), as.Date('2018/01/01'), by="day"), size=10),
C = as.numeric(sample(20:90, size = 10)),
D = sample(c("yes", "no"), size=10, replace = TRUE),
E = as.numeric(sample(1000:2000, size = 10))
)
df1 #check output
> df1 #check output
A B C D E
1 1 2005-03-06 87 no 1963
2 2 2014-03-11 51 no 1902
3 3 2007-05-12 66 no 1690
4 4 2015-11-22 58 no 1793
5 5 2016-12-02 26 no 1024
6 6 2000-10-26 79 no 1475
7 7 2009-07-01 35 no 1754
8 8 2016-01-19 22 no 1215
9 9 2009-11-30 40 yes 1315
10 10 2008-03-17 85 yes 1229
Conditions I would like to use for filtering:
A) if column E is between 1000 and 1500, return the top 2 rows ranked by column A
B) if column E is between 1000 and 2000, return the top 2 rows ranked by column B
C) if column E is between 1000 and 1400, return the top 2 rows ranked by column C
I have come up with the following solution, but it is cumbersome, and I wondered if there is a better approach.
library("dplyr")
library("tidyr")
A <- df1 %>% dplyr::filter(E >= 1000 & E <= 1500) %>% top_n(n = 2, wt = A) %>% arrange(-A) %>% mutate(condition = "-cond_A")
B <- df1 %>% dplyr::filter(E >= 1000 & E <= 2000) %>% top_n(n = 2, wt = B) %>% arrange(B) %>% mutate(condition = "cond_B")
C <- df1 %>% dplyr::filter(E >= 1000 & E <= 1400) %>% top_n(n = 2, wt = C) %>% arrange(-C) %>% mutate(condition = "-cond_C")
My desired output is the following:
spread(as.data.frame(distinct(bind_rows(A,B,C))),condition, condition)
A B C D E -cond_A -cond_C cond_B
1 5 2016-12-02 26 no 1024 <NA> <NA> cond_B
2 8 2016-01-19 22 no 1215 <NA> <NA> cond_B
3 9 2009-11-30 40 yes 1315 -cond_A -cond_C <NA>
4 10 2008-03-17 85 yes 1229 -cond_A -cond_C <NA>
That's great, thank you so much!
In my comments I asked whether map2 could take more arguments, and I realised that pmap can do just that. Here the lower bound on E varies as well:
pmap(list(c(1500, 2000, 1400), c(1000, 1700, 1300), names(df1)[1:3]),
~ df1 %>%
filter(E >= ..2 & E <= ..1) %>%
top_n(n=2, wt = !! rlang::sym(..3)) %>%
arrange_at(..3, funs(desc(.))) %>%
mutate(condition = paste0("-cond", ..3))) %>%
bind_rows %>%
distinct %>%
spread(condition, condition)
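For anyone unfamiliar with the ..1, ..2, ..3 placeholders: inside a purrr formula they refer to the first, second and third element taken from each parallel list. A minimal sketch with made-up vectors (not the data above) to show the mechanics:

```r
library(purrr)

# Three parallel vectors; pmap_chr() walks them element by element
# and returns a character vector.
# ..1, ..2 and ..3 are the positional arguments of the formula.
out <- pmap_chr(
  list(c(1, 2), c(10, 20), c("a", "b")),
  ~ paste0(..3, ..1 + ..2)
)
out
```

With map2 the same idea is limited to two inputs, .x and .y, which is why pmap is needed once a third varying argument comes in.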
We could use map2 from purrr to loop over the upper bound of the <= condition, which changes, and over the wt argument, which takes the column names (based on the OP's code).
library(purrr)
library(dplyr)
library(tidyr)
map2(c(1500, 2000, 1400), names(df1)[1:3],
~ df1 %>%
filter(E >= 1000 & E <= .x) %>%
top_n(n=2, wt = !! rlang::sym(.y)) %>%
arrange_at(.y, funs(desc(.))) %>%
mutate(condition = paste0("-cond", .y))) %>%
bind_rows %>%
distinct %>%
spread(condition, condition)
# A B C D E -condA -condB -condC
#1 5 2016-12-02 26 no 1024 <NA> -condB <NA>
#2 8 2016-01-19 22 no 1215 <NA> -condB <NA>
#3 9 2009-11-30 40 yes 1315 -condA <NA> -condC
#4 10 2008-03-17 85 yes 1229 -condA <NA> -condC
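As a side note, funs() has been deprecated since dplyr 0.8.0, and top_n(), arrange_at() and spread() are superseded in newer dplyr/tidyr releases. Below is a rough sketch of the same idea with slice_max() and pivot_wider(), assuming dplyr >= 1.0.0 and tidyr >= 1.0.0. The data frame df, the conds list and the flag column are made up for illustration; they are not the OP's df1.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical toy data, hand-written so the result does not depend on the RNG
df <- data.frame(
  A = c(1, 2, 3, 4, 5),
  C = c(10, 50, 40, 20, 30),
  E = c(1100, 1200, 1300, 1450, 1900)
)

# One entry per condition: upper bound on E and the column to rank by
conds <- list(
  upper = c(1500, 1400),
  col   = c("A", "C")
)

res <- pmap(conds, function(upper, col) {
  df %>%
    filter(E >= 1000, E <= upper) %>%
    slice_max(.data[[col]], n = 2) %>%        # top 2 rows by the named column
    mutate(condition = paste0("cond_", col))
}) %>%
  bind_rows() %>%
  mutate(flag = condition) %>%                # copy so the label can fill the cells
  pivot_wider(names_from = condition, values_from = flag)

res
```

Rows that satisfy more than one condition end up on a single line with several flag columns filled in, as in the spread() output above. Note that slice_max() keeps ties by default, so it can return more than n rows, like top_n().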