Subset a data frame with dplyr and conditions
I have a data frame such as:
Groups Name names2 Category value
G1 A habit1 cat1 20
G1 A habit2 NA 1
G1 B habit3 NA 100
G1 B habit4 cat3 23
G2 A habit5 cat4 32
G2 C habit6 NA 100
G2 C habit7 cat2 21
G2 D habit8 cat3 34
G2 D habit9 cat5 43
For each Groups and each Name, I would like to keep only one row, to get:
Groups Name names2 Category value
G1 A habit1 cat1 20
G1 B habit4 cat3 23
G2 A habit5 cat4 32
G2 C habit7 cat2 21
G2 D habit9 cat5 43
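Within each Groups/Name pair, the row that wins is the one that has information in Category (not NA). If every row has that information, the one with the highest value wins (as with the two G2-D rows: 43 wins because 43 > 34).
If there are only NA rows, then keep the row with the highest value.
Thanks for your help.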
What you need is group_by with filter, then top_n:
library(dplyr)
my.df %>%
group_by(Groups, Name) %>%
filter(!is.na(Category)) %>%
top_n(1, value)
# A tibble: 5 x 5
# Groups: Groups, Name [5]
# Groups Name names2 Category value
# <chr> <chr> <chr> <chr> <int>
# 1 G1 A habit1 cat1 20
# 2 G1 B habit4 cat3 23
# 3 G2 A habit5 cat4 32
# 4 G2 C habit7 cat2 21
# 5 G2 D habit9 cat5 43
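However, this will drop any Groups/Name combination in which Category is missing for every row, and if several rows tie for the maximum value, it will keep all of them.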
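If those two caveats matter, one possible variation is sketched below. It is only a sketch, not part of the original answer: it assumes dplyr >= 1.0.0 (for slice_max), and toy.df is a made-up data frame with an all-NA pair (G2/B) and a tie (G3/C), which the question's data does not contain.

```r
library(dplyr)

# Hypothetical data: G2/B has only NA categories, G3/C has tied values
toy.df <- data.frame(
  Groups   = c("G1", "G1", "G2", "G2", "G3", "G3"),
  Name     = c("A", "A", "B", "B", "C", "C"),
  names2   = paste0("habit", 1:6),
  Category = c("cat1", NA, NA, NA, "cat2", "cat3"),
  value    = c(20L, 100L, 5L, 9L, 7L, 7L),
  stringsAsFactors = FALSE
)

result <- toy.df %>%
  group_by(Groups, Name) %>%
  # keep non-NA rows, unless the whole group is NA (then keep them all)
  filter(!is.na(Category) | all(is.na(Category))) %>%
  # highest value per group; with_ties = FALSE returns a single row
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

print(result)
```

Here G1/A keeps habit1 (the only non-NA row, despite its lower value), G2/B falls back to habit4 (the highest value among NA rows), and G3/C keeps just one of the two tied rows.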
Data
my.df <- structure(list(Groups = c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2"),
Name = c("A", "A", "B", "B", "A", "C", "C", "D", "D"),
names2 = c("habit1", "habit2", "habit3", "habit4", "habit5", "habit6", "habit7", "habit8", "habit9"),
Category = c("cat1", NA, NA, "cat3", "cat4", NA, "cat2", "cat3", "cat5"),
value = c(20L, 1L, 100L, 23L, 32L, 100L, 21L, 34L, 43L)),
class = "data.frame", row.names = c(NA, -9L))