Filter a dataset having duplicated rows in R
I need to filter a dataset based on two conditions.
Here is what my dataset looks like:
df <- data.frame(
id = c(1,2,2,3,3,4,5,5),
district = c(10,10,11,12,12,13,14,15),
value = c(10.2, 10.8, 10.8, 7.5, 9.3, 6, 7.0, 7.0))
> df
id district value
1 1 10 10.2
2 2 10 10.8
3 2 11 10.8
4 3 12 7.5
5 3 12 9.3
6 4 13 6.0
7 5 14 7.0
8 5 15 7.0
I have duplicated rows based on id. To keep the desired row, two rules apply:
First, for ids having multiple districts but the same value, I need to keep the first row.
Second, for ids having multiple values but from the same district, I need the row with the max value.
So the desired filtered dataset is:
> df
id district value
1 1 10 10.2
2 2 10 10.8
3 3 12 9.3
4 4 13 6.0
5 5 14 7.0
So far I have only been able to locate the duplicated ids:
df[duplicated(df$id),]
Does anyone have any ideas? Thanks
With dplyr:
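library(dplyr)
df %>%
  group_by(id) %>%
  # arrange() ignores grouping by default, so this sorts the whole table by value
  arrange(desc(value)) %>%
  # slice() respects grouping: keep the first (largest-value) row of each id
  slice(1)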
# # A tibble: 5 x 3
# # Groups: id [5]
# id district value
# <dbl> <dbl> <dbl>
# 1 1 10 10.2
# 2 2 10 10.8
# 3 3 12 9.3
# 4 4 13 6
# 5 5 14 7
There's no real need to treat the two cases separately (taking the max value when an id has multiple values, and keeping the first row when the values are duplicated). If we order the data descending by value and keep the first row in each id group, one piece of logic accomplishes both tasks.
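One caveat with the single-sort approach: it returns the overall max per id, which matches the two stated rules only if no id has both several districts and several values. Here is a minimal base R sketch to check that assumption on the question's data. The names per_id and ambiguous are illustrative and do not come from the original posts.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 2, 2, 3, 3, 4, 5, 5),
  district = c(10, 10, 11, 12, 12, 13, 14, 15),
  value = c(10.2, 10.8, 10.8, 7.5, 9.3, 6, 7.0, 7.0)
)

# Count distinct districts and distinct values within each id
count_unique <- function(x) length(unique(x))
per_id <- data.frame(
  id = sort(unique(df$id)),
  n_district = as.vector(tapply(df$district, df$id, count_unique)),
  n_value = as.vector(tapply(df$value, df$id, count_unique))
)

# ids covered by neither rule: several districts AND several values
ambiguous <- per_id$id[per_id$n_district > 1 & per_id$n_value > 1]

print(per_id)
print(ambiguous)
```

If ambiguous is empty, as it should be for this data, sorting by descending value and keeping the first row per id satisfies both rules.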
library(dplyr)
df %>%
arrange(id, -value) %>%
distinct(id, district, .keep_all = TRUE) %>%
distinct(id, value, .keep_all = TRUE)
id district value
1 1 10 10.2
2 2 10 10.8
3 3 12 9.3
4 4 13 6.0
5 5 14 7.0
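First we sort by id and by descending value, then we use the distinct function twice. The first call keeps one row per (id, district) combination, which is the row with the largest value in that district. The second call keeps one row per (id, value) combination, which is the first district among rows sharing a value.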
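To see what each de-duplication step removes, the same two-step logic can be sketched in base R, using duplicated() on column subsets in place of distinct(). This is an illustrative equivalent rather than the answer's own code, and the names sorted, step1 and step2 are made up for the example.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 2, 2, 3, 3, 4, 5, 5),
  district = c(10, 10, 11, 12, 12, 13, 14, 15),
  value = c(10.2, 10.8, 10.8, 7.5, 9.3, 6, 7.0, 7.0)
)

# Sort by id, then by descending value
sorted <- df[order(df$id, -df$value), ]

# Step 1: keep the first row per (id, district) pair,
# i.e. the largest value within each district
step1 <- sorted[!duplicated(sorted[c("id", "district")]), ]

# Step 2: keep the first row per (id, value) pair,
# i.e. the first district among rows that share a value
step2 <- step1[!duplicated(step1[c("id", "value")]), ]

print(step1)
print(step2)
```

On this data, step 1 drops only the lower-valued row for id 3, and step 2 drops the second district for ids 2 and 5.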
In base R, we can use duplicated after ordering the rows:
df1 <- df[order(df$id, -df$value),]
df1[!duplicated(df1$id),]
# id district value
#1 1 10 10.2
#2 2 10 10.8
#5 3 12 9.3
#6 4 13 6.0
#7 5 14 7.0
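Negating the column with -df$value only works when value is numeric. As a hedged variant, order() also accepts a per-column decreasing vector when method = "radix" is used, which avoids the negation. The sketch below assumes the same example data; the names df2 and result are illustrative.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 2, 2, 3, 3, 4, 5, 5),
  district = c(10, 10, 11, 12, 12, 13, 14, 15),
  value = c(10.2, 10.8, 10.8, 7.5, 9.3, 6, 7.0, 7.0)
)

# Ascending id, descending value; a vector for 'decreasing'
# is only supported with method = "radix"
df2 <- df[order(df$id, df$value,
                decreasing = c(FALSE, TRUE),
                method = "radix"), ]

# Keep the first row of each id after sorting
result <- df2[!duplicated(df2$id), ]
print(result)
```

This variant should give the same ids and values as the version that negates value, and it extends to columns that cannot be negated.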