Filter duplicated rows that have non-matching variable values in R
I am trying to filter a data frame in which some ids are duplicated. For each duplicated id, I need to keep only the row whose values do not match.
Here is the sample dataset.
df <- data.frame(
id = c(1,2,2,3,4,5,5,6),
cat = c(3,3,4,5,2,2,1,5),
actual.cat = c(3,4,4,5,2,1,1,7))
> df
id cat actual.cat
1 1 3 3
2 2 3 4
3 2 4 4
4 3 5 5
5 4 2 2
6 5 2 1
7 5 1 1
8 6 5 7
So, each id has cat and actual.cat. When there is a duplicated id, I need to keep the row where cat and actual.cat do not match.
Here is what I need.
> df
id cat actual.cat
1 3 3
2 3 4
3 5 5
4 2 2
5 2 1
6 5 7
Any ideas on this?
Thanks!
We can group by id and then filter:
library(dplyr)
df %>%
group_by(id) %>%
filter(n() > 1 & cat != actual.cat|n() == 1)
Output:
# A tibble: 6 x 3
# Groups: id [6]
# id cat actual.cat
# <dbl> <dbl> <dbl>
#1 1 3 3
#2 2 3 4
#3 3 5 5
#4 4 2 2
#5 5 2 1
#6 6 5 7
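Within a group, `n() > 1` and `n() == 1` are complementary, so the condition above can probably be shortened without changing the result. A minimal sketch, assuming dplyr is installed; `res` is just an illustrative name, and the trailing `ungroup()` is an optional addition that drops the grouping from the result:

```r
library(dplyr)

df <- data.frame(
  id = c(1,2,2,3,4,5,5,6),
  cat = c(3,3,4,5,2,2,1,5),
  actual.cat = c(3,4,4,5,2,1,1,7))

# Keep a row if its id occurs only once,
# or if cat and actual.cat disagree on that row
res <- df %>%
  group_by(id) %>%
  filter(n() == 1 | cat != actual.cat) %>%
  ungroup()
```

Note that under this rule a duplicated id whose rows all match would disappear from the result entirely, and a duplicated id with several non-matching rows would keep all of them. Neither case occurs in the sample data.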
Or using base R:
subset(df, id %in% names(which(table(id) > 1)) &
cat != actual.cat| id %in% names(which(table(id) == 1)))
In base R, you can use subset with ave to select the rows in each id group where either the group has only one row or cat is not equal to actual.cat.
subset(df, ave(cat != actual.cat, id, FUN = function(x) length(x) == 1 | x))
# id cat actual.cat
#1 1 3 3
#2 2 3 4
#4 3 5 5
#5 4 2 2
#6 5 2 1
#8 6 5 7
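If the anonymous function inside ave is hard to read, the same rule can be spelled out in two steps with duplicated. This is only a sketch of an alternative, not the approach used above; `dup_ids`, `is_single` and `res` are illustrative names:

```r
df <- data.frame(
  id = c(1,2,2,3,4,5,5,6),
  cat = c(3,3,4,5,2,2,1,5),
  actual.cat = c(3,4,4,5,2,1,1,7))

# ids that occur more than once
dup_ids <- unique(df$id[duplicated(df$id)])

# TRUE for rows whose id occurs exactly once
is_single <- !(df$id %in% dup_ids)

# Keep rows with a unique id, or rows where cat and actual.cat differ
res <- df[is_single | df$cat != df$actual.cat, ]
```

For the sample data this should keep the rows with original row names 1, 2, 4, 5, 6 and 8, the same rows as the subset/ave call.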
You can also write this logic in data.table:
library(data.table)
setDT(df)[, .SD[.N == 1 | cat != actual.cat], id]
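One thing to be aware of: setDT converts df to a data.table by reference, so df itself is changed. If you would rather leave the original data frame untouched, a variant along these lines should work, assuming data.table is installed (`res` is an illustrative name):

```r
library(data.table)

df <- data.frame(
  id = c(1,2,2,3,4,5,5,6),
  cat = c(3,3,4,5,2,2,1,5),
  actual.cat = c(3,4,4,5,2,1,1,7))

# as.data.table() returns a copy, so df stays a plain data.frame.
# .N is the number of rows in the current id group and
# .SD is the subset of the data for that group.
res <- as.data.table(df)[, .SD[.N == 1 | cat != actual.cat], by = id]
```

Here `.N == 1` is a single value per group that is recycled across the rows of `.SD`, which is why it can be combined with the row-wise comparison `cat != actual.cat`.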