Filtering duplicated rows with non-matching variable values in R

Question

I am trying to filter rows that have duplicated ids: among the duplicates, I need to keep the non-matching ones.

Here is the sample dataset.

df <- data.frame(
         id =  c(1,2,2,3,4,5,5,6),
         cat = c(3,3,4,5,2,2,1,5),
  actual.cat = c(3,4,4,5,2,1,1,7))

> df
  id cat actual.cat
1  1   3          3
2  2   3          4
3  2   4          4
4  3   5          5
5  4   2          2
6  5   2          1
7  5   1          1
8  6   5          7

Each id has a cat and an actual.cat value. When an id is duplicated, I need to keep only the row where cat and actual.cat do not match. Rows whose id appears only once should stay as they are.

Here is what I need.

> df
  id cat actual.cat
   1   3          3
   2   3          4
   3   5          5
   4   2          2
   5   2          1
   6   5          7

Any ideas on this?

Thanks!

Answer 1

We can group by id and then filter.

library(dplyr)
df %>%
  group_by(id) %>%
  # duplicated ids keep only their mismatching rows; unique ids are kept as-is
  filter((n() > 1 & cat != actual.cat) | n() == 1)

Output:

# A tibble: 6 x 3
# Groups:   id [6]
#     id   cat actual.cat
#  <dbl> <dbl>      <dbl>
#1     1     3          3
#2     2     3          4
#3     3     5          5
#4     4     2          2
#5     5     2          1
#6     6     5          7
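
The grouping inside that filter condition matters: in R, `&` binds more tightly than `|`, so the condition reads as "(duplicated and mismatching) or unique" with or without explicit parentheses. A minimal base R sketch of the same logic, using ave() as a stand-in for dplyr's n() (the names size, implicit, and explicit are illustrative only):

```r
df <- data.frame(
  id         = c(1, 2, 2, 3, 4, 5, 5, 6),
  cat        = c(3, 3, 4, 5, 2, 2, 1, 5),
  actual.cat = c(3, 4, 4, 5, 2, 1, 1, 7)
)

# Number of rows sharing each row's id (stand-in for n() within a group)
size <- ave(df$id, df$id, FUN = length)

# Without parentheses, & is evaluated before |
implicit <- size > 1 & df$cat != df$actual.cat | size == 1

# The same condition with the grouping spelled out
explicit <- (size > 1 & df$cat != df$actual.cat) | size == 1

identical(implicit, explicit)
df[explicit, ]
```

Writing the parentheses out is optional here, but it makes the intent easier to read.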

Or, using base R:

subset(df, (id %in% names(which(table(id) > 1)) & cat != actual.cat) |
     id %in% names(which(table(id) == 1)))
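
The table()-based condition compares numeric ids against character names. As an alternative sketch (not part of the answer above), duplicated() applied in both directions flags every row of a repeated id directly; the names dup and result are illustrative:

```r
df <- data.frame(
  id         = c(1, 2, 2, 3, 4, 5, 5, 6),
  cat        = c(3, 3, 4, 5, 2, 2, 1, 5),
  actual.cat = c(3, 4, 4, 5, 2, 1, 1, 7)
)

# TRUE for every row whose id occurs more than once
dup <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)

# Keep rows with a unique id, plus the mismatching rows of duplicated ids
result <- df[!dup | df$cat != df$actual.cat, ]
result
```

This assumes the same rule as the question: a duplicated id keeps only rows where cat and actual.cat differ.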

Answer 2

In base R, you can use subset with ave to select, within each id, the rows where the group has only one row or cat is not equal to actual.cat.

subset(df, ave(cat != actual.cat, id, FUN = function(x) length(x) == 1 | x))

#  id cat actual.cat
#1  1   3          3
#2  2   3          4
#4  3   5          5
#5  4   2          2
#6  5   2          1
#8  6   5          7
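
To see what the ave() call is doing, the following sketch applies the same function to each id group by hand. Here split(), lapply(), and unsplit() are a manual stand-in for the grouping that ave() performs, and the names mismatch, per_group, and keep are illustrative:

```r
df <- data.frame(
  id         = c(1, 2, 2, 3, 4, 5, 5, 6),
  cat        = c(3, 3, 4, 5, 2, 2, 1, 5),
  actual.cat = c(3, 4, 4, 5, 2, 1, 1, 7)
)

# Row-wise comparison: TRUE where cat differs from actual.cat
mismatch <- df$cat != df$actual.cat

# Apply the answer's function to each id group separately
per_group <- lapply(split(mismatch, df$id), function(x) length(x) == 1 | x)

# Inspect the two rows belonging to id 2
per_group[["2"]]

# unsplit() puts the group results back into the original row order
keep <- unsplit(per_group, df$id)
df[keep, ]
```

For a single-row group, length(x) == 1 is TRUE, so the row is kept whatever its values are; for larger groups the result is just the mismatch vector itself.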

You can also write this logic in data.table:

library(data.table)
setDT(df)[, .SD[.N == 1 | cat != actual.cat], id]
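
Note that setDT() converts df to a data.table by reference, so df itself changes class. If that is not wanted, a possible variant is to work on a copy made with as.data.table() and select rows through the .I row-index symbol. This is a sketch under the same filtering rule, and the names dt, idx, and res are illustrative:

```r
library(data.table)

df <- data.frame(
  id         = c(1, 2, 2, 3, 4, 5, 5, 6),
  cat        = c(3, 3, 4, 5, 2, 2, 1, 5),
  actual.cat = c(3, 4, 4, 5, 2, 1, 1, 7)
)

# Copy into a data.table; df stays a plain data.frame
dt <- as.data.table(df)

# .I holds the row numbers of the current group, .N its size
idx <- dt[, .I[.N == 1 | cat != actual.cat], by = id]$V1

# Subset the table by the collected row numbers
res <- dt[idx]
res
```

Collecting row indices first and subsetting once is a common data.table idiom that avoids building .SD for every group.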

2 answers

Answer 1: 1, accepted, 2020-11-02 22:39:46

Answer 2: 1, 2020-11-03 07:25:52